Shivani Sharma — August 29, 2021
Advanced Computer Vision Image Programming Project Python

This article was published as a part of the Data Science Blogathon

Introduction

Hello everyone! While cyberpunk has not yet entered our lives that much, and neuro interfaces are far from ideal, LiDAR can become the first stage on the path to the future of manipulators. Therefore, in order not to get bored during the holidays, I decided to fantasize a little about the controls of a computer and, presumably, any device, up to an excavator, spaceship, drone, or stove.

The main idea is to move the mouse, moving not the whole hand, but only the index finger, which will allow you to run through the menu without taking your hands off the keyboard, press buttons, and, together with hotkeys, turn into a real keyboard ninja! What happens if you add swiping or scrolling gestures? I think there will be a bomb! But until this moment we still have to wait a couple of years)

Let’s start assembling our prototype of the manipulator of the future

 

What you need:

  1. Camera with LiDAR Intel Realsense L515.

  2. Ability to program in python

  3. Just a little remember school mathematics

  4. Mount for the camera on the monitor aka tripod

We attach the camera to a tripod with aliexpress, it turned out to be very convenient, lightweight and cheap)

image | Hand Pose Estimation LiDAR

SOURCE

image 2 | Hand Pose Estimation LiDAR

SOURCE

We figure out how and on what to make a prototype

There are many approaches to accomplishing this task. You can train the detector or hand segmentation yourself, cut out the resulting image of the right hand and then apply this wonderful repository from Facebook research to the image, get an excellent result or make it even easier.

To use the media pipe repository, after reading this link, you can understand that this is one of the best options for today.

Firstly, everything is already there out of the box – installation and launch will take 30 minutes, taking into account all the prerequisites.

Secondly, thanks to a powerful development team, they not only take the State Of Art in Hand Pose Estimation but also provide an easy-to-understand API.

Third, the network is ready to run on the CPU, so the entry threshold is minimal.

Probably, you will ask why I didn’t come here and did not use the repositories of the winners of this competition. In fact, I studied their solution in some detail, they are quite prod-ready, no stacks of millions of grids, etc. But the biggest problem, it seems to me, is that they work with depth images. Since these are academics, they did not hesitate to convert all the data through Matlab, in addition, the resolution in which the depths were filmed seemed small to me. This could have a profound effect on the result. Therefore, it seems that the easiest way is to get the key points in the RGB picture and take the value along the Z-axis in the Depth Frame by the XY coordinates. Now the task is not to optimize something much, so we will do it as it is faster from a development point of view.

Remembering school mathematics

As I already wrote, in order to get the coordinate of the point where the mouse cursor should be, we need to build a line passing through two key points of the phalanx of the finger, and find the point of intersection of the line and the plane of the monitor.

image plane | Hand Pose Estimation LiDAR

SOURCE

The picture shows schematically the plane of the monitor and the line that intersects it. You can look at the mathematics here.

Using two points, we get a parametric representation of a straight line in space.

parametric representation | Hand Pose Estimation LiDAR
variables

I will not focus too much on the school mathematics curriculum.

Installing a library for working with a camera

This is perhaps the hardest part of this job. As it turned out, the software for the camera for Ubuntu is very crude, liberal sense is simply littered with all sorts of bugs, glitches, and dances with a tambourine.

Until now, I have not been able to defeat the strange behaviour of the camera, sometimes it does not load parameters at startup.

The camera works only once after restarting the computer !!! But there is a solution: before each launch, do a software hard reset of the camera, resets the USB, and maybe everything will be fine. By the way, for Windows 10 everything is fine there. It’s strange that the developers imagine robots based on Windows =)

To get real sense under Ubuntu 20, do this:

$ sudo apt-get install libusb-1.0-0-dev
Then rerun cmake and make install. Here is a complete recipe that worked for me:
$ sudo apt-get install libusb-1.0-0-dev
$ git clone https://github.com/IntelRealSense/librealsense.git
$ cd librealsense/
$ mkdir build && cd build

Having collected from sorts, it will be more or less stable. A month of communication with technical support revealed that you need to install Ubuntu 16 or suffer. I chose it yourself, you know what.

We continue to understand the intricacies of the neural network

 

Now let’s see another video of the finger-and-mouse operation. Please note that the pointer cannot stand in one place and, as it were, floats around the intended point. At the same time, I can easily direct it to the word I need, but with a letter, it is more difficult, I have to carefully move the cursor:

This, as you understand, is not shaking my hands, on holidays I drank only one mug of New England DIPA =) It’s all about constant fluctuations of key points and Z-coordinates based on the values ​​obtained from the lidar.

Let’s take a closer look:

In our SOTA from media pipe, there are certainly fewer fluctuations, but they also exist. As it turned out, they are struggling with this by using prokid vaniya from the past frame heatmap in the current frame and train network – it gives more stability, but not 100%.

Also, it seems to me, the specificity of the markup plays a role. It is hardly possible to make the same markup on such a number of frames, not to mention the fact that the resolution of the frame is different everywhere and not very large. Also, we do not see the flickering of light, which, most likely, is not constant due to the different periods of operation and the amount of exposure of the camera. And the network also returns a sandwich from the heatmap equal to the number of key points on the screen, the size of this tensor is BxNx96x96, where N is the number of key points, and, of course, after threshold and resize to the original frame size, we get what we get (

Example of rendering heatmap:

hand pose estimation | Hand Pose Estimation LiDAR

SOURCE

Code Review

All the code is in this repository and is very short. Let’s take a look at the main file and see the rest for yourself.

import cv2
import mediapipe as mp
import numpy as np
import pyautogui
import pyrealsense2.pyrealsense2 as rs
from google.protobuf.json_format import MessageToDict
from mediapipe.python.solutions.drawing_utils import _normalized_to_pixel_coordinates
from pynput import keyboard
from utils.common import get_filtered_values, draw_cam_out, get_right_index
from utils.hard_reset import hardware_reset
from utils.set_options import set_short_range
pyautogui.FAILSAFE = False
mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands
# Hand Pose Estimation
hands = mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.9) 
def on_press(key):
if key == keyboard.Key.ctrl:
        pyautogui.leftClick()
if key == keyboard.Key.alt:
        pyautogui.rightClick()
def get_color_depth(pipeline, align, colorizer):
    frames = pipeline.wait_for_frames(timeout_ms=15000) # waiting for a frame from the camera
    aligned_frames = align.process(frames) 
    depth_frame = aligned_frames.get_depth_frame()
    color_frame = aligned_frames.get_color_frame()
if not depth_frame or not color_frame:
return None, None, None
    depth_ima = np.asanyarray(depth_frame.get_data())
    depth_col_img = np.asanyarray(colorizer.colorize(depth_frame).get_data())
    color_image = np.asanyarray(color_frame.get_data())
    depth_col_img = cv2.cvtColor(cv2.flip(cv2.flip(depth_col_img, 1), 0), cv2.COLOR_BGR2RGB)
    color_img = cv2.cvtColor(cv2.flip(cv2.flip(color_img, 1), 0), cv2.COLOR_BGR2RGB)
    depth_img = np.flipud(np.fliplr(depth_img))
    depth_col_img = cv2.resize(depth_col_img, (1280 * 2, 720 * 2))
    col_img = cv2.resize(col_img, (1280 * 2, 720 * 2))
    depth_img = cv2.resize(depth_img, (1280 * 2, 720 * 2))
return color_image, depth_color_image, depth_image
def get_right_hand_coords(color_image, depth_color_image):
    color_image.flags.writeable = False
    results = hands.process(color_image)
    color_image.flags.writeable = True
    color_image = cv2.cvtColor(color_image, cv2.COLOR_RGB2BGR)
    handedness_dict = []
    idx_to_coordinates = {}
    xy0, xy1 = None, None
if results.multi_hand_landmarks:
for idx, hand_handedness in enumerate(results.multi_handedness):
            handedness_dict.append(MessageToDict(hand_handedness))
        right_hand_index = get_right_index(handedness_dict)
if right_hand_index != -1:
for i, landmark_list in enumerate(results.multi_hand_landmarks):
if i == right_hand_index:
                    image_rows, image_cols, _ = color_image.shape
for idx, landmark in enumerate(landmark_list.landmark):
                        landmark_px = _normalized_to_pixel_coordinates(landmark.x, landmark.y,
                                                                       image_cols, image_rows)
if landmark_px:
                            idx_to_coordinates[idx] = landmark_px
for i, landmark_px in enumerate(idx_to_coordinates.values()):
if i == 5:
                    xy0 = landmark_px
if i == 7:
                    xy1 = landmark_px
break
return col_img, depth_col_img, xy0, xy1, idx_to_coordinates
def start():
    pipeline = rs.pipeline() # initialize librealsense
    config = rs.config() 
    print("Start load conf")
    config.enable_stream(rs.stream.depth, 1024, 768, rs.format.z16, 30)
    config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
    profile = pipeline.start(config) 
depth_sensor = profile.get_device (). first_depth_sensor ()
    set_short_range (depth_sensor) # load parameters for working at a short distance
    colorizer = rs.colorizer ()
    print ("Conf loaded")
    align_to = rs.stream.color
    align = rs.align (align_to) # combine depth map and color image
    try:
        while True:
            col_img, depth_col_img, depth_img = get_col_depth (pipelin, align, colorize)
            if color_img is None and color_img is None and color_img is None:
                continue
            color_img, depth_col_img, xy00, xy11, idx_to_coordinates = get_right_hand_coords (col_img,
                                                                                                 depth_col_img)
            if xy00 is not None or xy11 is not None:
                z_val_f, z_val_s, m_xy, c_xy, xy00_f, xy11_f, x, y, z = get_filtered_values ​​(depth_img, xy00, xy11)
                pyautogui.moveTo (int (x), int (3500 - z)) # 3500 hardcode specific to my monitor
                if draw_cam_out (col_img, depth_col_img, xy00_f, xy11_f, c_xy, m_xy):
                    break
    finally:
        hands.close ()
        pipeline.stop ()
hardware_reset () # reboot the camera and wait for it to appear
listener = keyboard.Listener (on_press = on_press) # set a listener for keyboard button presses
listener.start ()
start () # start the program

I didn’t use classes or streams, because, for such a simple case, it is enough to execute everything in the main thread in an endless while loop.

At the very beginning, the media pipe, the camera are initialized, the camera settings for the short-range and auxiliary variables are loaded. Next comes the magic called “alight depth to color” – this function matches each point from the RGB image, a point on the Depth Frame, that is, it gives us the opportunity to get the XY coordinates, the Z value. It is understood that it is necessary to calibrate on your monitor … I deliberately did not pull out these parameters separately, so that the reader who decided to run the code would do it himself, at the same time it will be reused in the code)

Next, we take from the entire prediction only points numbered 5 and 7 of the right hand.

hand points

The only thing left to do is to filter the obtained coordinates using a moving average. It was possible, of course, to apply more serious filtering algorithms, but after looking at their visualization and pulling various levers, it became clear that a moving average with a depth of 5 frames would be enough for the demo, I want to note that for XY, 2-3 frames were enough. but things are worse with Z.

deque_l = 5
x0_d = collections.deque(deque_l * [0.], deque_l)
y0_d = collections.deque(deque_l * [0.], deque_l)
x1_d = collections.deque(deque_l * [0.], deque_l)
y1_d = collections.deque(deque_l * [0.], deque_l)
z_val_f_d = collections.deque(deque_l * [0.], deque_l)
z_val_s_d = collections.deque(deque_l * [0.], deque_l)
m_xy_d = collections.deque(deque_l * [0.], deque_l)
c_xy_d = collections.deque(deque_l * [0.], deque_l)
x_d = collections.deque(deque_l * [0.], deque_l)
y_d = collections.deque(deque_l * [0.], deque_l)
z_d = collections.deque(deque_l * [0.], deque_l)
def get_filtered_values(depth_image, xy0, xy1):
global x0_d, y0_d, x1_d, y1_d, m_xy_d, c_xy_d, z_val_f_d, z_val_s_d, x_d, y_d, z_d
    x0_d.append(float(xy0[1]))
    x0_f = round(mean(x0_d))
    y0_d.append(float(xy0[0]))
    y0_f = round(mean(y0_d))
    x1_d.append(float(xy1[1]))
    x1_f = round(mean(x1_d))
    y1_d.append(float(xy1[0]))
    y1_f = round(mean(y1_d))
    z_val_f = get_area_mean_z_val(depth_image, x0_f, y0_f)
    z_val_f_d.append(float(z_val_f))
    z_val_f = mean(z_val_f_d)
    z_val_s = get_area_mean_z_val(depth_image, x1_f, y1_f)
    z_val_s_d.append(float(z_val_s))
    z_val_s = mean(z_val_s_d)
    points = [(y0_f, x0_f), (y1_f, x1_f)]
    x_coords, y_coords = zip(*points)
    A = np.vstack([x_coords, np.ones(len(x_coords))]).T
    m, c = lstsq(A, y_coords)[0]
    m_xy_d.append(float(m))
    m_xy = mean(m_xy_d)
    c_xy_d.append(float(c))
    c_xy = mean(c_xy_d)
    a0, a1, a2, a3 = equation_plane()
    x, y, z = line_plane_intersection(y0_f, x0_f, z_v_s, y1_f, x1_f, z_v_f, a0, a1, a2, a3)
    x_d.append(float(x))
    x = round(mean(x_d))
    y_d.append(float(y))
    y = round(mean(y_d))
    z_d.append(float(z))
    z = round(mean(z_d))
return z_v_f, z_v_s, m_xy, c_xy, (y00_f, x0_f), (y11_f, x1_f), x, y, z

We create a deque with a length of 5 frames and average everything in a row =) Additionally, we calculate y = mx + c, Ax + By + Cz + d = 0, the equation for the straight line is the ray in the RGB picture and the equation of the monitor plane, we get it y = 0.

Conclusion

Well, that’s all, we sawed down the simplest manipulator, which, even with its dramatically simple execution, can already be used, albeit with difficulty, in real life!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

About the Author

Our Top Authors

  • Analytics Vidhya
  • Guest Blog
  • Tavish Srivastava
  • Aishwarya Singh
  • Aniruddha Bhandari
  • Abhishek Sharma
  • Aarshay Jain

Download Analytics Vidhya App for the Latest blog/Article

Leave a Reply Your email address will not be published. Required fields are marked *