- Researchers from UC Berkeley have pioneered a technique that transfers motion from a source video to a human subject in a different video
- Pose detection is used to estimate the source subject's pose, which is then mapped onto the target subject's appearance
- A few jittery motions still occur, but the study has shown a lot of potential and promise
There are certain developments that, every once in a while, reshape how I look at things. DeepMind's AlphaGo and NVIDIA's vid2vid approach are just two examples. And now a study released by UC Berkeley has joined this list.
One look at the video of the technique (shown at the end of the article) will show you exactly why. UC Berkeley's researchers have pioneered a method that transfers motion between human subjects in different videos (let that sink in for a few seconds).
The approach requires two videos – one of the target person whose appearance needs to be ‘synthesized’, and the other of the source subject whose dance poses are imposed on the target person. Pose detection is used to estimate the source subject’s movements, which are then mapped accordingly on the target’s appearance. Imagine how complex this is – the two humans are invariably of different shape and size, with different body movements.
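The mapping step can be sketched in simplified form: given 2D pose keypoints for the source and target subjects, rescale and shift the source pose so its height and ground position match the target's proportions. This is a minimal sketch of the general idea, not the paper's exact procedure, and all function and parameter names here are illustrative assumptions.

```python
import numpy as np

def normalize_pose(src_kp, src_ankle, src_head, tgt_ankle, tgt_head):
    """Map source 2D keypoints (N x 2 array of x, y) into the target's frame.

    Simplified sketch: match the two figures' vertical extents
    (head-to-ankle distance) and ground (ankle) positions, so a
    tall source pose fits a shorter target, and vice versa.
    """
    # Ratio of the target's height to the source's height (in pixels)
    scale = (tgt_ankle - tgt_head) / (src_ankle - src_head)
    # Translation that pins the scaled source ankle to the target ankle line
    translation = tgt_ankle - scale * src_ankle
    out = src_kp.astype(float).copy()
    out[:, 1] = scale * out[:, 1] + translation  # rescale vertical axis only
    return out

# Example: a source figure spanning y=50 (head) to y=250 (ankle) is
# mapped onto a target spanning y=100 to y=400.
kp = np.array([[100.0, 50.0], [100.0, 250.0]])  # head, ankle (x, y)
mapped = normalize_pose(kp, src_ankle=250.0, src_head=50.0,
                        tgt_ankle=400.0, tgt_head=100.0)
# head lands at y=100, ankle at y=400
```

In the actual paper this normalization is done per frame before the synthesis network renders the target's appearance on top of the adjusted pose.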
The above image illustrates this perfectly. The frame on the top left is the source subject doing the dance moves, the one below shows pose detection, and the right frame shows the motion translated to the target. The accuracy of the motions is incredible. And it’s not just one pose at a time – the poses change dynamically every fraction of a second and the technique does not waver.
Check out the aforementioned video below. Notice the sheer number of minute details that Berkeley's technique captures, such as wrinkles on the clothes and reflections in the glass:
Our take on this
I have gone through the video multiple times now and I'm still stunned by the complexity and accuracy of Berkeley's technique. Earlier this week I covered NVIDIA's vid2vid technique and thought that was game-changing, and here we are already, with the bar set even higher.
It wouldn't surprise you to know that GANs are at the heart of this technique. Do read through the research paper as well – it gives a step-by-step account of the approach the researchers took, and also includes a bunch of useful resources.
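For readers new to GANs, here is a rough illustration of the adversarial setup that drives such techniques: a discriminator scores frames as real or generated, and the generator is trained to fool it. This is the standard (non-saturating) GAN objective computed from scalar discriminator probabilities, not the paper's full loss; all function names are hypothetical.

```python
import math

def bce(score, label):
    """Binary cross-entropy for a single sigmoid probability."""
    eps = 1e-12  # guard against log(0)
    return -(label * math.log(score + eps)
             + (1 - label) * math.log(1 - score + eps))

def gan_losses(d_real, d_fake):
    """Compute discriminator and generator losses from the
    discriminator's probabilities on a real frame (d_real) and a
    generated frame (d_fake). Illustrative sketch only.
    """
    # Discriminator wants real frames scored 1 and fakes scored 0
    d_loss = bce(d_real, 1.0) + bce(d_fake, 0.0)
    # Generator wants its fakes scored 1 (i.e. to fool the discriminator)
    g_loss = bce(d_fake, 1.0)
    return d_loss, g_loss
```

As the generated frames become more convincing (d_fake rises), the generator's loss falls while the discriminator's loss rises – the tug-of-war that, at scale, produces the photorealistic results in the video above.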