NVIDIA Open Sourced a Video-to-Video Translation Technique using PyTorch – and it is Super Impressive

Pranav Dar 07 May, 2019 • 3 min read

Overview

Researchers from NVIDIA have pioneered a novel approach that does video-to-video translation
They have released a PyTorch implementation of the technique on GitHub
The PyTorch code can be used for multiple scenarios, including generating human bodies from given poses!

Introduction

Progress in the field of deep learning and reinforcement learning relies on our capability to recreate the dynamics of real-world scenarios in a simulation environment. I have previously written about an algorithm that transforms images into a completely different category, and another technique that fixes corrupt images in the blink of an eye. Progress, at least in the image processing field, has been constant and promising.

But research in the area of video processing has been painstakingly difficult. For example, can you take a video sequence and predict what will happen in the next frame? It’s been explored, but not to any great avail. At least until now.

NVIDIA, already leading the way in using deep learning for image and video processing, has open sourced a technique that does video-to-video translation with impressive results. The goal of this research, as described by the researchers in their paper, is to learn a mapping function from a given input video in order to produce an output video which depicts the contents of the input video with incredible precision (as you can see in the above GIF).

They have released the code on GitHub, which is a PyTorch implementation of the technique for a high resolution translation of videos. This code can currently be used for:

Converting semantic labels into realistic real-world videos
Creating multiple outputs for synthesizing people talking from edge maps
Generating a human body from a given pose (not just the structure, but the entire body!)

The above image is a wonderful illustration of different models (or techniques) used to perform the same task. On the top left is the input source video. Adjacent to that is the pix2pixHD model, the state-of-the-art image-to-image translation approach. On the bottom left is the COVST model and on the bottom right is NVIDIA’s vid2vid technique.

You can browse through the below links to read more about this novel technique and even implement it on your own machine:

GitHub Repository (with Python code)
Multiple videos demonstrating this technique
Research paper (Work in progress – the final version is expected to be released imminently)

Also, be sure to check out the below video which encapsulates all that the open sourced PyTorch code can do:

Our take on this

If you were impressed with our last NVIDIA article on converting a standard video into slow-motion, this latest research will leave you stunned. And it’s not just limited to recreating real-world scenarios, it can even predict what will happen in the next few frames! When compared to baseline models like PredNet and MCNet, the vid2vid model produced far superior results.

There are still a few issues with the model like not being able to map a turning car, but these will be overcome in due course. If this field of research interests you, go through the research paper I linked above and also download the PyTorch code and try to replicate the technique on your own end.

Subscribe to AVBytes here to get regular data science, machine learning and AI updates in your inbox!

Pranav Dar 07 May 2019

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.

AVbytes