- Researchers have developed a model, called CRAFT, that generates cartoons based on text descriptions
- The final model was trained on more than 25,000 video clips, each three seconds long
- The results are mostly promising, but the output can turn into a mess when the model gets things wrong
Creating animated videos is a long and arduous process. Even with the introduction of computers and software into the animation industry, the task still takes a lot of time to accomplish. But with recent advances in AI, that time problem may be close to being solved.
Researchers from the University of Illinois and the Allen Institute for Artificial Intelligence have developed an AI model, called CRAFT (Composition, Retrieval, and Fusion Network), that takes text descriptions (or captions) from the user and generates scenes from ‘The Flintstones’ cartoon series. Unlike pixel-generation approaches, this model is based on text-to-entity segment retrieval from a video database.
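To make the retrieval idea concrete, here is a toy sketch of matching a caption against a database of annotated clips. This is purely illustrative: CRAFT learns joint text/video embeddings, whereas this stand-in scores simple word overlap, and all clip ids and annotations below are made up.

```python
# Toy text-to-clip retrieval: score each candidate clip's annotation
# against the query caption and return the best match.
# Word overlap is a stand-in for CRAFT's learned similarity.

def word_overlap(a, b):
    """Jaccard similarity between the word sets of two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def retrieve_clip(caption, database):
    """Return the id of the clip whose annotation best matches the caption."""
    return max(database, key=lambda clip_id: word_overlap(caption, database[clip_id]))

# Hypothetical annotated clip database
clips = {
    "clip_001": "fred talks to wilma in the living room",
    "clip_002": "barney drives a car down the street",
    "clip_003": "dino runs through the kitchen",
}

print(retrieve_clip("fred and wilma talk in the living room", clips))  # -> clip_001
```

In the real system, the retrieved result is a video entity segment rather than a whole clip, but the database-lookup flavor of the approach is the same.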
The final model was developed by training it on a set of more than 25,000 video clips, each three seconds and 75 frames long. As you can imagine, each video had to be labelled and annotated with which character(s) appeared in the scene and what the scene was about.
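A single training record might look something like the sketch below. The field names and example values are hypothetical, not the paper's actual annotation schema; the 75 frames over 3 seconds (i.e. 25 fps) come straight from the article.

```python
from dataclasses import dataclass

# Hypothetical shape of one annotated training clip.
# Field names are illustrative; the paper's real schema differs.

@dataclass
class AnnotatedClip:
    clip_id: str
    characters: list       # which character(s) appear in the scene
    description: str       # what the scene is about
    num_frames: int = 75   # 75 frames over 3 seconds = 25 fps
    duration_s: float = 3.0

clip = AnnotatedClip(
    clip_id="flintstones_00042",
    characters=["Fred", "Barney"],
    description="Fred and Barney are bowling at the alley",
)
print(clip.num_frames, clip.duration_s)  # 75 3.0
```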
The AI matches videos to the text descriptions and builds a set of parameters. CRAFT can convert the provided text descriptions into video clips of this animated series, featuring characters, props and locations, as it learned from the videos. It can not only put the characters into place, but also parse objects, retrieve the background, and so on.
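The composition step can be pictured as layering: retrieve a background for the caption, then place retrieved entity segments (characters, props) on top. The sketch below is a rough analogy under those assumptions; every name in it is made up, and CRAFT's actual fusion network is learned rather than hand-coded.

```python
# Hedged sketch of the compose step: background first, entities layered
# on top. Names are illustrative; the real fusion model is learned.

def compose_scene(caption, retrieve_background, retrieve_entities):
    """Build a scene as an ordered list of layers."""
    layers = [retrieve_background(caption)]
    layers.extend(retrieve_entities(caption))
    return layers

scene = compose_scene(
    "fred waves at barney",
    retrieve_background=lambda c: "background:living_room",
    retrieve_entities=lambda c: ["entity:fred", "entity:barney"],
)
print(scene)  # ['background:living_room', 'entity:fred', 'entity:barney']
```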
The results produced are still raw in nature. As you’ll see in the video posted below, the AI does get things right most of the time, but when it gets it wrong, the video looks like a mess. Safe to say this is a work in progress, albeit one with a massive amount of potential.
Below is a video that gives you a glimpse of how CRAFT works:
I would recommend reading their research paper here to gain a deeper understanding of the algorithm.
Our take on this
The Flintstones comes from an old-school animation line, with relatively static backgrounds. Animation has since come a long way – modern styles are far more dynamic in nature, and that will be a challenge for researchers in the video generation field going forward.
One thought is that by feeding the model more complex video frames, it could be made to adapt to the dynamic parts of the video as well. What do you think is required to improve this algorithm? Let us know in the comments section below!
Subscribe to AVBytes here to get regular data science, machine learning and AI updates in your inbox!