Berkeley Open Sources Largest Self-Driving Dataset Every Data Scientist Should Download NOW

Pranav Dar 10 May, 2019 • 3 min read


  • UC Berkeley has open sourced the world’s largest and most diverse self-driving dataset
  • It contains 100,000 video sequences, each approximately 40 seconds long and in 720p quality
  • This dataset is 800 times bigger than Baidu’s ApolloScape!



Self-driving cars are on the verge of transforming the way we travel. However, hiccups along the way have dampened the initial hype around this field. With the Andrew Ng-backed initiative, and now Berkeley’s latest release, the perception that autonomous vehicles are unsafe is giving way to positive developments.

UC Berkeley has open sourced the largest and most diverse self-driving dataset and made it available to the general public. It is called ‘BDD100K’ and comes with rich annotations. You can download it right now here.

As the name suggests, the dataset contains 100,000 video sequences. Each video sequence is about 40 seconds long and is in moderately high definition (720p and 30 frames per second). GPS information, recorded from mobile phones, is also available in these videos to illustrate the rough driving trajectories. These videos were collected from various locations in the United States.
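
To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It assumes every clip is exactly 40 seconds long at 30 frames per second; the figures above only say "about 40 seconds", so treat the results as rough estimates rather than official statistics.

```python
# Rough scale estimate from the figures quoted in the article.
NUM_VIDEOS = 100_000
SECONDS_PER_VIDEO = 40  # assumption: "about 40 seconds" taken as exactly 40
FPS = 30


def dataset_scale(num_videos, seconds_per_video, fps):
    """Return (total hours of video, total number of frames)."""
    total_seconds = num_videos * seconds_per_video
    total_hours = total_seconds / 3600
    total_frames = total_seconds * fps
    return total_hours, total_frames


hours, frames = dataset_scale(NUM_VIDEOS, SECONDS_PER_VIDEO, FPS)
print(f"~{hours:,.0f} hours of video, ~{frames:,} frames")
# prints: ~1,111 hours of video, ~120,000,000 frames
```

That works out to over a thousand hours of driving footage.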

What makes the dataset even richer to work with is the range of weather conditions it covers, such as sunny, overcast, rainy and hazy. There is also a good balance between daytime and nighttime scenarios. Lane markings in the annotated images are divided into two types to make them easily distinguishable.

The uses of this dataset extend beyond building self-driving cars: you can also use the data for detecting pedestrians on roads and pavements. The dataset has over 85,000 pedestrian instances, which makes it ideal for this exercise.
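
As a starting point for that kind of exploration, here is a minimal sketch that counts pedestrian annotations per weather condition. The record layout is an assumption made for illustration: a list of per-frame dictionaries with an `attributes.weather` field and a `labels` list whose entries carry a `category`. The field names, the `"person"` category value and the sample records are all hypothetical, so check them against the actual label files before relying on this.

```python
from collections import Counter


def count_category_by_weather(frames, category="person"):
    """Count annotations of one category, grouped by weather attribute.

    Assumes each frame is a dict shaped like
    {"attributes": {"weather": ...}, "labels": [{"category": ...}, ...]}.
    This schema is hypothetical; adjust the keys to the real label format.
    """
    counts = Counter()
    for frame in frames:
        weather = frame.get("attributes", {}).get("weather", "unknown")
        # "labels" may be missing or None for frames without annotations.
        for label in frame.get("labels") or []:
            if label.get("category") == category:
                counts[weather] += 1
    return counts


# Made-up sample records, only to demonstrate the function.
sample_frames = [
    {"name": "a.jpg", "attributes": {"weather": "rainy"},
     "labels": [{"category": "person"}, {"category": "car"}]},
    {"name": "b.jpg", "attributes": {"weather": "overcast"},
     "labels": [{"category": "person"}, {"category": "person"}]},
    {"name": "c.jpg", "attributes": {"weather": "rainy"}, "labels": None},
]

print(count_category_by_weather(sample_frames))
```

With real label files, you would load the records with `json.load` and pass the resulting list to the same function.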

Berkeley’s claim that this is the largest self-driving dataset ever released is not exaggerated in the slightest. Back in March, we saw Baidu release what was then the largest dataset in this domain. Berkeley’s release is 800 times larger than that. It’s 4,800 times bigger than Mapillary’s dataset and an incredible 8,000 times bigger than KITTI (let’s not even compare it to the size of Cityscapes!).


Our take on this

I personally think open sourcing datasets like these will massively help the autonomous driving field. At Analytics Vidhya, we have seen a few requests from people asking for self-driving data, so this release, coupled with Baidu’s ApolloScape, will go a long way in helping those data scientists.

You can even take part in three challenges set up by Berkeley for this data – Road Object Detection, Drivable Area Segmentation and Domain Adaptation of Semantic Segmentation. So not only do you have enough data to start working on building your own autonomous vehicle, you can even compare your progress with the best data scientists in this domain! What are you waiting for? Download the dataset now and get started!




Pranav Dar 10 May 2019

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.


Responses From Readers


Dmitry Kravchenko 04 Jun, 2018

It doesn't let me download. The link connects over HTTP, and the insecure connection isn't allowed. It doesn't allow connecting via HTTPS either.