Google Has Released the Latest Open Images Dataset! Every Data Scientist Should Work with This
- Open Images is a massive dataset which contains close to 9 million images
- All images come with labels that were prepared manually by professional annotators
- The dataset is divided into training (9 million+ images), validation (41k+ images), and test (125k+ images) sets
- Google has also announced an object detection challenge for data scientists
As a data scientist, finding large datasets to work with is a challenge. Most organizations treasure their data and prefer not to release it to the community. But Google has been one of the few that have consistently open sourced a lot of their research in order to speed up studies and help budding data scientists.
This week, they have released version 4 of their popular Open Images dataset – free and available for anyone to download and work with.
Open Images is a massive dataset of images which was released by Google back in 2016. The dataset consists of 9 million images that have already been labelled by the team. According to their site, “The training set of V4 contains 14.6M bounding boxes for 600 object classes on 1.74M images, making it the largest existing dataset with object location annotations”.
These annotations have been drawn manually by professional annotators in order to ensure accuracy and consistency. The subject matter in the images is diverse in nature. On average, there are 8.4 boxed objects per image across the 1.74M annotated training images. To add the icing on the cake, the data is annotated with image-level labels that span thousands of classes!
The Open Images dataset is pre-split into the training, validation and test sets. The training set contains 9,011,219 images, the validation set has 41,260 images and the test set has 125,436 images. All of these images come with proper labels to help you get down to building a model as quickly as possible.
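To put these numbers in perspective, here is a quick back-of-the-envelope sketch that uses only the figures quoted above. It works out how the images are shared between the three splits and checks the "8.4 objects per image" average. The constant and function names are ours, chosen for illustration, and the box and image counts are the rounded values from the quote (14.6M and 1.74M).

```python
# Figures quoted in the article for Open Images V4.
SPLITS = {"train": 9_011_219, "validation": 41_260, "test": 125_436}
TRAIN_BOXES = 14_600_000         # "14.6M bounding boxes" (rounded)
TRAIN_BOXED_IMAGES = 1_740_000   # "1.74M images" (rounded)


def split_fractions(splits):
    """Return each split's share of the total image count."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


def boxes_per_image(boxes, images):
    """Average number of bounding boxes per annotated image."""
    return boxes / images


total_images = sum(SPLITS.values())
print(f"total images: {total_images:,}")
for name, fraction in split_fractions(SPLITS).items():
    print(f"{name:>10}: {fraction:.2%}")
print(f"boxes per annotated training image: "
      f"{boxes_per_image(TRAIN_BOXES, TRAIN_BOXED_IMAGES):.1f}")
```

The training split holds roughly 98% of all images, so the validation and test sets are small in relative terms even though they contain tens of thousands of images each.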
Along with this dataset release, Google has announced the ‘Open Images Challenge 2018’. This is scheduled to be held at the European Conference on Computer Vision and will be an object detection challenge. This latest competition offers a far broader range of object classes than any previous challenge. It will have two tracks:
- Object Class Detection: predicting a tight bounding box around all instances of the 500 classes
- Visual Relationship Detection: detecting pairs of objects in particular relations, e.g. “woman playing guitar”. This is done by adding a large number of images with multiple object annotations
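Both tracks revolve around bounding boxes, and a "tight" box is usually judged by how well it overlaps the ground-truth box. The standard measure for this in object detection is intersection over union (IoU). The sketch below is a minimal illustration, assuming boxes are given as (x_min, y_min, x_max, y_max) corner coordinates. The `Box` type and the function names are ours, not part of any official challenge toolkit.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in corner format (an assumed convention)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    width = max(0.0, box.x_max - box.x_min)
    height = max(0.0, box.y_max - box.y_min)
    return width * height


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, in the range [0, 1]."""
    overlap = Box(
        max(a.x_min, b.x_min),
        max(a.y_min, b.y_min),
        min(a.x_max, b.x_max),
        min(a.y_max, b.y_max),
    )
    intersection = area(overlap)
    union = area(a) + area(b) - intersection
    return intersection / union if union > 0 else 0.0


predicted = Box(1, 1, 3, 3)
ground_truth = Box(0, 0, 2, 2)
print(f"IoU: {iou(predicted, ground_truth):.3f}")
```

A prediction is typically counted as correct when its IoU with a ground-truth box of the same class exceeds a chosen threshold. The exact threshold and matching rules are set by the challenge organisers and are not reproduced here.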
The deadline for submission of results is 1st September, 2018. The evaluation metric for this challenge will be mean Average Precision (mAP) over the given 500 classes.
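For readers new to the metric, here is a simplified sketch of how mean Average Precision can be computed. It assumes each detection has already been marked as a true or false positive (for example by an overlap test against the ground truth), and it uses the common "area under the interpolated precision-recall curve" definition of AP. The official challenge protocol may apply additional rules that this sketch does not cover, and all names below are illustrative.

```python
def average_precision(detections, num_ground_truth):
    """AP for one class.

    detections: list of (confidence_score, is_true_positive) pairs.
    num_ground_truth: number of ground-truth objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0  # simplification: AP is undefined without ground truth

    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    precisions, recalls = [], []
    true_pos = false_pos = 0
    for _, is_true_positive in ranked:
        if is_true_positive:
            true_pos += 1
        else:
            false_pos += 1
        precisions.append(true_pos / (true_pos + false_pos))
        recalls.append(true_pos / num_ground_truth)

    # Interpolate: precision at each rank becomes the best precision
    # achieved at that rank or any later (higher-recall) rank.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum precision weighted by each step's increase in recall.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class):
    """per_class maps class name -> (detections, num_ground_truth).

    Classes with no ground-truth objects are skipped.
    """
    aps = [average_precision(dets, n) for dets, n in per_class.values() if n > 0]
    return sum(aps) / len(aps) if aps else 0.0


# Toy data for two hypothetical classes.
toy = {
    "guitar": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "woman": ([(0.6, True)], 1),
}
print(f"mAP: {mean_average_precision(toy):.3f}")
```

In the challenge the mean would be taken over all 500 classes, so a model that does well on common classes but misses rare ones is penalised just as heavily as the reverse.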
This is the fourth update the team has released in the last 2 years. You can download the dataset from Google’s page here.
Our take on this
This is a treasure trove for data scientists! Anyone interested in deep learning and image classification can download and work on this dataset. The fact that Google has worked on labelling the images is a testament to their team and to the power of their resources. The training set, with its massive size, is expected to stimulate research on more complex detection models. The hope is that this release will help improve current state-of-the-art models.
Their open challenge is already generating a huge buzz in the ML community and we are expecting to see some serious competition. We will be sure to cover any major projects that come up in this challenge.
Whether you’re a newcomer to image processing or have been working in this field for a while, this dataset is perfect for you. Use the comments section below to tell us how you plan on using this!