DataHack Radio Episode #1 – The World of Machine Learning Competitions with Kaggle CEO Anthony Goldbloom!
We are incredibly excited to announce the launch of DataHack Radio! This is Analytics Vidhya’s exclusive podcast series which will feature Kunal Jain in conversation with top leaders and practitioners in the data science and machine learning industry.
In every episode of DataHack Radio, we will bring you discussions with one such thought leader in the community. We will have discussions about their journey, their learnings and plenty of other data science related things. We are sure this will be a lot of fun and full of learning for each of you.
Click the link below to listen to the full podcast! The podcast is also available on our SoundCloud page. Happy listening!
In the first ever episode of the DataHack Radio podcast, we host Kaggle’s co-founder and CEO Anthony Goldbloom. He has been featured in Forbes’ ‘30 Under 30’ list and has been mentioned in leading publications such as the New York Times and the Wall Street Journal, among others.
Kunal asked him a variety of questions covering diverse fields – from Kaggle’s history to their future plans, how competitions are planned, current and future trends in machine learning, and much, much more!
Below are the key points that were discussed in the podcast. Enjoy!
Anthony’s Background and how Kaggle was Founded
Anthony trained as a statistician and econometrician at the University of Melbourne in Australia. His first job out of college was at the Australian Treasury, forecasting GDP, inflation, unemployment, etc. In 2008, he got an internship at The Economist in London.
Kaggle started out as a competitions platform (of course, they do more than that now). They took problems from companies and put them up on their website so that statisticians and data scientists could compete to build high-powered solutions.
Kaggle’s Vision and Future
“We want Kaggle to be the first place data scientists look at when they wake up and the last thing they work on before shutting off their system.”
Anthony and his team have experimented with different models and approaches since Kaggle was founded. They ultimately want to be the place where data scientists and ML practitioners do all of their work, from projects to learning. The team plans to leverage Kaggle kernels to do this. We will see the addition of TPUs to kernels (hopefully) in the second half of this year to accelerate model processing.
For datasets, they are working towards making Kaggle a one-stop shop for all kinds of datasets. You won’t have to Google for specific datasets; you can head over to Kaggle and find them there.
And finally, Kaggle Learn. It has already seen a great response from the community and Kaggle’s team are working to make it more intuitive by adding more topics.
Competitions account for about a third of the activity on Kaggle. The other three areas (kernels, datasets and Kaggle Learn) are slowly growing and maturing well.
Kaggle also has private kernels and private datasets. If you want to work and collaborate within teams without having to publicly expose your work or data, these two features are invaluable additions.
They have a lot of organizational accounts but no way to administer them yet. This is something the team might look at in the future. Currently, there are a few other things they don’t have on the platform that they expect to add in the foreseeable future.
At this point, Kunal asked Anthony if Kaggle had any plans to build a GUI-based interface to make things easier for folks who aren’t so good at programming. As it turns out, Anthony has no such plans for now. Kaggle is focused on people who are comfortable working with code on the platform, and there are plenty of growth opportunities within that audience.
Since its acquisition by Google, Kaggle has been swamped with requests for hosting competitions. So how do they deal with these requests and pick and choose competitions?
Turns out they don’t pick and choose specific problems. If the problem is well specified, the data is large enough and it’s a supervised learning problem, they accept it. The only roadblock is the number of requests. The pipeline/queue is getting longer so they are hiring folks to deal with that.
There are a couple of things we might see in the next 12-18 months. The first is self-service competitions for researchers and internal company employees. The idea is for these people to set the competitions up themselves for internal purposes so they don’t rely on Kaggle’s team to do that.
The second is a lot more competitions that require solutions to be submitted as Kaggle kernels. Topics like constrained compute, deep learning and reinforcement learning will be the major focus for these.
How do Kaggle competitions relate to the industry?
One of the most pertinent questions is how do the solutions generated through Kaggle competitions get used in real-life industry situations?
To prevent individuals and teams from making overly complex models (like ensembling 5 different models), competitions are usually limited to a maximum of 3 months. Also, team merging is not allowed after a certain point so you can’t merge two completely different approaches at the last minute. It’s a subtle point but extremely powerful.
When winners present their results to the customer, Kaggle asks them to present a full version of their solution along with a simplified version. This helps the client understand what is going on behind the scenes.
“Most of the data scientists in the next 10 years are not data scientists today.”
Anthony explained why Kaggle has made inroads into the learning phase of a data scientist’s journey. With Kaggle Learn and the various webinars they have hosted recently, they want aspiring data scientists to learn from Kaggle and stay within the Kaggle ecosystem to apply what they have learnt.
Anthony’s Views on Open Source Data Science
Anthony believes Kaggle has helped various algorithms and packages gain massive adoption in the wider data science community. XGBoost is the perfect example to illustrate this point. It rose to prominence on Kaggle and took off when people moved from using Random Forest to XGBoost to win competitions.
As far as companies using open datasets are concerned, Kaggle is exploring options that enable companies to join existing open datasets to their own private ones.
What is the future of Data Science and AI?
Anthony believes that companies outside the big giants like Google will have the resources to get their code and models into production.
The power of AI will eventually lead to a lot of unemployment in repetitive jobs such as accounting and auditing. The technology is already there, but the products required to enable it are not yet available. Once they are, we should start seeing a shift in the industry.
Rapid Fire Round!
KJ: What is one competition you wanted to host but haven’t been able to yet?
AG: I’m passionate about weather. I would love to host a weather forecasting competition.
KJ: Most satisfying moment in the Kaggle journey?
AG: There are two that come to mind. First, we’ve made machine learning more effective by helping spread which techniques work and which ones don’t. Second, Kaggle’s competitions guard against overfitting to the leaderboard because the test dataset is divided into public and private portions.
KJ: What is one thing you would want to change in the Kaggle journey?
AG: Probably the name of the company (listen to the podcast to find out why!).
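The public/private test split Anthony mentions in the rapid fire round can be illustrated with a small sketch. This is only a toy illustration, not Kaggle's actual implementation: the function names, the 30% public fraction and the toy data are all assumptions made for this example. The idea is that participants only see their score on the public rows during the competition, while the final ranking uses the private rows.

```python
import random


def split_public_private(test_ids, public_fraction=0.3, seed=0):
    """Randomly partition test row ids into public and private subsets.

    public_fraction is a hypothetical value chosen for this sketch.
    """
    rng = random.Random(seed)
    ids = list(test_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * public_fraction)
    return set(ids[:cut]), set(ids[cut:])


def accuracy(predictions, labels, ids):
    """Fraction of correct predictions over the given row ids."""
    ids = list(ids)
    return sum(predictions[i] == labels[i] for i in ids) / len(ids)


# Toy test set: 100 rows with binary labels.
labels = {i: i % 2 for i in range(100)}
# Toy submission: deliberately wrong on every row whose id is a multiple of 5.
predictions = {i: (i % 2 if i % 5 else 1 - i % 2) for i in range(100)}

public_ids, private_ids = split_public_private(labels.keys())

# Score shown on the leaderboard while the competition is running.
public_score = accuracy(predictions, labels, public_ids)
# Score used for the final ranking, hidden until the competition ends.
private_score = accuracy(predictions, labels, private_ids)

print(f"public: {public_score:.3f}, private: {private_score:.3f}")
```

Because the private rows never produce feedback during the competition, tuning a model against the public score alone does not guarantee a good final rank.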
It was a fascinating talk and we learned a lot about Kaggle, the thinking and processes that drive the popular platform, and what plans Anthony has for its future. We highly encourage you to listen to the entire conversation in this podcast!
Did you like this exclusive podcast with Anthony Goldbloom? Leave your suggestions and feedback in the comments below!