Kaggle Grandmaster Series – Notebooks Grandmaster and Rank #2 Dan Becker’s Data Science Journey!

avcontentteam 11 Dec, 2020 • 9 min read

“If I were explicitly trying to be in the top 1%, I might have given up before I got there. It’s such a hard goal that I’d have given up thinking I’d never get there.” - Dan Becker

I am pretty sure this quote takes many of you back to the moment you dropped off your Kaggle journey: unable to reach the top 1%, you decided there was no point in practicing.

Well, the Kaggle Grandmaster series is back with yet another interview, and this time we have Dan Becker with us.

Dan is a Kaggle Notebooks Grandmaster and currently holds the 2nd rank in this category. His notebooks are not only widely referred to by DS beginners but are also part of the free courses on Kaggle Learn. He is also a Kaggle Datasets and Discussions Expert.

Dan is the founder of a company called Decision.AI that helps data scientists translate their AI models into optimal business results. Before this, he worked as a Data Scientist at Google too! Pretty amazing, right?

In this interview, we cover a range of topics, including:

Dan Becker’s Transition from Economics to Data Science
Dan’s Kaggle Journey from Scratch to becoming a Grandmaster
Dan’s Advice to the Beginners in Data Science

So, go through this interview and absorb all you can!

Dan Becker’s Education and Work

Analytics Vidhya (AV): Your educational background involves a Ph.D. in Econometrics. Could you please tell us how you transitioned from economics to DS and what challenges you faced in this journey?

Dan Becker (DB): I started the transition to DS after reading a newspaper article about a Kaggle competition with a $3 million grand prize. I made a submission using conventional econometric techniques, and I was in the bottom 10% of the leaderboard. I still remember the bad feeling in my stomach when I first saw that result. I thought I was so good at modeling, and it was hard to accept that I was at the bottom. But it inspired me to learn and improve. Each night for the next year, I would improve my submission or learn more about machine learning. I moved up a few spots at a time, and I finished in 2nd place out of 1353 teams in that competition.

By the end, I had really completed my transition to becoming a data scientist.

AV: You worked as a DS at one of the best companies in the world – Google! What kind of skill sets and knowledge does it take to land a Data Scientist role at such a big company?

DB: It varies even from one person at Google to the next depending on the exact role.

I’d been a data scientist for 7 or 8 years by the time I joined Google. It’s part of my personality to always worry about falling behind, so I’ve never stopped learning. As a result, I had a pretty broad understanding of data science topics. In terms of the job interview itself, Google loves algorithms questions. The book “Cracking the Coding Interview” is the best resource for job interviews at a lot of these big tech companies.

AV: Post Kaggle, you founded Decision.ai, a tool to help data scientists to translate their AI models into optimal business results. Could you elaborate on how an AI model translates to business models?

DB: Decision AI is a tool for analysts and data scientists to help get more business value from the machine learning models they already build. Supervised machine learning models make predictions, but there’s typically very little rigor around how those predictions are used. Let me give you one example:

A data scientist builds a model that predicts which financial transactions are fraudulent. For one example transaction, the model says it’s 5% likely to be a fraud. Now the question is what do you do about this. Some people use a simple threshold, perhaps rejecting all transactions that are at least 10% likely to be a fraud.

The way you translate a prediction into a real-world action is called the “decision function.” Now the question is “what’s the optimal decision function?” For each transaction, you might want to account for how valuable the customer is, because that informs how bad it is to inconvenience them with a denied transaction. You’d want to compare that to the cost of accepting a fraudulent transaction, which might depend on the transaction amount.

So there’s all this business context you should consider in your decision function. We can’t automate finding the exact decision function. But we provide a tool so data scientists can rigorously optimize how they make these decisions.
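To make the idea concrete, here is a minimal sketch in Python contrasting a fixed-threshold rule with a decision function that weighs the two costs Dan mentions. This is not Decision AI's method; the `Transaction` fields, the function names, and the simple expected-cost formula are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    # All fields are hypothetical, chosen only for this illustration.
    fraud_probability: float  # model's predicted probability of fraud
    amount: float             # money lost if a fraudulent transaction is accepted
    customer_value: float     # assumed cost of wrongly denying this customer


def threshold_decision(tx: Transaction, threshold: float = 0.10) -> str:
    """Simple rule: reject anything at or above a fixed fraud probability."""
    return "reject" if tx.fraud_probability >= threshold else "accept"


def expected_cost_decision(tx: Transaction) -> str:
    """Pick the action with the lower expected cost under the assumed costs."""
    # Accepting costs the transaction amount if it turns out to be fraud.
    cost_of_accepting = tx.fraud_probability * tx.amount
    # Rejecting costs the customer's goodwill if the transaction was legitimate.
    cost_of_rejecting = (1 - tx.fraud_probability) * tx.customer_value
    return "reject" if cost_of_accepting > cost_of_rejecting else "accept"


# A small purchase by a valuable customer, and a large purchase by a low-value one.
small_loyal = Transaction(fraud_probability=0.12, amount=20.0, customer_value=50.0)
large_new = Transaction(fraud_probability=0.05, amount=5000.0, customer_value=10.0)

for tx in (small_loyal, large_new):
    print(threshold_decision(tx), expected_cost_decision(tx))
```

With these made-up numbers the two rules disagree on both transactions: the threshold rejects the small purchase from the valuable customer and accepts the large risky one, while the cost-aware rule does the opposite. Real cost estimates would come from the business context described above.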

This isn’t unique to fraud. We see it in supply chain management, preventive maintenance, pricing, health care, and other places.

Across a lot of use cases, people are surprised when they realize how much better they can get at making decisions. In many cases, they start by thinking this isn’t the data scientists’ job, and someone else should do it. But then they use our tool and realize how much profit they can add by being rigorous with decision optimization, even if it involves collaboration with other stakeholders.

Dan’s Kaggle Journey from Scratch to becoming a Grandmaster

AV: You’re a Kaggle Notebooks Grandmaster and currently ranked 2nd. First of all, hats off to you. This is beyond amazing! Here’s a question a LOT of people would love to know the answer to: What is your framework and strategy for creating an expert-level notebook? Is there a checklist?

DB: I don’t have a checklist. A lot of my notebooks are featured in Kaggle Learn courses, and that’s partly responsible for the attention they get.

In general, I divide notebooks into two categories:

One category of notebooks is educational. Those should be about a specific technique. For example, you could do a notebook about how to use Seaborn for data visualization. In that, I wouldn’t add a bunch of pandas or scikit-learn stuff, because other materials are just distracting.

Ideally, the notebook would explain your mental model for seaborn, rather than being just a long list of examples. That way, after reading your notebook, I could figure out how to do things for myself.

The second type of notebook is curiosity-driven. These will usually get fewer votes, but I personally like them. For example, I might wonder what the trend is in wildfires over time. I find a dataset, and then make a couple of graphs to start answering that question. Usually, the first graph will bring up new questions. So I’ll create more graphs to answer those.

AV: That’s just amazing, Dan. Now, what were the challenges you faced initially when you started Kaggling, and how did you overcome them?

DB: Initially, my challenge was that I wasn’t very good. I didn’t expect to end up in the top 1%, but I enjoyed improving. That helps me keep working every day. If I were explicitly trying to be in the top 1%, I might have given up before I got there. It’s such a hard goal that I’d have given up thinking I’d never get there.

Kaggle competitions also have more top competitors now than when I started 10 years ago. I don’t think that’s a great path to professional growth for most people, and finding a community of people you can learn with seems more promising to me.

AV: You currently have more than 180 notebooks which are widely referred to by DS beginners. Did you plan to focus on Notebooks specifically, and what are your criteria to choose a topic for a notebook?

DB: I created some notebooks for the free courses at Kaggle Learn, so many of my notebooks are directly from that. Now that I’m not doing that, my notebooks are almost always driven by a curiosity about a real-world question.

AV: Since 180+ is a huge number, which 5 notebooks are your favorites that you would recommend to our community?

DB: I created a machine learning explainability course at https://www.kaggle.com/learn/machine-learning-explainability and those are easily my favorite notebooks.

AV: Since you’ve seen Kaggle grow from the start to what it is now, can you tell us a couple of milestones that you felt were a key part of your journey?

DB: Finishing in 2nd place in the Heritage Health Prize is easily my biggest personal milestone.

I also competed in the first competition that used deep learning techniques. This was before tools like Keras, PyTorch or TensorFlow existed. I used a library called PyLearn2. I also made my first couple of open-source contributions to PyLearn2 as part of that competition.

Dan’s Advice to the Beginners in Data Science

AV: As an industry leader in DS and ML, what advice would you give to beginners so that they can excel in the industry?

DB: I think it’s a mistake to learn a lot of theory first and then start doing projects. I see people who have spent years becoming data scientists and they still don’t know much about how things work in practice.

Instead, I’d favor learning the bare minimum you need to try a project like a Kaggle competition. Then learn more theory after you have the practical experience to understand where theory fits in.

Also, you absolutely need to learn how to use Git and to collaborate with other people.

Finally, learn to use Pandas well.

Most data scientists spend 10X more time manipulating and cleaning data than they do with fancy algorithms. Deep learning may be fun, but Pandas is more practically useful.

Most people I know who are trying to hire data scientists have lamented the shortage of data scientists who can work quickly with Pandas.

AV: Kaggle is widely used and accepted as a stepping stone to become a successful DS. What advice would you give to beginners so that they can fully leverage this platform?

DB: Some people come to Kaggle with the goal of achieving a certain rank to help them get a job. That approach is a mistake. Rankings won’t get you a job unless you win a competition or get very close to it. But 99.9% of participants won’t achieve that.

Fortunately, Kaggle is a great place to learn. I’d emphasize learning from others. Team up with people in competitions, or share your notebooks broadly to get feedback and advice from others.

Find datasets about topics you find interesting and create your own projects to share. Kaggle’s probably the best place in the world to learn by doing. If you don’t think you are ready for that, start with the courses on Kaggle Learn.

AV: People often participate in hackathons and even get good results, but most struggle when it comes to translating that into industry or business work. Based on your experience, what advice would you give them to bridge this gap?

DB: That’s a hard but important question. There are many parts to success in solving business problems that you don’t deal with in a hackathon or hobby project. If you can do it, getting a data science or analytics job will help by exposing you to these issues. I think that should be your #1 goal.

Aside from that, for each project, you should spend a little time understanding how decisions are made today and how you can help. If a decision is made by a person, you might start by creating some graphs they’d find useful. Then see if you can send them those graphs and start a conversation. This is less fun than making a machine learning model. But you know that no one is going to engage in a conversation with you based on your emailing them a model. So I’d just try to get closely involved with real decision-making processes. That’s still hard to do though.

AV: You’re someone whose work everyone looks forward to. Can you name five Data Science experts whose work motivates you?

DB: I expect reinforcement learning will be hugely impactful in the future (even if it isn’t yet), so I enjoy reading about Sergey Levine’s research. It’s a little research-focused, but the BAIR blog is one of my favorites.

I hugely respect Thomas Wiecki and everyone who is making Bayesian approaches more broadly usable.

I collaborated with Tim Salimans in a Kaggle competition. He’s absolutely brilliant. We haven’t stayed in touch, but I’m always excited when I see his research.

Susan Athey is combining econometrics and machine learning in a way that I absolutely love.

Andrew Gelman is very insightful about how to use data. He’d call himself a “statistician,” but I don’t think the distinction between statistics and data science is very useful.

End Notes

That was a pretty heavy and inspirational interview. We hope you were able to absorb what was said in this interview and that it helps you in the course of your data science journey.

This is the seventh interview in the Kaggle Grandmaster Series. We recommend you go through some of the previous interviews too.

What did you learn from this interview? Are there other data science leaders you would want us to interview? Let me know in the comments section below!
