Have you ever wondered how Coursera uses data science? Sure, we have all taken (or heard of) their courses, but how does the platform make use of all the data it generates? How does their pricing strategy work for certificates? How is their data science team structured? What tools and techniques do they use?
These are just some of the questions answered in this phenomenal podcast. We had the pleasure of hosting Coursera’s Head of Data Science, Emily Glassberg Sands, who gave us a detailed and thorough explanation of how Coursera functions behind the scenes. It makes for fascinating listening, and you will pick up plenty of data science and machine learning along the way.
In this article, we have listed a few snippets from Kunal’s conversation with Emily on DataHack Radio.
Emily joined Coursera over four years ago after completing her Ph.D. in Economics at Harvard University. Before that, she earned her undergraduate degree, also in Economics, from Princeton University. She co-authored two papers during her Ph.D. which make for very interesting reads (they are available on her LinkedIn profile and are very relevant for data scientists).
She had not initially applied to be a data scientist at Coursera, instead opting to apply to the Partnerships division. She liked Coursera’s mission, and during her first few months there she worked with various machine learning folks (half of the original team consisted of machine learning researchers from Stanford). Emily was the first data science hire at Coursera and is now their Head of Data Science!
Her research during graduate school was done exclusively in MATLAB. She taught herself R and Python from books before joining Coursera (courses back then weren’t as good as they are now) and learned quite a lot on the job as well.
Coursera’s Data Science Team
Emily elaborated on Coursera’s growth since she joined in 2014. The platform has seen a rapid rise in both the number of users and the amount of content it hosts. The company has grown to over 300 people across the organization, with the product and engineering department making up half of that number.
Emily’s data science team currently consists of the following three roles, listed with their functions:
- Data engineers: Building the warehouse and core data lake, internal self-serve analytics, analytics products
- Decision scientists: Modeling and experimentation
- Data scientists: Focused on data products and how to leverage them for Coursera’s business model
So which language(s) does Coursera’s data science team use for day-to-day tasks? R and Python! Everyone on the team is expected to know and work with these two popular languages.
Techniques and Approaches used by the Data Science Team
For the content discovery aspect that we see on Coursera’s platform, it’s important for the data science team to establish the right metadata (understanding features of the content and Coursera’s users). This is followed by building relatively lightweight logistic regression models to recommend the best content to the end user. There are also a few black box models which the team experiments with.
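To make the "lightweight logistic regression" idea concrete, here is a minimal sketch in plain Python. It is not Coursera's model: the two features (`skill_overlap`, `same_language`), the toy data, and the course names are all assumptions made up for illustration. It fits a small logistic regression by gradient descent and ranks candidate courses by predicted enrollment probability.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, x):
    """Predicted probability of enrollment; weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = score(weights, x) - y  # gradient of log loss w.r.t. the logit
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

# Hypothetical training data: [skill_overlap, same_language] -> enrolled (1) or not (0)
rows = [[0.9, 1], [0.8, 1], [0.7, 0], [0.2, 1], [0.1, 0], [0.3, 0]]
labels = [1, 1, 1, 0, 0, 0]
weights = train_logistic(rows, labels)

# Rank hypothetical candidate courses for one learner by predicted probability
candidates = {"course_a": [0.85, 1], "course_b": [0.15, 1]}
ranked = sorted(candidates, key=lambda c: score(weights, candidates[c]), reverse=True)
print(ranked)
```

In practice a team would more likely reach for a library such as scikit-learn; the hand-rolled version is used here only to keep the example self-contained.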
But the biggest boon has been what Emily calls “skills graphs” or “knowledge graphs”. These contain a bunch of metadata about content and learners. When you combine the two, you get very powerful predictions that Coursera leverages in its end product. Building the graph, however, is a bit complex. It maps about 40,000 skills to the available content types, and this is what powers various machine learning models that work on recommendations.
The team has been working on this knowledge graph for almost two years. At this point Emily offered some valuable insights into what a data scientist should expect: it wasn’t a one-step process. They built out one part of the knowledge graph, tried it out in a practical situation, and then went ahead with building the rest, piece by piece. If you are interested in NLP or just want to know how a data science project functions from scratch, you should listen to this section of the podcast.
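As a rough illustration of how a skills graph can drive recommendations, the sketch below stores skill-to-course edges with a relevance weight and ranks courses for a learner's target skills. The skills, course IDs, weights, and the `courses_for_skills` helper are all hypothetical; the podcast does not describe the actual data model.

```python
# Hypothetical edges: skill -> {course_id: relevance weight}
skill_to_courses = {
    "regression": {"stats-101": 0.9, "ml-intro": 0.6},
    "python": {"py-basics": 1.0, "ml-intro": 0.5},
    "nlp": {"text-mining": 0.8},
}

def courses_for_skills(target_skills, graph):
    """Sum edge weights over the target skills and rank courses by total."""
    totals = {}
    for skill in target_skills:
        for course, relevance in graph.get(skill, {}).items():
            totals[course] = totals.get(course, 0.0) + relevance
    # Highest total first; ties broken alphabetically for stable output
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

ranking = courses_for_skills(["regression", "python"], skill_to_courses)
print([course for course, _ in ranking])
```

A course that covers several of the learner's target skills accumulates weight from each edge, which is why `ml-intro` ranks first here even though neither of its individual weights is the largest.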
Collaborative filtering, one of the most popular recommendation techniques, was what Coursera used initially, before the team started using tons of metadata to refine and improve on it. If you have the amount of information and data that the team has collected over the years, why not use it to improve the experience for its users?
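For readers new to collaborative filtering, here is a minimal item-based sketch. It uses only who enrolled in what (no metadata), computes cosine similarity between courses, and recommends the courses most similar to those a learner has already taken. The learners, courses, and function names are invented for illustration and are not taken from Coursera's system.

```python
import math

# Hypothetical enrollment history: learner -> set of courses taken
enrollments = {
    "ana": {"ml", "stats", "python"},
    "ben": {"ml", "python"},
    "cy": {"stats", "econ"},
    "dee": {"ml", "stats"},
}

def item_similarity(enrollments):
    """Cosine similarity between courses on binary enrollment vectors."""
    learners_by_course = {}
    for learner, courses in enrollments.items():
        for course in courses:
            learners_by_course.setdefault(course, set()).add(learner)
    sims = {}
    for a, la in learners_by_course.items():
        for b, lb in learners_by_course.items():
            if a != b:
                sims[(a, b)] = len(la & lb) / math.sqrt(len(la) * len(lb))
    return sims

def recommend(learner, enrollments, top_n=2):
    """Score each unseen course by its summed similarity to courses already taken."""
    sims = item_similarity(enrollments)
    taken = enrollments[learner]
    all_courses = set().union(*enrollments.values())
    scores = {
        c: sum(sims.get((c, t), 0.0) for t in taken)
        for c in all_courses - taken
    }
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]

print(recommend("ben", enrollments))
```

Here "ben" is steered toward "stats" because learners who took his courses also tended to take it. Adding metadata, as the team later did, would help in cases this approach handles poorly, such as brand-new courses with no enrollment history.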
The Thinking behind Coursera’s Certificate Pricing
Emily wanted to improve the credential value of Coursera’s certificates when she started working there. Her working hypothesis, before she had a look at the data, was that people in developing countries would value the certificate as it would help them in their job applications. The data revealed otherwise – people in these countries were not taking these certificates.
Using various econometric and data science techniques, including regression analysis and fixed effects models, she discovered that learners in developing markets were far more price sensitive than learners in developed markets. It sounds pretty intuitive now but was not something the team had thought of back then.
Now came the question of how to price these certificates. Emily and her team ruled out A/B testing from the outset because of how unfair that could be for some learners. Instead, they went with a quasi-experimental design: Emily took a bundle of developing markets and applied synthetic control methods and difference-in-differences analysis. She describes this in much more detail in the podcast, and it makes for really interesting listening.
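The core arithmetic of a difference-in-differences estimate is simple enough to sketch. The numbers below are made up (imagine certificate purchase rates, in percent, before and after a price change), and the function name is an assumption. The estimate is only meaningful if the treated and control markets would have followed parallel trends without the change. Synthetic control, which builds a weighted comparison group, is not shown here.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Made-up purchase rates (%) per market, before and after the price change
treated_pre = [2.0, 2.2, 1.8]    # markets that received the new price
treated_post = [3.1, 3.3, 2.9]
control_pre = [4.0, 4.2, 3.8]    # comparison markets, price unchanged
control_post = [4.4, 4.6, 4.2]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(round(effect, 2))
```

Subtracting the control group's change removes whatever shifted purchase rates everywhere over the same period, leaving the part attributable to the price change under the parallel-trends assumption.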
Analyzing and Dealing with Course Dropouts
This is a pretty common issue in today’s MOOC-dominated world. People start a course but leave halfway through for various reasons. Emily discussed how Coursera divides its approach to this into two parts:
- On the learner level
- On the instructor level
After experimenting with a predictive model that identified at-risk learners, the team then built several advanced models to recognize the reason why learners were dropping off at certain points. They have built several “intervention points” which help them understand where the user might need a nudge, or extra help, and then take action accordingly.
On the instructor side of things, Coursera lets instructors run A/B tests to understand their audience. Two versions of a course are created on the backend, and the instructor can edit one of them to make it a bit different. Then, when learners enroll in the course, each one is randomly shown one of the two versions.
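One common way to implement this kind of random assignment is to hash the learner and course IDs, so that each learner lands in a stable bucket without any stored state. The podcast does not say which randomization method Coursera uses, so treat the sketch below, including the `assign_version` name and the ID formats, as an assumption.

```python
import hashlib

def assign_version(learner_id: str, course_id: str) -> str:
    """Deterministically map a learner to version "A" or "B" of a course."""
    key = f"{course_id}:{learner_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # The first byte of the hash is effectively uniform, so its parity splits ~50/50
    return "A" if digest[0] % 2 == 0 else "B"

# Hypothetical IDs: the same learner always sees the same version of a course
print(assign_version("learner-42", "course-ml"))

counts = {"A": 0, "B": 0}
for i in range(1000):
    counts[assign_version(f"learner-{i}", "course-ml")] += 1
print(counts)
```

Including the course ID in the hash key means a learner's bucket in one course is independent of their bucket in another, which keeps experiments in different courses from being correlated.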
Building a Data Science Team
“Team building, for me, is about collecting a superstar team.”
Emily looks at a number of components when building her data science team. I have mentioned a few crucial ones below:
- Role Fit: A very critical aspect when hiring any potential employee
- Identify Core Strengths: Candidates should have a broad base, but also a good level of expertise or depth in one domain
- Candidate’s Career Growth Chart: What is the person’s growth plan, where do they see themselves in a few years, how will their career evolve, etc.
- Soft Skills: A desire to learn and continue learning, and collaboration ability with multiple cross-functional teams
Women in Data Science
Analytics Vidhya has featured Emily in the “most influential women in data science” list two years running now. She is an advocate of diversity in data science and enlightened us with her opinions on the subject in this podcast. Below are a few pointers she mentioned which hiring companies should take note of.
When crafting a job description, she makes sure it’s inclusive and gender neutral. She also believes it’s important to invest in diversity early, when you have a small team. Co-hosting events with the Women Who Code organization has also worked out really well for Coursera.
Small things in the interview process also make a world of difference: making sure that all candidates are asked the same questions, ensuring interviewers cannot see each other’s feedback until they have submitted their own, etc. There are a lot more really thoughtful points in the podcast.
The Difference in Having an Economics vs. Computer Science Background
“Both are tremendously valuable in isolation, and particularly valuable when combined.”
Emily agreed with Kunal’s assessment that people who come from an economics background tend to have more intuitive insights and hypotheses than folks who come from computer science degrees. She mentioned that each background has its own unique value when taken separately, but the two become truly valuable when they’re combined.
Her approach while building the data science team at Coursera has been to focus on a mix of backgrounds. The current setup has mathematicians, statisticians, computer scientists, and people from a host of diverse quantitative backgrounds. She firmly believes that focusing on one field is a recipe for failure. Data science is not a domain where you have obvious solutions. It takes a mix of ideas to extract insights from the challenges data poses.
This article isn’t enough to describe how insightful the podcast is. Emily is a great speaker, and that came through clearly in her conversation with Kunal. I had personally wondered how Coursera’s recommendation engine worked, and this podcast emphatically answered my question.
We found out so much more about the platform, and the conversation adds a fresh perspective on how a leading education provider thinks and works using the power of data science and machine learning. You will learn a LOT of new things in this hour-long podcast. Take out the time to hear Emily Glassberg Sands and get ready to be inspired by a true thought leader.