How Coursera uses Data Visualization and Clustering to Categorize Content
Coursera is one of the biggest educational platforms in the world. Launched in 2012 by Andrew Ng, the content on Coursera has increased multi-fold since. In fact, Andrew Ng’s Introduction to Machine Learning course has spawned thousands of careers in data science.
But with increased content comes the challenge of defining which category each course belongs to. This is such a critical aspect of a platform that serves up pinpoint recommendations to its users.
Courses on Coursera cover topics ranging from photography to probabilistic graphical models to constitutional struggles in the Muslim world. This diversity makes them hard to categorize. A couple of years ago, we overhauled our course categories and implemented a new categorization system we call domains and subdomains. This article covers how we defined and implemented that new system.
Table of Contents
- The Previous Course Categories
- t-SNE to the Rescue
- The General Structure of Coursera’s Content
The Previous Course Categories
Coursera’s original categorization scheme dated back to our founding in 2012, and was heavily influenced by the content available at the time. For example, we had five categories of computer science subfields, but only one category for all of the humanities. The categories were also manually and arbitrarily defined, resulting in redundancies (e.g., “Food and Nutrition” being nearly a subset of “Health and Society”) and vagueness (e.g., “Information, Tech & Design”).
Critically, the original categorization scheme was not meeting our need of effectively matching learner to content. For example, the “Medicine” category attracted two distinct groups of learners — because it contained two distinct groups of courses. The first were courses that appealed to healthcare practitioners (e.g., on clinical kidney transplantation or biocontainment for infectious diseases). The second were courses on public health issues that appealed to non-practitioners.
As our catalog expands to thousands of courses, we need a principled organization technique. We want categories that help the learner find the best content for them. This translates to the following criteria:
- Simple (as few categories as possible)
- Minimally redundant (as mutually exclusive as possible)
t-SNE to the Rescue
Rather than re-coding by hand, or replicating traditional university departments, we took a data-driven approach.
We sought to group our courses so that someone interested in one course in that group, say, playing the guitar, might also be interested in other courses in that group, say, songwriting or jazz improvisation. The algorithm known as t-distributed stochastic neighbor embedding (t-SNE) satisfies this requirement.
t-SNE identifies an arrangement of courses such that courses sharing common learners are close and courses that do not share common learners are far apart. For example, Complex Analysis and Galois Theory are close together since many learners take both, while Taking Care of Horses and General Relativity are farther apart as the two courses do not share many learners.
We utilized the t-SNE algorithm on courses in 2015 to produce the scatter plot output shown below. Each dot represents a single course. We then grouped these courses into categories by clustering (represented by the coloring).
The General Structure of Coursera’s Content
Looking at Figures 1 and 2, the first thing we see is that courses are organized in a globally consistent way: humanities, social sciences, and business courses are in the top right half of the plot, while natural sciences, engineering, and computational sciences courses fall in the bottom left half.
Digging in to a more granular level reveals additional nuance:
- Courses on business and finance are clustered together on the right
- Courses about the natural sciences (physics, chemistry, and biology) are on the left
- Courses on the computational sciences (math, cs, and statistics) are at the bottom
- Courses on the social sciences and humanities are at the top
Dissecting each of these large regions further, we see that even within each grouping, courses are arranged logically. For example, courses in natural history span a continuum roughly from the biological sciences to the physical sciences. Similarly, courses in humanities and social science range roughly from music to the visual arts, humanities, and social sciences, and then practical business.
Even at the level of individual courses, t-SNE captures the interdisciplinary nature of courses. Courses in business law, for example, fall on the boundary between business and law, and courses on quantitative methods for the social sciences fall between math and the social sciences.
Returning to the set of courses previously categorized as “medicine,” we now have three sub-categories. First is a relatively disjointed cluster of courses targeted to the medical professional (e.g., “Ebola: Essential Knowledge for the Healthcare Professional”). Second is a cluster of courses on healthcare policy (e.g., “Systems Thinking in Public Health”), and lastly, we have a cluster on basic biology (e.g., “Introduction to Genetics and Evolution”).
The result is noteworthy because there is no strong reason why t-SNE should arrange our courses by subject matter. We weren’t feeding in course descriptions or transcripts, just the enrollment behavior of learners. We attribute the clusterability of courses to the fact that learners are much more likely to be interested in multiple courses in a particular subject area rather than to be influenced in their course decision by non-subject-matter factors such as the style of instruction or the institution offering the course.
That said, this assumption does not hold across the full catalog. For example, among non-English content, the language of instruction is more a driver of enrollment than the subject area. Correspondingly, a French or Russian course is more likely to be grouped with other courses taught in the respective language than it is to be grouped with other courses on the same subject.
After some clean up, we landed at 36 course clusters, which rolled up into nine larger clusters. On the Coursera platform today, we term the original clusters “subdomains” and the larger clusters “domains.” We launched the new system of domains and subdomains in summer 2015. In the past three years, it has become an integral part of Coursera’s content discovery experience, allowing us to scale content categorization and drive personalization for each and every learner on our platform.
This post originally appeared on Coursera Engineering’s Medium page.
About the Authors
This post was a collaboration by the data science team at Coursera, the world’s leading platform for higher education. The team builds the statistical models and machine learning algorithms that power content discovery and help scale an engaging and personalized learning experience; leads the quantitative measurement, experimentation, and inference that informs the company’s product, business, and content strategy; and develops the analytics and direct data access for the platform’s university partners and enterprise customers. We believe the next generation of teaching and learning is personalized, accessible, and efficient — reaching a world of learners who need it — and are united in building that vision through data-informed decisions and data-powered products.