Here’s How to use Sankey Diagrams for Data Visualization

Sreedevi Last Updated : 16 Oct, 2024


This article was published as a part of the Data Science Blogathon.

Introduction to Sankey Diagram for Data Visualization

We often need to visualize how data flows between entities. For example, consider how residents have migrated from one country to another within the UK. Here, it would be interesting to see how many residents have migrated from England to, say, Northern Ireland, Scotland, and Wales.

Image source

From this Sankey diagram visualization, it is apparent that more residents have migrated from England to Wales than to Scotland or to Northern Ireland.

What is a Sankey diagram?

Sankey diagrams typically depict the flow of data from one entity (or node) to another.

The entity from/to where data flows is referred to as a node – the node where the flow originates is the source node (e.g. England on the left-hand side) and where the flow ends is the target node (e.g. Wales on the right-hand side). The source and target nodes are often represented as rectangles with a label.

The flow itself, represented by a straight or a curved path, is called the link. The width of the link is proportional to the quantity of flow. In the above example, the flow (i.e. migration of residents) from England to Wales is wider than that from England to Scotland or Northern Ireland, indicating that more residents migrated to Wales than to the other countries.

Sankey diagrams can be used to represent the flow of energy, money, costs, or anything else that has a notion of flow.

Minard’s classic diagram of Napoleon’s invasion of Russia is perhaps the most famous example of the Sankey diagram. This visualization using the Sankey diagram displays very effectively how the French army progressed (or dwindled?) on its way to Russia and back.

Image source

Now, let’s see how we can use Python’s plotly library to plot a Sankey diagram.

How to plot a Sankey diagram?

For plotting a Sankey diagram, let’s use the Olympics 2021 dataset. This dataset has details about the medals tally – country, total medals, and the split across the gold, silver, and bronze medals. Let’s plot a Sankey diagram to understand how many of the medals a country won are Gold, Silver, and Bronze.

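import pandas as pd

# Load the medals tally (pandas reads .xlsx files through the openpyxl package)
df_medals = pd.read_excel("Medals.xlsx")
df_medals.info()  # info() prints its summary directly and returns None
# Give the columns more descriptive names
df_medals.rename(columns={'Team/NOC':'Country', 'Total': 'Total Medals', 'Gold':'Gold Medals', 'Silver': 'Silver Medals', 'Bronze': 'Bronze Medals'}, inplace=True)
print(df_medals)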

A basic plot

We will use the Sankey class from plotly’s graph_objects module (conventionally imported as go), which takes two parameters: node and link.

Note that all the nodes, both source and target, should have unique identifiers.

In this case:

The source would be the country. Let’s consider the top 3 countries (which are the USA, China and Japan) as the source nodes. Let’s mark these source nodes with the following (unique) identifiers, labels and colours:
- 0: United States of America: green
- 1: People’s Republic of China: blue
- 2: Japan: orange

The target would be the Gold, Silver and Bronze medals. Let’s mark these target nodes with the following (unique) identifiers, labels and colours:
- 3: Gold: gold
- 4: Silver: silver
- 5: Bronze: brown

The link (between the source and target nodes) would be the number of medals of each kind (Gold, Silver, Bronze). Each source has 3 links originating from it, one ending in each target (Gold, Silver and Bronze), so we will have a total of 9 links. The width of each link should be the corresponding number of Gold, Silver or Bronze medals. Let’s mark these links with the following source-to-target pairs and values:
- 0 (USA) to 3, 4, 5: 39, 41, 33
- 1 (China) to 3, 4, 5: 38, 32, 18
- 2 (Japan) to 3, 4, 5: 27, 14, 17
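The identifiers and link values listed above can also be derived instead of typed by hand. The following is a minimal sketch in plain Python, assuming the medal counts are held in a dict; the names `medals`, `medal_types` and `node_id` are illustrative and not part of plotly.

```python
# Medal counts quoted in the article (Gold, Silver, Bronze per country).
medals = {
    "United States of America": [39, 41, 33],
    "People's Republic of China": [38, 32, 18],
    "Japan": [27, 14, 17],
}
medal_types = ["Gold", "Silver", "Bronze"]

# A node identifier is just the position of its label in one combined list:
# countries first (0, 1, 2), then medal types (3, 4, 5).
labels = list(medals) + medal_types
node_id = {label: i for i, label in enumerate(labels)}

# One link per (country, medal type) pair.
source, target, value = [], [], []
for country, counts in medals.items():
    for medal, count in zip(medal_types, counts):
        source.append(node_id[country])
        target.append(node_id[medal])
        value.append(count)

print(source)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(target)  # [3, 4, 5, 3, 4, 5, 3, 4, 5]
print(value)   # [39, 41, 33, 38, 32, 18, 27, 14, 17]
```

Deriving the identifiers from a single label list keeps the nodes and links consistent if a country is added or removed.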

We will need to instantiate two Python dict objects to represent the

- nodes (both source and target): the labels and colors as individual lists
- links: the source node, target node, value (width), and color of the links as individual lists

and pass them to plotly’s go.Sankey.

In each of the lists (label, source, target, value, and color), the entry at a given index describes one node or one link.

import plotly.graph_objects as go

NODES = dict(
    # Node id:  0                           1                             2        3       4         5
    label = ["United States of America", "People's Republic of China", "Japan", "Gold", "Silver", "Bronze"],
    color = ["seagreen",                 "dodgerblue",                 "orange", "gold", "silver", "brown"],
)
LINKS = dict(
    source = [ 0,  0,  0,  1,  1,  1,  2,  2,  2],  # The origin or the source nodes of the link
    target = [ 3,  4,  5,  3,  4,  5,  3,  4,  5],  # The destination or the target nodes of the link
    value  = [39, 41, 33, 38, 32, 18, 27, 14, 17],  # The width (quantity) of the links
    # Color of the links
    # Target node: 3-Gold          4-Silver        5-Bronze
    color = ["lightgreen",   "lightgreen",   "lightgreen",     # Source node: 0 - United States of America
             "lightskyblue", "lightskyblue", "lightskyblue",   # Source node: 1 - People's Republic of China
             "bisque",       "bisque",       "bisque"],        # Source node: 2 - Japan
)
data = go.Sankey(node=NODES, link=LINKS)
fig = go.Figure(data)
fig.show()

Sankey diagram – a basic plot

Here we have a very basic plot. But do you notice how the diagram is too wide and Silver appears before the Gold? Let’s adjust the position of the nodes and the width.

Adjust the position of nodes and width of the diagram

Let’s add the x and y positions for the nodes to explicitly specify the positions of the nodes. The values should be between 0 and 1.

NODES = dict( #           0                               1                          2        3       4           5
            label = ["United States of America", "People's Republic of China",   "Japan", "Gold", "Silver", "Bronze"],
            color = [                "seagreen",                 "dodgerblue",  "orange", "gold", "silver", "brown" ],
            x     = [                         0,                            0,         0,    0.5,      0.5,      0.5],
            y     = [                         0,                          0.5,         1,    0.1,      0.5,        1],)
data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country &  Medals",  font_size=16)
fig.show()

With this, we get a compact diagram:

Sankey diagram – node position adjusted
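The y values above were chosen by eye. As a rough sketch, evenly spaced positions for a column of nodes could be computed with a small helper; `spread` is a hypothetical function written for this illustration (it is not a plotly API), and the 0.1 to 0.9 range is an assumption that keeps nodes slightly away from the edges.

```python
def spread(n, lo=0.1, hi=0.9):
    """Return n evenly spaced positions between lo and hi (inclusive)."""
    if n == 1:
        return [round((lo + hi) / 2, 3)]
    step = (hi - lo) / (n - 1)
    return [round(lo + i * step, 3) for i in range(n)]

n_countries, n_medal_types = 3, 3

# Countries in the left column, medal types in the middle of the canvas.
x = [0.0] * n_countries + [0.5] * n_medal_types
y = spread(n_countries) + spread(n_medal_types)

print(x)  # [0.0, 0.0, 0.0, 0.5, 0.5, 0.5]
print(y)  # [0.1, 0.5, 0.9, 0.1, 0.5, 0.9]
```

The resulting lists can be passed as the x and y entries of the NODES dict in place of the hand-written values.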

See below how the various parameters passed in the code map to the nodes and links in the diagram:

Sankey diagram – how code maps to diagram

Add meaningful hover labels

The plot is interactive. You can hover on the nodes and the links for more information.

Sankey diagram – with default hover labels

Currently, the information displayed in the hover labels is the default text:

- When you hover on a node, the node name, the number of incoming flows, the number of outgoing flows and the total value are displayed. For instance, the node United States of America has a total of 113 medals (= 39 Gold + 41 Silver + 33 Bronze), and the node Gold has a total of 104 medals (= 39 from the USA, 38 from China, 27 from Japan).
- When you hover on a link, the source node name, the target node name and the value of the link are displayed. For instance, the link from the source node USA to the target node Silver has 41 medals.
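The node totals shown on hover can be cross-checked by summing the link values that leave and enter each node. This is a small sketch using the same label, source, target and value lists as the link definition earlier; the `outgoing` and `incoming` names are illustrative.

```python
labels = ["United States of America", "People's Republic of China", "Japan",
          "Gold", "Silver", "Bronze"]
source = [0, 0, 0, 1, 1, 1, 2, 2, 2]
target = [3, 4, 5, 3, 4, 5, 3, 4, 5]
value = [39, 41, 33, 38, 32, 18, 27, 14, 17]

# Sum the link values leaving and entering every node.
outgoing = [0] * len(labels)
incoming = [0] * len(labels)
for s, t, v in zip(source, target, value):
    outgoing[s] += v
    incoming[t] += v

for i, label in enumerate(labels):
    print(f"{label}: out={outgoing[i]}, in={incoming[i]}")

# United States of America: out=113, in=0
# People's Republic of China: out=88, in=0
# Japan: out=58, in=0
# Gold: out=0, in=104
# Silver: out=0, in=87
# Bronze: out=0, in=68
```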

Don’t you think the labels are too verbose? They can be improved.

Let’s improve the format of the hover labels using the hovertemplate parameter:

- For the nodes, since the hover labels add nothing beyond what is already visible, let’s turn them off by passing a blank hovertemplate=" ".
- For the links, we can make the label concise in the format <country>-<medal type>.
- For both the nodes and links, let’s have the values displayed with the suffix “Medals”, e.g. 113 Medals instead of just 113. This can be achieved by using the update_traces function with an appropriate valueformat and valuesuffix.

NODES = dict(
    # Node id:  0                           1                             2        3       4         5
    label = ["United States of America", "People's Republic of China", "Japan", "Gold", "Silver", "Bronze"],
    color = ["seagreen",                 "dodgerblue",                 "orange", "gold", "silver", "brown"],
    x     = [0,                          0,                            0,       0.5,    0.5,      0.5],
    y     = [0,                          0.5,                          1,       0.1,    0.5,      1],
    hovertemplate = " ",  # Blank template: turn off the default node hover text
)

# One label per link, in the same order as the source/target lists below
LINK_LABELS = []
for country in ["USA", "China", "Japan"]:
    for medal in ["Gold", "Silver", "Bronze"]:
        LINK_LABELS.append(f"{country}-{medal}")

LINKS = dict(
    source = [ 0,  0,  0,  1,  1,  1,  2,  2,  2],  # The origin or the source nodes of the link
    target = [ 3,  4,  5,  3,  4,  5,  3,  4,  5],  # The destination or the target nodes of the link
    value  = [39, 41, 33, 38, 32, 18, 27, 14, 17],  # The width (quantity) of the links
    # Color of the links
    # Target node: 3-Gold          4-Silver        5-Bronze
    color = ["lightgreen",   "lightgreen",   "lightgreen",     # Source node: 0 - United States of America
             "lightskyblue", "lightskyblue", "lightskyblue",   # Source node: 1 - People's Republic of China
             "bisque",       "bisque",       "bisque"],        # Source node: 2 - Japan
    label = LINK_LABELS,
    hovertemplate = "%{label}",
)

data = go.Sankey(node=NODES, link=LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country &  Medals", font_size=16)
fig.update_traces(valueformat='3d', valuesuffix=' Medals', selector=dict(type='sankey'))
fig.update_layout(hoverlabel=dict(bgcolor="lightgray", font_size=16, font_family="Rockwell"))
fig.show()

Sankey diagram – with improved hover labels
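The link labels could carry a little more context, such as each link's share of its country's medals. The sketch below prepares that text in plain Python. It assumes that the Sankey link accepts a customdata list and that %{label}, %{value} and %{customdata} are available in the link hovertemplate; the `LINK_SHARES`, `HOVER` and `link_extras` names are illustrative.

```python
# Medal counts quoted in the article (Gold, Silver, Bronze per country).
countries = {"USA": [39, 41, 33], "China": [38, 32, 18], "Japan": [27, 14, 17]}
medal_types = ["Gold", "Silver", "Bronze"]

# Same labels as the nested loop in the article, written as a comprehension.
LINK_LABELS = [f"{country}-{medal}" for country in countries for medal in medal_types]

# Share of the country's medals carried by each link, e.g. "34.5%".
LINK_SHARES = [
    f"{count / sum(counts):.1%}"
    for counts in countries.values()
    for count in counts
]

# <extra></extra> is intended to hide the secondary hover box.
HOVER = "%{label}: %{value} (%{customdata} of the country's medals)<extra></extra>"

# Extra entries to merge into the link dict before building the figure.
link_extras = dict(label=LINK_LABELS, customdata=LINK_SHARES, hovertemplate=HOVER)

for label, share in zip(LINK_LABELS, LINK_SHARES):
    print(label, share)
```

These entries could be merged into the link dict with LINKS.update(link_extras) before the dict is passed to go.Sankey.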

Generalize for multiple nodes and levels

Nodes are referred to as source and target with respect to a link. A node that is a target for one link can be a source for another.

The code can be generalized to handle all the countries in the dataset.
We can also extend the diagram to another level to visualize the total number of medals across the countries.
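One possible way to generalize is sketched below. It builds the node and link lists for any number of countries and adds an extra level, assuming a single root node (here called "Total Medals") that feeds each country with its medal total. The function name `build_sankey_lists` and the root-node layout are assumptions made for illustration, not the author's implementation.

```python
def build_sankey_lists(medals, medal_types=("Gold", "Silver", "Bronze"),
                       root="Total Medals"):
    """Build labels and link lists for root -> country -> medal type.

    medals maps a country name to its counts, in the order of medal_types.
    """
    labels = [root, *medals, *medal_types]
    node_id = {label: i for i, label in enumerate(labels)}

    source, target, value = [], [], []
    for country, counts in medals.items():
        # Level 1: the root feeds each country with its total medal count.
        source.append(node_id[root])
        target.append(node_id[country])
        value.append(sum(counts))
        # Level 2: the country is now a source for each medal type.
        for medal, count in zip(medal_types, counts):
            source.append(node_id[country])
            target.append(node_id[medal])
            value.append(count)
    return labels, source, target, value


# Counts quoted in the article; more countries can be added to the dict.
medals = {
    "United States of America": [39, 41, 33],
    "People's Republic of China": [38, 32, 18],
    "Japan": [27, 14, 17],
}
labels, source, target, value = build_sankey_lists(medals)
print(labels)
print(source)
print(target)
print(value)
```

The returned lists can be used as before: labels goes into the node dict, and source, target and value go into the link dict passed to go.Sankey. Each country node is a target of the root link and a source of its three medal links.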

End Notes

We saw how Sankey diagrams can be used to represent flows effectively and how the plotly Python library can be used to generate Sankey diagrams for a sample dataset.

About the author

Sreedevi Gattu

A technical architect who also loves to break complex concepts into easily digestible capsules! Currently, finding my way around the fascinating world of data visualizations and data storytelling!

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion
