Sreedevi Gattu — November 18, 2021

This article was published as a part of the Data Science Blogathon.

 

Introduction to Sankey Diagram for Data Visualization

Very often, we are in a situation where we would have to visualize how data flows between entities. For example, let’s take the case of how residents have migrated from one country to another within the UK. Here, it would be an interesting analysis to see how many residents have migrated from England to say Northern Ireland, Scotland, and Wales.

Image source

 

From this Sankey diagram visualization, it is apparent that more residents have migrated from England to Wales than to Scotland or to Northern Ireland.

 

What is a Sankey diagram?

Sankey diagrams typically depict the flow of data from one entity (or node) to another.

The entity from/to where data flows is referred to as a node – the node where the flow originates is the source node (e.g. England on the left-hand side) and where the flow ends is the target node (e.g. Wales on the right-hand side). The source and target nodes are often represented as rectangles with a label.

The flow itself is represented by a straight or curved path called the link. The width of the link is proportional to the quantity of the flow. In the above example, the flow (i.e. migration of residents) from England to Wales is wider than that from England to Scotland or Northern Ireland, indicating that more residents migrated to Wales than to the other countries.

The Sankey diagrams can be used to represent the flow of energy, money, costs, anything that has a notion of flow.

Minard’s classic diagram of Napoleon’s invasion of Russia is perhaps the most famous example of the Sankey diagram. This visualization using the Sankey diagram displays very effectively how the French army progressed (or dwindled?) on its way to Russia and back.

sankey diagram

Image source

Now, let’s see how we can use Python’s plotly library to plot a Sankey diagram.

 

How to plot a Sankey diagram?

For plotting a Sankey diagram, let’s use the Olympics 2021 dataset. This dataset has details about the medals tally – country, total medals, and the split across the gold, silver, and bronze medals. Let’s plot a Sankey diagram to understand how many of the medals a country won are Gold, Silver, and Bronze.

 

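import pandas as pd

# Load the medal tally (reading .xlsx files requires the openpyxl package)
df_medals = pd.read_excel("data/Medals.xlsx")
df_medals.info()  # info() prints its summary directly and returns None
# Give the columns descriptive names and drop the ones we do not need
df_medals.rename(columns={'Team/NOC':'Country', 'Total': 'Total Medals', 'Gold':'Gold Medals', 'Silver': 'Silver Medals', 'Bronze': 'Bronze Medals'}, inplace=True)
df_medals.drop(columns=['Unnamed: 7','Unnamed: 8','Rank by Total'], inplace=True)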

df_medals


A basic plot

We will use the Sankey trace from plotly’s graph_objects (go) interface, which takes 2 main parameters – node and link.

Note that all the nodes – source and target should have unique identifiers.

In this case,

  • the Source would be the country. Let’s consider the top 3 countries (which are the USA, China and Japan) as the source nodes. Let’s mark these source nodes with the following (unique) identifiers, labels and colours
    • 0: United States of America: green
    • 1: People’s Republic of China: blue
    • 2: Japan: orange
  • the Target would be the Gold, Silver and Bronze medals. Let’s mark these target nodes with the following (unique) identifiers, labels and colours
    • 3: Gold: gold
    • 4: Silver: silver
    • 5: Bronze: brown
  • the Link (between the source and target nodes) would be the number of medals of each kind (Gold, Silver, Bronze). From each source, we will have 3 links originating and each one ending in the target – Gold, Silver and Bronze. So we will have a total of 9 links. The width of each of the links should be the number of Gold, Silver and Bronze medals. Let’s mark these links with the following source to target, values and colours
    • 0 (USA) to 3,4,5 : 39, 41, 33
    • 1 (China) to 3, 4, 5 : 38, 32, 18
    • 2 (Japan) to 3,4,5 : 27, 14, 17

We will need to instantiate 2 python dict objects to represent the

  • nodes (both source and target): with labels & colors as individual lists and
  • links: source node, target node, value (width), and the color of the links as individual lists

and pass these to the Sankey trace of plotly’s go interface.

Each index of the lists – label, source, target, value, and color – corresponds to one node or link respectively.
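
Because the lists are parallel, it can help to zip them into one record per link before plotting. The following is a minimal sketch in plain Python (no plotly needed); the helper name link_records is hypothetical and not part of plotly.

```python
# Node labels; a node's identifier is simply its index in this list.
labels = ["United States of America", "People's Republic of China", "Japan",
          "Gold", "Silver", "Bronze"]

# Parallel lists: position i in each list describes link i.
source = [0, 0, 0, 1, 1, 1, 2, 2, 2]
target = [3, 4, 5, 3, 4, 5, 3, 4, 5]
value = [39, 41, 33, 38, 32, 18, 27, 14, 17]


def link_records(labels, source, target, value):
    """Return one (source label, target label, value) tuple per link."""
    if not (len(source) == len(target) == len(value)):
        raise ValueError("source, target and value must have the same length")
    return [(labels[s], labels[t], v) for s, t, v in zip(source, target, value)]


records = link_records(labels, source, target, value)
for src, tgt, val in records:
    print(f"{src} -> {tgt}: {val}")
```

Reading the links this way makes it easy to spot a value that has slipped out of position in one of the lists.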

import plotly.graph_objects as go

NODES = dict(
    # index:  0                           1                             2        3       4         5
    label=["United States of America", "People's Republic of China", "Japan", "Gold", "Silver", "Bronze"],
    color=["seagreen",                 "dodgerblue",                 "orange", "gold", "silver", "brown"],
)
LINKS = dict(
    source=[ 0,  0,  0,  1,  1,  1,  2,  2,  2],  # The origin or the source nodes of the link
    target=[ 3,  4,  5,  3,  4,  5,  3,  4,  5],  # The destination or the target nodes of the link
    value =[39, 41, 33, 38, 32, 18, 27, 14, 17],  # The width (quantity) of the links
    # Color of the links
    # Target node: 3 - Gold   4 - Silver      5 - Bronze
    color=["lightgreen",   "lightgreen",   "lightgreen",    # Source node: 0 - United States of America
           "lightskyblue", "lightskyblue", "lightskyblue",  # Source node: 1 - People's Republic of China
           "bisque",       "bisque",       "bisque"],       # Source node: 2 - Japan
)
data = go.Sankey(node=NODES, link=LINKS)
fig = go.Figure(data)
fig.show()


 

Sankey diagram – a basic plot

Here we have a very basic plot. But do you notice how the diagram is too wide and Silver appears before the Gold? Let’s adjust the position of the nodes and the width.

 

Adjust the position of nodes and width of the diagram

Let’s add the x and y positions for the nodes to explicitly specify the positions of the nodes. The values should be between 0 and 1.

 

NODES = dict( #           0                               1                          2        3       4           5
            label = ["United States of America", "People's Republic of China",   "Japan", "Gold", "Silver", "Bronze"],
            color = [                "seagreen",                 "dodgerblue",  "orange", "gold", "silver", "brown" ],
            x     = [                         0,                            0,         0,    0.5,      0.5,      0.5],
            y     = [                         0,                          0.5,         1,    0.1,      0.5,        1],)
data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country &  Medals",  font_size=16)
fig.show()

 

With this, we get a compact diagram:


 

Sankey diagram – node position adjusted
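
The x and y values above were picked by hand. With more nodes, the positions could be computed instead. Here is a minimal sketch, assuming we simply want the nodes in each column spaced evenly; the helper name spread and its default margins (0.05 and 0.95) are arbitrary illustrative choices, not plotly settings.

```python
def spread(n, lo=0.05, hi=0.95):
    """Return n evenly spaced positions between lo and hi (inclusive)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return [(lo + hi) / 2]
    step = (hi - lo) / (n - 1)
    return [round(lo + i * step, 3) for i in range(n)]


countries = ["United States of America", "People's Republic of China", "Japan"]
medals = ["Gold", "Silver", "Bronze"]

# Countries go in the left column (x = 0), medal types in the second column (x = 0.5),
# mirroring the hand-written x values used in the article.
x = [0.0] * len(countries) + [0.5] * len(medals)
y = spread(len(countries)) + spread(len(medals))

print(x)
print(y)
```

The resulting x and y lists can be dropped into the NODES dict in place of the hand-written ones; the exact placement may still need tuning by eye.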

 

See below how the various parameters passed in the code map to the nodes and links in the diagram

 


Sankey diagram – how code maps to diagram

 

Add meaningful hover labels

The plot is interactive. You could hover on the nodes and the links for more information.


Sankey diagram – with default hover labels

 

Currently, the information displayed in the hover labels is the default text. When you hover on the

  • nodes, the node name, the number of incoming flows, the number of outgoing flows and the total value are displayed. For instance,
    • node United States of America has a total of 113 medals (= 39 Gold + 41 Silver + 33 Bronze)
    • node Gold has a total of 104 medals (= 39 from the USA, 38 from China, 27 from Japan)
  • links, the source node name, the target node name and the value of the link are displayed. For instance, the link from the source node USA to the target node Silver has 41 medals.
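
The node totals shown on hover are just the sums of the link values entering or leaving each node. As a quick sanity check, they can be recomputed from the same lists used for the plot; this is a small sketch in plain Python, and the variable names outgoing and incoming are illustrative only.

```python
from collections import defaultdict

labels = ["United States of America", "People's Republic of China", "Japan",
          "Gold", "Silver", "Bronze"]
source = [0, 0, 0, 1, 1, 1, 2, 2, 2]
target = [3, 4, 5, 3, 4, 5, 3, 4, 5]
value = [39, 41, 33, 38, 32, 18, 27, 14, 17]

outgoing = defaultdict(int)  # total value leaving each source node
incoming = defaultdict(int)  # total value arriving at each target node
for s, t, v in zip(source, target, value):
    outgoing[labels[s]] += v
    incoming[labels[t]] += v

print(outgoing["United States of America"])  # 39 + 41 + 33 = 113
print(incoming["Gold"])                      # 39 + 38 + 27 = 104
```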

 

Don’t you think the labels are too verbose? All these can be improved.

 

Let’s improve the format of the hover labels using the hovertemplate parameter

  • For the nodes, since the hover labels add nothing beyond what the diagram already shows, let’s turn them off by passing a blank hovertemplate = " "
  • For the links, we can make the label concise in the format <country>-<medal type>
  • For both the nodes and links, let’s have the values displayed with the suffix “Medals”, e.g. 113 Medals instead of just 113. This can be achieved by calling the update_traces function with an appropriate valueformat and valuesuffix.

 

NODES = dict( #           0                               1                          2        3       4           5
            label = ["United States of America", "People's Republic of China",   "Japan", "Gold", "Silver", "Bronze"],
            color = [                "seagreen",                 "dodgerblue",  "orange", "gold", "silver", "brown" ],
            x     = [                         0,                            0,         0,    0.5,      0.5,      0.5],
            y     = [                         0,                          0.5,         1,    0.1,      0.5,        1],
            hovertemplate = " ",)

 

# Build one "<country>-<medal type>" label per link, in the same order as the links
LINK_LABELS = []
for country in ["USA", "China", "Japan"]:
    for medal in ["Gold", "Silver", "Bronze"]:
        LINK_LABELS.append(f"{country}-{medal}")
LINKS = dict(
    source = [  0,  0,  0,  1,  1,  1,  2,  2,  2], # The origin or the source nodes of the link
    target = [  3,  4,  5,  3,  4,  5,  3,  4,  5], # The destination or the target nodes of the link
    value  = [ 39, 41, 33, 38, 32, 18, 27, 14, 17], # The width (quantity) of the links
    # Color of the links
    # Target Node: 3-Gold         4-Silver        5-Bronze
    color  = [ "lightgreen",   "lightgreen",   "lightgreen",      # Source Node: 0 - United States of America
               "lightskyblue", "lightskyblue", "lightskyblue",    # Source Node: 1 - People's Republic of China
               "bisque",       "bisque",       "bisque"],         # Source Node: 2 - Japan
    label = LINK_LABELS,
    hovertemplate = "%{label}",
)

 

data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country &  Medals",  font_size=16)
fig.update_traces( valueformat='3d', valuesuffix=' Medals', selector=dict(type='sankey'))
fig.update_layout(hoverlabel=dict(bgcolor="lightgray",font_size=16,font_family="Rockwell"))
fig.show()


Sankey diagram – with improved hover labels

 

Generalize for multiple nodes and levels

Nodes are referred to as source and target with respect to a link. A node that is a target for one link can be a source for another.

  • The code can be generalized to handle all the countries in the dataset.
  • We can also extend the diagram to another level to visualize the total number of medals across the countries.
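
One way to generalize is to build the node and link lists from the rows of the dataset instead of typing them by hand. The sketch below is not the author's code: the function name build_sankey_lists, the root label "Total Medals" and the layout (total, then country, then medal type) are assumptions made for illustration. It expects rows shaped like the renamed columns used earlier (Country, Gold Medals, Silver Medals, Bronze Medals) and uses only plain Python.

```python
MEDAL_COLUMNS = ["Gold Medals", "Silver Medals", "Bronze Medals"]


def build_sankey_lists(rows, root_label="Total Medals"):
    """Build node and link dicts for a two-level diagram: root -> country -> medal type.

    rows is a list of dicts with a 'Country' key and one key per MEDAL_COLUMNS entry.
    """
    countries = [row["Country"] for row in rows]
    medal_labels = [col.replace(" Medals", "") for col in MEDAL_COLUMNS]
    labels = [root_label] + countries + medal_labels
    index = {label: i for i, label in enumerate(labels)}  # label -> node id

    source, target, value = [], [], []
    for row in rows:
        country_id = index[row["Country"]]
        # Level 1: root node -> country, weighted by the country's medal total.
        source.append(index[root_label])
        target.append(country_id)
        value.append(sum(row[col] for col in MEDAL_COLUMNS))
        # Level 2: country -> medal type; links with no medals are skipped.
        for col, medal in zip(MEDAL_COLUMNS, medal_labels):
            if row[col] > 0:
                source.append(country_id)
                target.append(index[medal])
                value.append(row[col])
    return dict(label=labels), dict(source=source, target=target, value=value)


# Sample rows taken from the three countries used in this article.
rows = [
    {"Country": "United States of America", "Gold Medals": 39, "Silver Medals": 41, "Bronze Medals": 33},
    {"Country": "People's Republic of China", "Gold Medals": 38, "Silver Medals": 32, "Bronze Medals": 18},
    {"Country": "Japan", "Gold Medals": 27, "Silver Medals": 14, "Bronze Medals": 17},
]
nodes, links = build_sankey_lists(rows)
print(nodes["label"])
print(list(zip(links["source"], links["target"], links["value"])))
```

With the full dataset, the rows could come from df_medals.to_dict("records"), and the two dicts can then be passed to go.Sankey(node=nodes, link=links) as before. Here the country nodes act as targets for the first level and as sources for the second.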

 

sankey diagram

 

End Notes

We saw how Sankey diagrams can be used to represent flows effectively and how the plotly Python library can be used to generate Sankey diagrams for a sample dataset.

About the author

Sreedevi Gattu

A technical architect who also loves to break complex concepts into easily digestible capsules! Currently, finding my way around the fascinating world of data visualizations and data storytelling!

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion
