A Complete Beginner’s Guide to Data Visualization
What is data visualization?
“A picture speaks a thousand words.” Similarly, a visual or infographic can make it much easier to analyze data and spot hidden patterns. This guide covers the basics of visualization. I have included an example using Haberman’s Cancer Survival Data to show how visuals can reveal patterns that numbers alone fail to show. Let’s get started!
Why visualize data?
Data visualization is a way to tell a story with your data. When data is complex and understanding the fine details is essential, visuals are often the best way to analyze it.
Visuals can be used for two purposes:
1. Exploratory data analysis: This is used by data analysts, statisticians, and data scientists to better understand data. As the name suggests, it is used to explore hidden trends and patterns in data.
2. Explanatory data analysis: Once the analysts understand the data and find their results, the best way to convey their ideas and findings is through visuals! This is used to craft a story that appeals to the viewer and offers deeper insights.
Exploratory analysis of Haberman’s Survival Data
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
The attributes include:
- Age of patient at the time of operation (numerical)
- Patient’s year of operation (year – 1900, numerical)
- Number of positive axillary nodes detected (numerical)
- Survival status (class attribute)
1 = the patient survived 5 years or longer
2 = the patient died within 5 years
Let’s start by using summary statistics to understand the data.
We see there are 306 rows and 4 columns. Looking at the summary statistics of each attribute, we get a sense of how the data is distributed. To find out how many examples of each class we have, we can use a bar chart.
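The loading, counting, and bar chart steps described above could look like the following. This is a minimal sketch, not the exact code behind the original figures: it assumes the data is available as headerless CSV lines with the four attributes in the order listed earlier, and the column names and the helpers load_haberman, class_counts, and plot_class_counts are illustrative choices. matplotlib is assumed for the chart and is imported only inside the plotting helper.

```python
import csv
from collections import Counter

# Assumed column names; the raw file is assumed to have no header row.
COLUMNS = ("age", "year", "nodes", "status")


def load_haberman(lines):
    """Parse headerless CSV lines into a list of dicts with integer values."""
    reader = csv.reader(lines)
    return [dict(zip(COLUMNS, map(int, row))) for row in reader if row]


def class_counts(rows):
    """Count patients per survival status (1 = survived, 2 = died)."""
    return Counter(row["status"] for row in rows)


def plot_class_counts(counts):
    """Draw one bar per class and return the matplotlib Axes."""
    import matplotlib.pyplot as plt  # imported here so counting works without it

    names = {1: "survived 5+ years", 2: "died within 5 years"}
    classes = sorted(counts)
    fig, ax = plt.subplots()
    ax.bar([names.get(c, str(c)) for c in classes], [counts[c] for c in classes])
    ax.set_xlabel("Survival status")
    ax.set_ylabel("Number of patients")
    ax.set_title("Patients per class")
    return ax
```

Assuming a local copy saved as haberman.csv (a hypothetical file name), open it with open("haberman.csv", newline="") and pass the file object to load_haberman. len(rows) then gives the row count, and class_counts(rows) gives the numbers that the bar chart displays.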
We see that the data is imbalanced, with more survivors than non-survivors. To examine the data further, let’s look at different plots.
Probability Density Function
The age distributions of the two classes overlap over a large range, roughly from 30 to 80 years.
In this plot, patients aged 20-40 appear more likely to survive, those aged 40-60 more likely not to survive, and those aged 60-80 about equally likely to survive or not. Patients over 80 appear less likely to survive.
Age alone cannot distinguish if a person will survive or not.
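To make the idea of a density plot concrete, here is a minimal sketch that uses a normalized histogram as a simple stand-in for a smoothed density curve. The helper names density_histogram and plot_age_density, and the choice of bin edges, are assumptions made for illustration. Each bar height is the share of patients in the bin divided by the bin width, so the bar areas sum to 1, which is what lets two classes of different sizes be compared on one chart.

```python
def density_histogram(values, edges):
    """Return one density per bin so that the bar areas sum to 1.

    Bins are half-open [left, right); the last bin also includes its right
    edge. Values outside the edges are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / (total * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]


def plot_age_density(ages_by_class, edges):
    """Overlay one normalized histogram per class and return the Axes."""
    import matplotlib.pyplot as plt  # only needed for drawing

    fig, ax = plt.subplots()
    for label, ages in ages_by_class.items():
        # density=True applies the same normalization as density_histogram
        ax.hist(ages, bins=edges, density=True, alpha=0.5, label=str(label))
    ax.set_xlabel("Age at operation")
    ax.set_ylabel("Density")
    ax.legend(title="Survival status")
    return ax
```

Where the two overlaid histograms have similar heights in the same bins, age does little to separate the classes, which is the overlap described above.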
Box plots summarize the distribution of the data and help spot outliers. Notice that survivors tend to have fewer positive nodes than non-survivors. Interesting, isn’t it? Also notice that even though the number of nodes is a more useful feature than age, the two classes still overlap.
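What a box plot draws can be computed by hand. The sketch below is an illustration under assumptions: the helper names box_summary and plot_nodes_boxplot are made up for this example, and outliers are flagged with the common 1.5 × IQR rule, which is also matplotlib’s default whisker setting.

```python
import statistics


def box_summary(values):
    """Return the quartiles and the points lying beyond 1.5 * IQR from the box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


def plot_nodes_boxplot(nodes_by_class):
    """Draw one box per class (dict: class label -> list of node counts)."""
    import matplotlib.pyplot as plt  # only needed for drawing

    labels = list(nodes_by_class)
    fig, ax = plt.subplots()
    ax.boxplot([nodes_by_class[k] for k in labels])
    # Boxes are placed at positions 1..N by default, so label those ticks.
    ax.set_xticks(range(1, len(labels) + 1))
    ax.set_xticklabels([str(k) for k in labels])
    ax.set_xlabel("Survival status")
    ax.set_ylabel("Positive axillary nodes")
    return ax
```

Comparing box_summary for the two classes gives the same reading as the plot: a lower median and a tighter box for one class, with the flagged outliers showing how far the tail stretches.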
From the scattered points, it appears that, irrespective of the year of operation, the patients with 0 nodes are survivors. Does this mean that 0 nodes ensures survival? See the violin plot!
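A question like this can also be checked directly against the rows before drawing anything. The following is a minimal sketch under assumptions: each row is taken to be a dict with integer "nodes" and "status" keys, and the helper names are hypothetical. The violin plot is drawn with matplotlib’s Axes.violinplot.

```python
from collections import Counter


def zero_node_counts(rows):
    """Count patients with 0 positive nodes, split by survival status."""
    return Counter(row["status"] for row in rows if row["nodes"] == 0)


def plot_nodes_violin(nodes_by_class):
    """Draw one violin per class (dict: class label -> list of node counts)."""
    import matplotlib.pyplot as plt  # only needed for drawing

    labels = list(nodes_by_class)
    fig, ax = plt.subplots()
    ax.violinplot([nodes_by_class[k] for k in labels], showmedians=True)
    # Violins are placed at positions 1..N by default, so label those ticks.
    ax.set_xticks(range(1, len(labels) + 1))
    ax.set_xticklabels([str(k) for k in labels])
    ax.set_xlabel("Survival status")
    ax.set_ylabel("Positive axillary nodes")
    return ax
```

If zero_node_counts returns a non-zero count for status 2, then 0 nodes does not guarantee survival, however the scatter plot looks.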
From the plot above, we see that there are non-survivors with 0 nodes! Violin plots let us view the distribution and the box plot in one visual. Useful, isn’t it?
There is so much we can learn from visuals. Visualize to understand. Visualize to explain your understanding. I have compiled a few tips and tools to get you started.
Data Visualization Tools
Tableau: Simple to use, effective, and secure. It is very popular and is used to pre-process and visualize data. Data sharing is also possible.
Microsoft Power BI: A data visualization platform focused on creating data-driven solutions for business problems. It is used to pre-process and analyze data and to share meaningful insights with ease.
MS Excel: This is the most common tool used by analysts to quickly handle, sort, pre-process, and visualize data.
Other tools include FusionCharts, Dash, Plotly, and QlikView.
Best practices and tips
Use a consistent coloring scheme for your visuals: While color adds meaning and beauty to a chart, it is often best to use colors for highlighting important details and not merely for attractiveness. Too many colors defeat the purpose of coloring, while a single color or too many shades of one color can confuse viewers. Also, take visually impaired viewers into consideration while designing visuals. Use colors intuitively. For example, for sentiment analysis, we can use green for positive emotions, red for negative emotions, and a neutral color such as gray for neutrals.
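One way to keep colors consistent is to define the mapping once and reuse it in every chart. The sketch below is only an illustration: the names SENTIMENT_COLORS and colors_for are hypothetical, and the green and red are taken from the Okabe-Ito palette, which was designed to stay distinguishable for viewers with common forms of color blindness. The gray is an added neutral tone, not part of that palette.

```python
# One shared mapping, reused across charts, keeps categories consistent.
SENTIMENT_COLORS = {
    "positive": "#009E73",  # bluish green (Okabe-Ito)
    "negative": "#D55E00",  # vermillion (Okabe-Ito)
    "neutral": "#999999",   # gray, added here as a neutral tone
}
FALLBACK_COLOR = "#000000"  # used for categories missing from the mapping


def colors_for(categories, palette=SENTIMENT_COLORS):
    """Return one color per category, in the order the categories are given."""
    return [palette.get(c, FALLBACK_COLOR) for c in categories]
```

With matplotlib, the returned list can be passed straight to a bar chart, for example ax.bar(labels, heights, color=colors_for(labels)).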
Make use of size, shape, and format to convey semantics: Using size and shapes such as circles and squares can add semantic meaning and thus help viewers absorb the data with ease. Also, notice that arranging the bars of a bar graph in order (for example, by rank in the case of ordinal data) often makes more sense than arranging them alphabetically or randomly.
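The ordering advice can be sketched in a few lines. The helper order_bars below is hypothetical: given an explicit ranking for ordinal categories it follows that ranking, and otherwise it sorts the bars by ascending height.

```python
def order_bars(heights_by_category, ordinal_order=None):
    """Return (labels, heights) in the order the bars should be drawn.

    With ordinal_order, categories follow that ranking; otherwise bars are
    sorted by ascending height.
    """
    if ordinal_order is not None:
        labels = [c for c in ordinal_order if c in heights_by_category]
    else:
        labels = sorted(heights_by_category, key=heights_by_category.get)
    return labels, [heights_by_category[c] for c in labels]
```

For instance, ratings of "low", "medium", and "high" read more naturally in that order than in the alphabetical order "high", "low", "medium".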
Use legends and text to annotate data properly: Use labels wherever required, but don’t clutter the graph with text. Use text wisely. Arrange the visual so that it is easy to grasp.
Use interactive plots: Race graphs and interactive plots add value and help viewers engage with the data in greater depth.
Remove junk from the chart: Remove unnecessary elements that may distract viewers. Don’t combine so many views in a single visual that it becomes difficult to comprehend. Choose axis scales that show the real picture.
Label the data: Label the data accurately. Don’t over-label. Make sure the labels are visible and oriented properly. Don’t add extra dimensions to a visual if they may distort how the data is perceived.
Craft a complete story: Focus on the bigger picture you are trying to capture. Do not provide inaccurate or misleading visuals. Use visual tools wisely so that they say more than text alone would.
Common mistakes to avoid while visualizing data
Using a visual when it might not be needed: If data can be communicated effectively with statistics, we don’t need to create visuals. Visuals make it easier to analyze what numbers cannot convey. Thus, choose wisely when to use a visual tool.
Are you really sure about what you are trying to convey?: Correlation does not imply causation. We need to ensure our results are backed up by proper research and experiments before jumping to causal conclusions.
Use of 3-D visuals: Ensure that the 3-D view does not hide a part of the data or distort the data. Use 3-D graphics with utmost care. Don’t add orientations that may fool the viewer and destroy the purpose of visualization.
Where to look for more resources & courses?
There are a lot of courses, blogs, and books out there to help us understand visualization in depth.
For wonderful blogs: see the list at https://www.tableau.com/learn/articles/best-data-visualization-blogs. Visualising Data and Reddit are my favorites.
For free courses: Tableau provides free courses for data visualization that are a must-do. Kaggle also has free courses for basic data visualization with hands-on exercises. There are several courses available on Analytics Vidhya, Coursera, Udemy, and Udacity as well.
For books: Refer to this curated list of books https://www.tableau.com/learn/articles/books-about-data-visualization. My favorites are The Visual Display of Quantitative Information by Edward Tufte and Storytelling with Data by Cole Nussbaumer Knaflic.
As producers of data, we need to ensure that we display the right information at all times. Manipulating consumers into seeing only what we want them to see must be avoided at all costs.
As consumers of data, we need to view each visual critically to ensure that we see beyond what the visual persuades us to see.
I hope you enjoyed the content. For any queries, you can reach out to me at [email protected] or drop a comment below.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.