What does not get discussed about analysis: Chaos to order
This article was published as a part of the Data Science Blogathon.
How can you ensure that the many lines of code you write meet the objective? The critical phases of analyzing data are after you have the dataset and before you set up the environment, to write codes. There is a science to what data we have access to and will it suffice to answer questions the business is seeking. That is a discussion for another time. Here we will focus on what happens once data is available.
An effort that is critical to the success of any analysis exercise is to organize thought.
Organizing data and thought
This is the phase before we begin to set up the environment to run the codes. The analysis begins much before we start writing the code. The process begins with understanding the question. Map it against the libraries of Python, Mathematics, statistics, color, language. That is not a comprehensive but a good starting point.
Let’s understand this with an example.
“The learning curve is an upward sloping line with a positive delta”
That line will be understood differently by each reader here. Take a pause and note your thoughts. Think about what will be an appropriate visual description, other than words. You are welcome to share your thoughts with Analytics Vidya and the author even before you complete reading. you can write at [email protected].
There are many layers rolled up in that statement. Towards the end of the article, we will enlist them. Here we visit how we organize thought and the impact of color on visualizations.
Understanding Data sets
A data set (or dataset) is a collection of data. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable. 
Tabulating is the first step towards organizing thoughts. It is a representation of something along rows and columns. Tabulation works great for numbers. Markdown cells are a good way to tabulate verbal descriptions.
Let’s look at the underlying data. Tabulating and quantifying aid understanding.
The temptation as a data scientist would be to say: value 1 performance at 25. If your first reaction was; why, who said so, can we make any conclusion basis this? Well, congratulations you are on your way to be a data scientist!
Visualization: Learning Curve: Line Graph
Visualization is a visual description of thought. Nonverbal descriptions are graphs.
Be aware that visualization of a learning curve need not always be a line graph. Nor does it have to be upward sloping. Data Science encourages us to build. Share your thoughts on other visualization possibilities of the learning curve.
Understanding color as a dimension
No matter how detailed the visualization is, if presented on a paper or screen it will be a 2-dimensional. Colour will add the third dimension.
Some of the simplest visualizations around our bar graphs. They are simple to decipher. They are simple also because they breed familiarity. Tally marks and bar graphs go back at least 10 years for most of us. Let us use Bar Graphs to understand color as a dimension.
The dominant features above can be presented as follows:
No matter how detailed the visualization is, if presented on a paper or screen it will be a 2-dimensional. All conversations will be between the x-axis and the y-axis. Colour will add the third dimension.
Layers to analysis
Preconceived notions and cultural influences add depth to data analysis. Data and Datum can be numbers, words, and expanding to be pictures and even voice. So everything that affects anything is data.
If you can organize thought, then you will be able to identify the variables. Each variable will be the heading of a column in the table. All readings together (if numeric) will be distributed. Data science will then become an exercise of understanding the interplay between variables.
Let’s enlist the various layers rolled up in the statement.
“The learning curve is an upward sloping line with a positive delta”.
- It is verbal.
- In Python, I will need functions that work with string.
- For a machine learning model, convert words to numeric form.
- Graphs will help us visualize.
- Mathematics will help quantify.
- Statistics will help make sense. What is the average performance? Can I add performance and learning effort? If the range of performance values were from 0- 100 but learning effort values ranged from 0 -1000. Then can I still use a line graph?
- Cultural influences: make a radial graph of the same dataset. Think about if it will make sense to the reader. Will the reader have the motivation to invest the effort needed to read a complex graph?
- Upward sloping line and positive delta: technical description.
In conclusion: organizing thought forms the bedrock of data analysis. Tabulation helps organize thought. Visualization will generate insights about variables. In the journey of data, scientists enlist as many layers. Explore the interplay between variables. Enjoy the journey! Share your thoughts with the author at