The DataHour Synopsis: Traversing Journey of an Analytics Problem
Overview on Analytics Problem
Analytics Vidhya has long been at the forefront of imparting data science knowledge to its community. To make learning data science more engaging for the community, we began a new initiative: “DataHour”.
DataHour is a series of webinars by top industry experts where they teach and democratize data science knowledge. On 30th May 2022, we were joined by Amitayu Ray for a DataHour session on “Traversing the Journey of an Analytics Problem.”
Amitayu Ray is a data science leader with 16 years of experience in analytics consulting, AI solution design, client management, and business development across industry practice areas. He helps telecom and media giants around the world realize value from large-scale AI/ML implementations, and he empowers organizations with out-of-the-box AI solutions to solve modern, constantly evolving problems.
Amitayu is currently a Senior Manager in Applied Intelligence, Strategy and Consulting at Accenture. He works within strategy and consulting, where he leads analytics consulting, data practice, and AI and machine learning enablement for the North America geography.
Are you excited to dive deeper into the world of data science and machine learning? We’ve got you covered. Let’s get started with the major highlights of this session: Traversing the Journey of an Analytics Problem.
With this session, you’ll learn:
What are the objectives of this session? This session is a data science journey told as a story. It takes a problem from its business form to an industrialized, value-based solution that you deliver to your business. That is the focus.
The session looks at what kind of value we realize from an analytics problem. It covers the processes associated with a typical analytics problem and its solution, the evolution of data science over the years, and where we are heading. It shows how to translate a business problem into an analytical form, and it outlines the key responsibility areas for the different roles that exist today and are still evolving.
What does this session not cover? It is not a deep dive into AI/ML algorithms, and it is not a technical session on data engineering or big data cloud platforms. It is a process-oriented session to help you understand the end-to-end journey and where exactly your work fits in. It does not discuss model performance parameters, feature engineering, and so on; the discussion stays at a high level without getting into those details. It also does not cover the agile approach to delivering AI/ML projects.
Some foundational Updates from Data Science Industry
This is important because, to understand a journey, you need some basic foundational updates. For example, this session had a group of 250 individuals. If we plotted them on a maturity curve, some would already be advanced in analytics and some would be just starting out. So the session was kept as generic as possible, and the presenter narrated the journey in a way that would be clear to everyone.
How Has Data Science Evolved over the Years?
So, let’s have a quick roundup of how data science has evolved over the years.
In the 1990s, Analytics 1.0 emerged. This is when BI emerged for the first time and data started becoming a very important asset, but it lived mostly in Excel, VBA, and similar tools. From a business perspective, whoever consumed it was just looking at manual reports: somebody opened an Excel file and built some pivots.
In the early 2000s, on the data side we saw data warehouses and ETL systems. On the reporting side we saw dashboards, Tableau, and Power BI. The dominant systems at that point were SAS and SPSS. The statistical models in use were regression models, time series, and clustering. Analytics was consumed through reports and dashboards that were semi-automated. Models were run manually, and feature engineering was done manually too.
At this point people started investing in building stronger data warehouses. As a result, Analytics 3.0 emerged in the 2010s. By this time people had already realized that data is an important asset, so big data had started to emerge and Hadoop had come into the picture. Dashboards such as Power BI were mostly automated by now.
ML models came into the picture for the first time in the late 2000s and eventually became a stronger option than SAS. Building high-end ML models started here, but the models themselves were not fully automated. You had to run models manually, check the results, the ROC curves, and the performance metrics, do model validation, and so on: a fully manual check.
Now we are in Analytics 4.0, or really somewhere between 4.0 and 5.0. This shift happened in 2015. The advancement of big data transformed the way we manage and compute data. All the big players, such as Amazon, Google, and Microsoft, started investing in integrated cloud platforms. They realized the tremendous profit potential of a cloud platform service that can do end-to-end data management, reporting, visualization, modeling, model implementation, and so on.
Web-based and automated reports, deep learning, AI, NLP, TensorFlow, simulation models, and AutoML became very common. Everybody started thinking about productionizing analytics models. The MLOps world came into existence around 2018.
Analytics 5.0 is where we are heading. 2025 is when we are possibly going to reach a very different dimension of analytics. By then, quantum computing and big data on cloud will be normal for most organizations, and visualizations will mostly become AR/VR visualizations, although Tableau and Power BI will still exist alongside them. We will also have production-ready AI implementations. All new and existing models will become end-to-end industrialized, i.e. automated end to end.
The “Must Have” AI knowledge Assets for tomorrow
The must-have and good-to-have skills are:
- Your ability to solve problems
- SQL and the fundamentals of querying your data
- Fundamentals of mathematics and statistical deduction
- The basic principles of computational algorithms
On top of these foundational skills, since this is a fast-evolving industry, the skills for the next three to five years are:
- Changing big data engineering frameworks: This is going to be a critical skill that will probably play a big role.
- Knowledge of cloud platform architecture: With AWS, Google Cloud, and Azure, almost an entire suite of analytics products is available on the cloud.
- Industry expertise: Build your capability in one domain.
- Explainable AI and ML methodologies: For almost a decade we have bypassed these questions from businesses by saying that all AI/ML models are black boxes. That is no longer going to hold, so we need approaches by which the model methodologies can be explained.
- Implementation of analytics and MLOps principles
One important takeaway: upskill and evolve consistently to stay relevant in the market.
Analytics Problem Journey
Typical industry problem we encounter
Let’s start from the problem in its very raw form. What does a problem look like?
- Industry problems are extremely vague. Many proposals and business meetings (with CTOs, CIOs, and CMOs) give you a high-level problem statement that might not make any sense.
- When you ask questions about that problem, you realize that there are too many unanswered questions.
Nobody is defining a clear outcome, so it’s your job to define that outcome.
- In many cases businesses do not give clarity around that issue.
So we, as analytics consultants and data scientists, need to have that clarity in our minds to be able to answer and address those questions.
Four Major Strategic Priorities
Businesses have four major strategic priorities. This is where they earn their bread and how they generate money.
- Enabling more revenue growth by employing different revenue-oriented strategies.
- Reducing and optimizing cost: Every business has operational and capital costs; the focus here is to optimize them.
- Improving or re-engineering processes: There might be many inefficiencies in a process. To reduce them, you need to re-engineer the process.
- Improve customer experience
A Customer journey from Industrial Point of View and Role of Data
Imagine yourself as a customer. How does the industry view you as a customer, and what does it do? This is the journey of a customer.
- Prospect Assessment: The industry tries to identify who the right customer is. Example: Zomato has a very robust prospect assessment engine. It helps decide whom to onboard, i.e. it sends very specific, niche messages to people who are the right customers for them to get them on board.
- Acquisition: How do you bring that customer on board?
- Onboarding and Engagement: How to engage more with the customer. Example: you see a campaign that Amazon offers, and the next moment you join Amazon. Amazon then makes sure that you use its services by sending you niche messages (e.g. asking your opinion about a product). That is where the engagement part comes in.
- Growth Marketing: Typically, while engaging with customers, they try to cross-sell or upsell something.
- Loyalty and Operations: On loyalty, if they figure out that you are also going to competitors such as Flipkart, they get that data and try to create differentiated services. On operations, they ensure that customer service is on its toes. Example: return requests.
- Churn and Retention: When you are not engaging enough with Amazon, or have just stopped using Amazon, they sense customer churn and figure out new things to retain you.
- Feedback and Social Listening: This is the feedback you personally provide, or that the industry gathers from online media (e.g. you posted on Twitter that you are not happy with Amazon’s service). On the basis of this, they try to improve.
- Personalization: They want to give you a niche, personalized service.
Analytics is applicable everywhere, from prospect assessment to acquisition, onboarding, growth, loyalty, and so on.
Importance of Outcome and Value/Impact driven by the Solution
Churn is industry agnostic, meaning it applies to any service-providing industry (banking, telecom, retail, consumer goods, e-commerce). Loyalty management and churn and retention are the two major functions associated with churn.
There are five major questions for any churn problem anywhere in the world.
- Why are they leaving?
- Who is most likely to leave?
- Whom does the business want to retain?
- What kind of actions should the business take to retain them?
- If the business has decided on an action, how should it target that action?
What we need to address here is: what is the impact or value that the business is trying to achieve by addressing this particular problem? The business approaches us, as data scientists, and asks for a fitting solution. The main reason behind churn analysis is the cost of acquisition: organizations typically do this kind of analysis because acquiring a customer is very expensive. So they need to retain customers, and they expect us, as data scientists, to be able to sort out:
- what is the potential reason for churn
- how to retain customers
- what could be done to increase profit
- how to increase revenue with continuity
What is the Probable Solution Here?
A. Prevent Revenue Loss
- Identify customers who are likely to churn
- Compute net value generated by a high-risk customer
- Retain high value – high risk customers with suitable retention offers
- Prevent revenue loss through retention.
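The steps under A can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's method: the customer records, the `churn_prob` scores, the `annual_value` field, and both thresholds are hypothetical, and in practice the probabilities would come from a churn model.

```python
# Hypothetical scored customers: churn_prob would come from a churn model,
# annual_value from billing data. All numbers are made up for illustration.
customers = [
    {"id": "C1", "churn_prob": 0.82, "annual_value": 1200.0},
    {"id": "C2", "churn_prob": 0.15, "annual_value": 3000.0},
    {"id": "C3", "churn_prob": 0.70, "annual_value": 150.0},
    {"id": "C4", "churn_prob": 0.65, "annual_value": 2500.0},
]

RISK_THRESHOLD = 0.6      # assumed cut-off for "high risk"
VALUE_THRESHOLD = 1000.0  # assumed cut-off for "high value"


def retention_targets(records, risk_cut, value_cut):
    """Return high-value, high-risk customers, largest expected loss first."""
    targets = [
        r for r in records
        if r["churn_prob"] >= risk_cut and r["annual_value"] >= value_cut
    ]
    # Expected revenue at risk = churn probability x customer value.
    return sorted(
        targets,
        key=lambda r: r["churn_prob"] * r["annual_value"],
        reverse=True,
    )


targets = retention_targets(customers, RISK_THRESHOLD, VALUE_THRESHOLD)
revenue_at_risk = sum(r["churn_prob"] * r["annual_value"] for r in targets)
print([r["id"] for r in targets], round(revenue_at_risk, 2))
```

Ranking by expected loss is one possible design choice; a business could just as well rank by value alone or by the cost of the retention offer.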
B. Optimize Campaign Cost
- Compute cost of retention campaign
- Estimate the total budget of retention campaign (Cost*No. of leads)
- Identify the right customer to whom this campaign needs to be sent
- Calculate the ROI of the campaign, by calculating net revenue saved v/s net cost incurred
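The budget and ROI arithmetic in B can be written out as a small sketch. The budget formula (cost times number of leads) is from the list above; the ROI formula used here, (revenue saved minus cost) divided by cost, is one common convention and an assumption on our part, as are all the input numbers.

```python
def campaign_budget(cost_per_lead, n_leads):
    """Total budget = cost per lead x number of leads."""
    return cost_per_lead * n_leads


def campaign_roi(net_revenue_saved, net_cost):
    """ROI as (revenue saved - cost) / cost; an assumed convention."""
    if net_cost <= 0:
        raise ValueError("net_cost must be positive")
    return (net_revenue_saved - net_cost) / net_cost


# Hypothetical inputs, for illustration only.
cost_per_lead = 5.0          # cost of one retention offer
n_leads = 2000               # customers targeted by the campaign
customers_retained = 120     # assumed campaign outcome
value_per_customer = 400.0   # assumed revenue saved per retained customer

budget = campaign_budget(cost_per_lead, n_leads)
revenue_saved = customers_retained * value_per_customer
roi = campaign_roi(revenue_saved, budget)
print(budget, revenue_saved, round(roi, 2))
```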
Hypothesis Driven Approach-The Most Effective Way of Problem Formulation
The hypothesis-driven approach is a proven approach that consulting and analytics firms have adopted for many years.
A hypothesis seeks to explain why something has happened, or what might happen, under certain conditions. Hypotheses are often written as if-then statements. Any hypothesis-driven approach works through the following major questions to solve the problem:
- What is my end goal?
- How do I reach there?
- What is the journey towards that end goal?
- What kind of input information is required along the journey?
- What are the key milestones that need to be delivered?
- Whatever I am doing today, for example building complex AI models, is the business able to consume it?
Next, we look at how to answer these questions and build the analytics.
Stages of an Analytical Problem Journey-the Analytics Solution Hierarchy
The stages of solving an analytical problem are:
- Problem Formulation: Build an issue tree, i.e. break the problem down into simpler blocks that are easy to understand. Make sure you follow a MECE (mutually exclusive, collectively exhaustive) approach, although it is hard to be 100% compliant.
- Solution deliveries: Build hypotheses from the hypothesis chart, define the analysis outcome expected from each of them, and then validate the hypotheses on data.
- Data Requirement: Identify the key attributes/features required for validation and testing, then the root variables and the sources from which they are collected. Lastly, assess data availability and accessibility.
- Analytical Approach: Perform hypothesis testing and either approve, disapprove, or iterate. Generate helpful insights from the testing. Combine these hypotheses and use them with AI/ML models, then test the accuracy and stability of these models. Lastly, explain the model output, whatever it is, with proof.
- Implementation roadmap: Scale and map is the end goal. Make sure that data and model pipelines have been built as part of MLOps. Then integrate them with the client’s cloud system and do SIT (system integration testing). Continue monitoring after implementation, and provide end-to-end enablement training to run operations.
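The hypothesis testing step under Analytical Approach can be illustrated with a small example. The sketch below assumes a hypothetical hypothesis ("customers who raised a complaint churn more often than those who did not") and made-up counts. It uses a two-proportion z-test, which is one common choice and not necessarily what the presenter used; the 5% significance level is also an assumption.

```python
import math


def two_proportion_z(churned_a, total_a, churned_b, total_b):
    """z statistic and two-sided p-value for a difference in churn rates."""
    p_a = churned_a / total_a
    p_b = churned_b / total_b
    pooled = (churned_a + churned_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


ALPHA = 0.05  # assumed significance level

# Hypothetical counts: group A raised a complaint, group B did not.
z, p_value = two_proportion_z(90, 300, 100, 700)
verdict = "supported" if (p_value < ALPHA and z > 0) else "not supported"
print(round(z, 2), verdict)
```

A hypothesis that comes out "not supported" is not necessarily discarded; as the stage description says, it may be refined and tested again.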
Example: How to Perform Issue Tree?
Hypothesis Framework – Build on these Issue Trees
For every problem, we build an issue tree, and on top of it we build a hypothesis framework.
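One way to picture an issue tree with hypotheses attached is as a nested structure whose leaves each carry one if-then statement. The tree below is a hypothetical example for the churn problem, not the one shown in the session; the branch names and hypotheses are invented for illustration.

```python
# A hypothetical issue tree: each node splits into sub-issues, and each
# leaf carries one testable if-then hypothesis.
issue_tree = {
    "issue": "Why are customers churning?",
    "children": [
        {
            "issue": "Price",
            "children": [
                {"issue": "Competitor offers are cheaper",
                 "hypothesis": "If a customer pays above the market rate, "
                               "then churn risk is higher."},
            ],
        },
        {
            "issue": "Service quality",
            "children": [
                {"issue": "Network problems",
                 "hypothesis": "If a customer sees frequent outages, "
                               "then churn risk is higher."},
                {"issue": "Support experience",
                 "hypothesis": "If complaints stay unresolved, "
                               "then churn risk is higher."},
            ],
        },
    ],
}


def collect_hypotheses(node, path=()):
    """Walk the tree depth-first; return (path, hypothesis) for each leaf."""
    path = path + (node["issue"],)
    children = node.get("children", [])
    if not children:
        return [(" > ".join(path), node["hypothesis"])]
    found = []
    for child in children:
        found.extend(collect_hypotheses(child, path))
    return found


hypotheses = collect_hypotheses(issue_tree)
for branch, hypothesis in hypotheses:
    print(branch, "->", hypothesis)
```

Keeping the path with each hypothesis makes it easy to see which branch of the problem a given test result speaks to.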
Key Roles Involved Across the Stages
The kinds of roles involved through the journey are:
It is mostly the business analysts and domain experts, with a very small representation of data analysts and data scientists, all sitting together and brainstorming how to translate the business problem into an analytical problem.
A business analyst or a typical consultant plays a big role here because they look at industry practices and benchmarks, and then come up with the hypotheses accordingly.
Data engineering is, as expected, the bulk of the work. Most of it comes from the data engineer; the data scientist and data analyst also play a small role in that space.
The analytical approach is where the model building, the insights, and the visualization all happen. The data scientist has a big role to play, as does the data analyst. The data engineer also has a significant role because they are the ones building up the data.
The ML engineer’s role is to implement these models. This is a new role coming up in the industry: people who implement models into an existing framework. It requires an understanding of models as well as of technical architecture.
Therefore, this particular role plays a big part in the solution implementation. Along with the ML engineer, the business analyst also plays a big part, because they are the ones who bring all the entities together to make the solution a success.
Data Engineering – The Power-house of Data Science Solutions
Purpose of Existence
- AI Data Foundation: Data engineering creates the foundation for all AI. It brings together data from multiple sources (structured or unstructured) and ensures the data is correlated and ready for consumption. Creating the foundation also includes legacy modernization: if you have an old platform like DB2, the data engineer’s job is to ensure that this legacy platform is modernized into a modern data architecture.
- Data Lake Mainstreaming: A data lake is where you store all your information. It is a consolidated view of all your data sources in a unified platform. The design of a data lake is followed by a data warehouse, such as a customer 360. A lot of compliance-related operations, such as data security and governance, also come under data lake mainstreaming.
- Data on cloud: In the next 10 years, all data will move to the cloud and there will be no on-premise systems left. Even smaller organizations are moving to the cloud. So you need foundational knowledge of AWS, GCP, etc., and of how they integrate with other applications.
- Data and analytics consumption: The data engineer’s role is evolving to cover the consumption of data and of analytics, dashboards, etc. They now ensure that data is in a state where it can be pulled automatically from the source system and fed into a data lake, and that reports can be created and published by pulling data from the data lake. Consumption means that the data engineering team enables this functionality end to end.
Why Do We Call Data Engineering the Power-house, and What Are Its Key Trends?
Data engineering is the powerhouse, or the mitochondria, of today’s data-driven world. Many of us have been data scientists for many years, but the data engineers are the ones who hold everything together. The data engineering team ensures that all the data flows and that there is no gap in the solution, which amounts to business self-service: the business does not have to do anything, because everything is enabled by the data engineers and data scientists.
Its key trends are:
- Cloud Deployments: The majority of organizations are leveraging cloud solutions to rapidly stand up analytics or operational environments.
- Governed Data Lakes: They are evolving into central sources of data for the enterprise, with data catalogues to search and shop for data capabilities.
- Rapid Insights Discovery: Investing in data exploration capabilities to identify patterns, trends and unknown opportunities.
- Business-Self Service: Greater use of search, SQL, NLP and self-service tools for intelligent data preparation, operational intelligence and visualization.
- Modern-Hybrid Architecture: Companies are leveraging several technology components for accelerating data movement.
- Smart-Data Management: Evolution of intelligent solutions to data management challenges, automation and learning based solutions to integration and data quality.
Key-Pillars to Data Engineering Engagement
Assessment of Existing Data Architecture: To address a churn problem, the first thing is to assess the architecture of your business:
- Get end-to-end knowledge of their data.
- Do a basic data discovery: for each hypothesis, what kind of data sources do I need?
- What kind of existing technology stack do they have?
- Is that data accessible, and how do I access it?
These are the initial steps of assessing the enterprise data architecture. Those who are a little advanced will know what a sandbox is. Setting up a sandbox and virtual workbench, and connecting the data sources and applications within the sandbox to the data warehouses and data lakes, is also part of the data engineer’s job.
Development of Analytical Data Record: For the churn problem we need to build an analytical data record, a customer-level data set that helps build models. The data engineers and data scientists build this data together. The tasks include data acquisition, data quality assessment, feature creation, integration of the data lakes and merging with the data warehouse, data dictionaries, and metadata management.
Deployment of Analytical Solution Framework: They create the data pipelines to automate the end-to-end solution (MLOps). They build scaling and automation into the data flow, ensuring that the code is configurable and deployable by using parameter-driven batch code. They also perform feature engineering based on use cases and apply metadata management.
What is a Customer360?
This is the output that the data engineer produces for modeling purposes: an analytical data record. A customer 360 is a view of the customer in which all possible attributes of a customer are brought together under one platform, such as financial usage, service products, channel, etc.
When we build these exhaustive customer records, they have thousands of features that cater to many different kinds of use cases, not just churn. Churn is just one of those.
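As a rough picture of what "bringing all attributes together" means, the sketch below merges a few source extracts into one record per customer. The source names (`billing`, `usage`, `support`), their fields, and the `customer_id` key are all hypothetical; a real customer 360 would be built in a data platform with far more sources and features.

```python
from collections import defaultdict

# Hypothetical source extracts, each keyed by customer_id.
billing = [{"customer_id": "C1", "monthly_bill": 49.0},
           {"customer_id": "C2", "monthly_bill": 79.0}]
usage = [{"customer_id": "C1", "data_gb": 12.5},
         {"customer_id": "C2", "data_gb": 40.0}]
support = [{"customer_id": "C1", "tickets_90d": 3}]


def build_customer_360(*sources):
    """Merge rows from several sources into one record per customer."""
    records = defaultdict(dict)
    for source in sources:
        for row in source:
            attrs = {k: v for k, v in row.items() if k != "customer_id"}
            records[row["customer_id"]].update(attrs)
    return dict(records)


customer_360 = build_customer_360(billing, usage, support)
print(customer_360["C1"])
```

Customers missing from a source simply lack those attributes here; how to fill such gaps (defaults, nulls, imputation) is a separate modeling decision.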
From Diagnostics to AI problem- How Does the Problem Evolve?
EDA (Exploratory Data Analysis) – Know as much as possible about the data at your disposal (Churn Problem)
- Product churn versus relationship churn: As a telecom customer, you may have many different products. If you decide to stop your postpaid service, that is a product churn. But if you stop using all of the services (TV, broadband, etc.), that is a relationship churn.
- Inactivity, or hard churn versus soft churn: Somebody who is inactive for a long time is often mistaken for a churned customer, because that person is not doing anything. We might predict that this person is not going to use the service any longer, but then the person suddenly comes back. How do you differentiate between the two? That is something you need to understand clearly.
- Dealing with returning customers: This is the same point. If a customer who has been inactive for a long time suddenly comes back, do you want to call that customer a churn?
- Voluntary versus involuntary churn: Assume you are a telecom provider and you decide to remove someone yourself because they have been a very bad customer. That is not a churn.
- Frauds and delinquents: Fraudsters carry out fraudulent activities. If they leave, do you really want to call them churn?
- and has to be clearly oriented with an outcome,
- and clearly quantified.
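The churn distinctions discussed above can be made concrete as a labeling rule. The sketch below is only one possible set of definitions: the field names, the order of the checks, and the 90-day inactivity threshold are assumptions for illustration, and each business would set its own.

```python
from datetime import date

INACTIVITY_DAYS = 90  # assumed threshold for treating inactivity as soft churn


def churn_label(customer, as_of):
    """Label one customer under an assumed set of churn definitions."""
    if customer["is_fraud"]:
        return "excluded_fraud"       # frauds are kept out of churn counts
    if customer["closed_by_provider"]:
        return "involuntary"          # provider-initiated, not counted as churn
    if customer["active_products"] == 0:
        return "relationship_churn"   # all services stopped
    if customer["cancelled_products"] > 0:
        return "product_churn"        # some products stopped, others active
    if (as_of - customer["last_activity"]).days > INACTIVITY_DAYS:
        return "soft_churn"           # inactive, but may still come back
    return "active"


# Hypothetical customer who cancelled one product but keeps two others.
example = {
    "is_fraud": False,
    "closed_by_provider": False,
    "active_products": 2,
    "cancelled_products": 1,
    "last_activity": date(2022, 5, 1),
}
label = churn_label(example, as_of=date(2022, 5, 30))
print(label)
```

Writing the definition down as code forces the questions in the list to be answered explicitly, for example whether a returning customer should ever have been labeled as churned.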
Generating Analytics Value through Implementation
Key Building Blocks for Value Realization
Start simple: start by delivering smaller pieces of value, and only then can you reach the goal.
Machine Learning Deployment Life Cycle