The DataHour Synopsis: How to Stay Relevant in the World of AI?
Overview on AI
Analytics Vidhya has long been at the forefront of imparting data science knowledge to its community. To make learning data science more engaging for the community, we began a new initiative: “DataHour”.
DataHour is a series of webinars by top industry experts where they teach and democratize data science knowledge. On 29th March 2022, we were joined by Anastasiia Molodoria for a DataHour session on “How to Stay Relevant in the Booming World of AI?”
Anastasiia has a strong math background and experience in predictive modelling, NLP (Natural Language Processing), data processing, and deep learning. She has successfully integrated ML, DL, and NLP solutions for retailers and product tech companies, with a focus on optimizing and automating routine daily tasks and increasing business efficiency.
Currently, she’s working at MobiDev as the Data Science Team Leader.
Are you excited to dive deeper into the world of Data Science and Machine Learning? We’ve got you covered. Let’s get started with the major highlights of this session: How to Stay Relevant in the Booming World of AI?
Introduction on AI
From this session, you’ll have two learnings:
- First, what are the most popular AI directions?
- Second, understanding where to start in order to successfully work in these areas.
Anastasiia covered these topics through business cases, to give a deeper understanding of the value of AI integration for solving real-world problems. This will also give you insight into the main steps needed to deliver an ML product to the client successfully, and into how to set the right expectations for the business when you don’t know the exact output of your ML research beforehand.
Prerequisites: Some basic understanding of Data Science.
So, let’s dive into the ocean of AI.
Starting AI Project: How to Provide the Right Estimates and Meet Expectations?
With a few basic examples, we’ll try to understand what estimates and expectations we need to meet to kick-start a project. So, let’s begin.
First, we need to figure out two things: PoC and MVP.
POC VS MVP
What exactly do these two terms mean, and why do we need them?
We need these to know:
- What business goal are we solving?
- Do we have any idea how to solve this task?
- What are the evaluation criteria?
- What technology should we use?
Now, let’s look at how PoC and MVP work.
POC – Proof of Concept
It takes an input (a task we need to solve) and helps us get the desired output.
With a PoC, you can:
- understand whether you have the capabilities to develop the solution
- make more accurate estimates
- understand the necessity of engaging 3rd-party developers (back-end, front-end, etc)
- have a great basis for an MVP solution
- show expertise in practice
MVP – Minimum Viable Product
An MVP strikes a balance between “minimum” and “viable”: it gives us an optimal set of features to start with.
For example, consider donuts. The market is flooded with donuts, and suppose you are a newcomer to this business. Only knowing how to make donuts will not benefit your business, because so many brands already exist. What you need in order to stand out is an idea that will boost your business and make it successful. That idea is the extra feature that you’ll add to your donut, and this is your MVP solution.
Now, let’s see how to build an MVP, and how not to, with another example.
Explanation of the example: The 1st way of making a car is not a minimum viable product, because if something goes wrong at step three, it is not possible to get the desired product. All our money, time, and effort go in vain.
The 2nd way, by contrast, is the right way to build a minimum viable product.
This was all about PoC and MVP individually.
Now let’s see when to use a PoC and when to use an MVP.
Here, we’ll answer a basic question: do we know where to start?
Case 1: If yes, the next question is whether the goal and all the steps are clear.
- If yes, we’ll choose MVP.
- If not, we’ll choose PoC.
Case 2: If no, and you don’t have any idea where to start, you’ll have to choose PoC.
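The two questions above can be written down as a tiny decision helper. This is only a sketch of the rule as described in the session; the function name and its parameters are made up for illustration.

```python
def choose_first_stage(know_where_to_start: bool, goal_and_steps_clear: bool) -> str:
    """Return "PoC" or "MVP" following the two questions in the text."""
    # Case 2: no idea where to start, so explore feasibility with a PoC.
    if not know_where_to_start:
        return "PoC"
    # Case 1: we know where to start; an MVP only makes sense if the goal
    # and all the steps are clear as well.
    return "MVP" if goal_and_steps_clear else "PoC"


print(choose_first_stage(True, True))    # MVP
print(choose_first_stage(True, False))   # PoC
print(choose_first_stage(False, False))  # PoC
```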
CRISP-DM Process (Cross-Industry Standard Process for Data Mining)
A data science project is not a sequential development. In a sequential project, such as developing a mobile application, we know what the next step will be and we get the desired result. Data science projects are iterative. For this type of project, we need to go through:
Business Understanding: Ask your client what they want to develop.
Data Understanding: After the client hands the data over to you, there might be a situation where you are not able to get insights from it. If so, connect with the client again to get a proper understanding of the data.
Data Preparation: This is one of the major parts of the work that a data scientist has to look into, whatever project you are handling.
Modelling: Select the model that fits your idea best. If you observe some glitch in the model, you can go back and look into the data preparation again, and then correct the model or build a new one.
Evaluation: Evaluate whether the model works. If it does, go for deployment. If not, you need to go back to business understanding.
Deployment: Deploy the AI project.
New Project: Specifics of AI Estimation
An AI project is different, yet we still need to estimate it somehow. With a mobile application, things are more or less clear: we have a specific operating system, we need to add some buttons, and so on. But how do you estimate an AI project properly if you don’t know the results in advance? You don’t know whether it will work or what accuracy you will get, but clients frequently ask for an estimate, so you still need to provide one.
A Few Recommendations for the Projects
Be sure that you can solve the core task. Start with PoC.
Why? If the idea is, for example, to develop a mobile application based on AI and the core task cannot be solved, there is no need to gather all the other developers at all, because the main functionality would not work. So if you are not sure, ask the client to start with a PoC; that is completely fine.
Make no commitment to specific numbers in metrics.
Why? Clients may ask you for something like an accuracy commitment. Don’t give one, because you don’t know the results in advance, and you can explain this to the client. What you can do in the first stage, for example, is select several models that you are going to try. These models have usually been tested on some open-source data sets, so you can take the reported metrics and share them with the client. Tell the client that you are not sure how the models will perform on their data, because you haven’t tried them yet and the model still needs to be developed. That is fine.
Provide the client with the project risks and limitations; this is crucially important.
Why? Imagine, for example, that you test your PoC on audio files and it looks good to you, so you are sure that it will work in production. But when it becomes a product, the data you see is completely different: there is a lot of background noise, and the approach does not work. It is better to write this down as a risk. If some output goes wrong, you can refer to this point, which you gave the client at the very beginning.
Explain this project flow to the client.
Describe the flow to the client, as we described it above, so that you are on the same page.
More decomposition and more understanding of how to achieve the goal.
The more understanding you have, the more confident you will be in your estimates and in achieving the goal. This is the main point here: the more details you have, the better you will reach your goal.
Clarify the necessity of runtime.
Runtime is important, but the client often doesn’t mention it. This is a true story, and we will cover it in more detail later. If the client doesn’t mention runtime, it doesn’t mean that it doesn’t matter, so it is better to clarify it on your own.
How to Deliver a Result to Client?
- Demo and visualization are the best options (Python/R visualization tools, Streamlit, Gradio, etc.).
- Even if you have a small subset of the client’s data, use it for the demo.
- Make sure that the client understands your point.
- Provide a report with all the details of your work.
Showing something abstract, such as a model applied to some open-source data, is one story. It is a completely different story when clients see insights on their own data, even if they did not provide you with a lot of it (e.g., three pictures). You will definitely get more trust from the clients. So even if you have a small amount of data, try to use it for a demo to the client.
We all work in the data science area, where it’s really easy to lose people: there are a lot of technical details, and often the client is not as technical as we are. So we have to describe complicated stuff in easy words. Make sure that the client understands your point. Ask whether it was clear, or paraphrase if needed; that’s fine. It is better to do this each time to avoid miscommunication in the future, which would be even worse.
And the last recommendation: you can provide a report to the client with all the details of your work. It’s a really great option, because when you finish your meeting you can share this report and the client can go through the details.
Popular AI directions Overview
The most interesting directions include optimization, computer vision, NLP, and time series.
Business case: Time Series Predictions
Let’s understand this with an example of a Cafe Chain Owner.
The main goal of this owner is to support and maintain the business. This owner came to you with two requests:
- He wants to know the number of products that will be sold.
- He wants to understand the performance of employees: who is a good performer and who is not so good.
As input data, the client provides you with a SQL database with a bunch of tables and the connections between them.
What is the Expected Solution?
- We can predict the number of products that will be sold.
- For the performance of employees, we can suggest developing some kind of employee rating based on sold products, tips, or something else we have in the data.
But this isn’t enough, try to think wider and deeper as a competent data scientist.
What Else Can You Suggest?
- Recommendations based on the best-selling products, so the client will know which products sell together in pairs. Employees can then apply cross-selling and increase revenue.
- You can suggest anomaly detection for employees. For example, in terms of employee analysis, some employees may be cheating, and we can try to detect this anomaly in our data.
- The popularity of products per hour and per day. This is a great input for the marketing strategy, for example to run advertisements on specific days, at weekends, or at lunchtime.
- You can suggest a dashboard, so the client will be able to see the data in real time.
- You can suggest clustering analysis too: with this data, we can split customers into groups and identify the behaviour inside these groups.
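As an illustration of the first suggestion, here is a minimal sketch of how products that sell in pairs could be counted. The orders are invented sample data (in the real case they would come from the client’s SQL tables), and plain co-occurrence counting is just one simple way to approach this, not necessarily what the presenter used.

```python
from collections import Counter
from itertools import combinations

# Hypothetical receipts: each inner list is one order from the cafe.
orders = [
    ["espresso", "croissant"],
    ["latte", "muffin"],
    ["espresso", "croissant", "juice"],
    ["latte", "croissant"],
    ["espresso", "croissant"],
]


def count_pairs(orders):
    """Count how often each unordered pair of products appears in one order."""
    pair_counts = Counter()
    for order in orders:
        # sorted(set(...)) removes duplicates and makes (a, b) equal to (b, a).
        for pair in combinations(sorted(set(order)), 2):
            pair_counts[pair] += 1
    return pair_counts


pair_counts = count_pairs(orders)
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)  # ('croissant', 'espresso') 3
```

The most frequent pairs are natural candidates for cross-selling prompts at the counter.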
Business Case: NLP
We’ll understand this with an example. There is a business owner who has a product and the main goal of this person is:
- to sell the product and
- provide customer support
Let’s imagine that customer support is provided via chats, because we need to have text somewhere in this business case. So this client came to you with two requests:
- Have an understanding of employee performance.
- Help in providing services more effectively.
What AI Solution Will You Propose?
We propose to the client:
- Sentiment Analysis: Based on the text, we can identify the emotion and tone of a conversation. So we can detect negative cases, try to understand which employees have more negative conversations, and build some analysis on these sentiments.
- Text Summarization: To summarize all conversations. For example, a customer comes to the chat support and refers to a problem that he or she raised before. She says, “I talked with someone a few days ago”, and the agent needs to go to the database, find this conversation, and read the ticket to understand it. This takes a lot of time, and there is nothing good in that. Summarization will definitely help to speed up this process.
- Keyword Detection: The idea is to detect keywords in conversations. Based on these keywords, we can apply tags to each conversation. For example, if some problem was already solved by one agent and another agent faces the same problem, it will be easy to find by tag, because the tags give us a search rule.
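The keyword idea can be sketched with nothing more than word counts: take the most frequent non-trivial words of each conversation as its keywords, then build a lookup from keyword to conversations. The chats, ticket ids, and the tiny stop-word list below are all invented for illustration; a real project would more likely use an NLP library and a proper keyword model.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "my", "is", "it", "to", "i", "and",
              "not", "after", "again", "does", "for", "was"}


def extract_keywords(text, top_n=3):
    """Return up to top_n most frequent non-stop-words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


def build_tag_index(conversations):
    """Map each keyword to the ids of the conversations tagged with it."""
    index = defaultdict(set)
    for conv_id, text in conversations.items():
        for tag in extract_keywords(text):
            index[tag].add(conv_id)
    return index


# Hypothetical support chats keyed by ticket id.
conversations = {
    101: "My invoice is wrong, the invoice total does not match the payment.",
    102: "The app crashes after login. Login fails again and again.",
    103: "Invoice shows the payment was charged twice.",
}

index = build_tag_index(conversations)
print(sorted(index["invoice"]))  # [101, 103]
```

An agent facing an invoice problem could then look up `index["invoice"]` and find earlier tickets on the same topic.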
How to Start?
Tabular (structured) data is data organized in rows and columns. We can say tabular data is a table that stores data of different types, whether boolean, numeric, or text. The tabular form makes it more efficient to get data ready for insights. With tabular data we usually deal with three main tasks, although the list is not limited to them: regression, classification, and clusterization.
Regression: This task is for predicting some specific number, like a price or the number of units sold.
Classification: This is based on assigning some class to the observation, for example disease detection or sentiment (whether it’s good or not).
Clusterization: Here we don’t know the classes in advance, but we want to group our data into some specific number of groups. Customer group detection, which we discussed previously, is an example: we group people by behaviour.
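Of the three tasks, regression is the easiest to show in a few lines. The sketch below fits a straight line by ordinary least squares using only the standard library; the weekly sales figures are made up, and in practice a library such as scikit-learn would normally be used instead of hand-written formulas.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: week number vs. units sold.
weeks = [1, 2, 3, 4]
units_sold = [3, 5, 7, 9]

slope, intercept = fit_line(weeks, units_sold)
prediction_week_5 = slope * 5 + intercept
print(slope, intercept, prediction_week_5)  # 2.0 1.0 11.0
```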
Data provided for AI Projects
It’s true that expectations and reality are different. We expect that there will be no missing data, all variables will be well known, the data will be clean, and our only goal will be to apply the model and tune it. But this is not reality. To get good insights from the data, we need to follow these steps:
Understanding the data: In a classical tabular data task, the first and most important step is data understanding. Everything unknown should become completely understandable at this stage. If something is not clear, ask the client, because if you are not sure about the data, it’s not possible to develop a really valuable model.
Data Cleaning and preparation: Get the data ready for modelling.
Feature engineering: Before the modelling step, feature engineering is a great and interesting step. You can generate new features, get new insights, and discuss them with the client.
Modeling: All the experiments are up to you, which makes this step really interesting. From the modelling step, you can go back to feature engineering or data cleaning again, because we are working with an iterative process, not a sequential one.
Evaluation: Validate the model on your data set. Make sure that you are not overfitting and that everything works as you expected.
How can this result be improved further?
- Look at the data from different perspectives
- Add 3rd-party data to enrich the model
- Try to get new insights from text features
NLP (Natural Language Processing)
This is basically the area of working with text. The presenter found some really great research on the NLP market, which is worth reading if you are interested. The main idea is the amount of investment in NLP as of 2020, with huge growth expected. Considering that the number of NLP projects we are working on is already quite large, this area will definitely grow. The research covers two key growth directions.
- First, cloud-based solutions such as AWS or GCP, which allow temporary use of an NLP solution.
- Second, the increasing usage of smart devices to facilitate smart environments.
What does it mean? We have all got used to Siri, for example. If you want to find something on YouTube when you turn on the TV, you don’t want to type it; you just want to say it, which is much easier. This is still NLP, and all of this will be developed more and more. NLP also has a lot of directions.
Popular NLP directions
How to Work with Text (NLP)?
First, we need to split the text, with or without punctuation. That is a different story, but the idea is to split the text and then assign a token to each unique value. Of course, there are different options for splitting and for tokenization. But the high-level idea is to convert the text to numbers, and then we can work with this sequence of numbers. So if you haven’t worked with text yet, I hope you now see that it’s not really complicated under the hood.
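The idea described above fits in a few lines of plain Python. This sketch uses a simple regular expression to split on anything that is not a letter, digit, or apostrophe, and numbers the tokens in order of first appearance; both choices are assumptions for illustration, and real tokenizers (for example subword tokenizers) work differently.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(tokens):
    """Assign an integer id to each unique token, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


text = "The cat sat on the mat. The mat was warm."
tokens = tokenize(text)
vocab = build_vocab(tokens)
encoded = [vocab[t] for t in tokens]

print(tokens)   # ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'mat', 'was', 'warm']
print(encoded)  # [0, 1, 2, 3, 0, 4, 0, 4, 5, 6]
```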
We have two approaches generally:
- Classical NLP
- NLP with deep learning (neural networks)
Tasks such as text classification and text similarity can be solved with both approaches. But for some tasks, even if we can solve them with classical approaches, the result may not be relevant in the real world.
Now the main focus is on deep learning, and models trained with neural networks produce really great results. For example, we can enrich such a model with our custom data, which is why this area is developing much more right now.
Intuitions for solving NLP
Let’s say the task is to generate summaries for a given input text.
- Training the NLP model from scratch is not really efficient in practice, because it takes a lot of time, a lot of money, and a lot of labelled data.
- Transfer learning and pre-trained models are a better choice. Pre-trained models are models that someone has already trained and that you can try to use in your case. Transfer learning means taking a pre-trained model, adding your data, and continuing the training, so as to enrich the already pre-trained model with your custom data. It’s one of the best choices right now in terms of NLP.
- The data set used for the pre-trained NLP model matters. For example, suppose you chose an NLP model trained on general text like Wikipedia, and your main task is in medicine. Most likely (say, 90 per cent of the time) it will not work and will not produce great results. So even if you are focusing on a pre-trained model, try to find one that was pre-trained on a relevant data set. New NLP directions will add more to the growth.
New NLP Directions
- Extractive Summary
- Abstractive Summary
- Image Captioning
Two years ago, the extractive summary approach was everywhere, while the abstractive one was really rare and did not produce good results. Right now the situation is completely the opposite. The core intuition behind the extractive approach is this: we split the text into sentences, score each sentence, rank them, and select the most relevant ones. But the output contains only original sentences, and they will most likely be out of context.
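The extractive intuition (split into sentences, score, rank, select) can be sketched as follows. The scoring rule here, the sum of document-wide word frequencies, is a deliberately naive assumption chosen for brevity; it favours long sentences, and real extractive summarizers use more careful scoring.

```python
import re
from collections import Counter


def extractive_summary(text, n_sentences=1):
    """Pick the n highest-scoring sentences, returned in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        # Score = sum of the document-wide frequencies of the sentence's words.
        return sum(word_freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return [sentences[i] for i in chosen]


text = (
    "The model predicts sales. "
    "Sales data comes from the cafe database and the model uses the sales history. "
    "Lunch is busy."
)
print(extractive_summary(text, n_sentences=1))
```

Note that the output is always a verbatim sentence from the input, which is exactly the limitation mentioned above.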
The abstractive summary approach can paraphrase and generate new sentences. So for an input of, say, 10 sentences, the output can be a single short, paraphrased sentence. This approach is now more popular and produces really great results, whereas several years ago it was the complete opposite.
Image captioning: The idea here is that we have an image as input and detect the objects in it. Then we generate a description in text. It can be useful for the automatic creation of text or descriptions. Another interesting use is photo annotation for blind people: we can convert the description to voice and describe to blind people what is shown in the picture.
Some Helpful Research/Resources for NLP :
- Hugging Face huggingface.co – pretrained models and the possibility for fine-tuning
- Github repositories: implemented solutions and new modules
- Python packages: spacy, nltk, gensim, etc.
- 3rd-party API and external platforms: AWS, GCP, etc.
Computer Vision on AI
Computer Vision is an area of AI for working with images and pictures.
Example: A classic computer vision task is to detect whether a cat or a dog is in the picture. We’ll look at this task with the help of computer vision directions.
How to Work with Image Data?
Pictures need to be converted to numbers, because we are working with numbers. When you read an ordinary JPG or PNG picture, behind the scenes it looks like a three-dimensional matrix: two dimensions are the height and width in pixels, and the third is the three RGB channels.
The number of channels can be different, but usually there are three: red, green, and blue. Most of the pictures that we look at on our phones contain these three layers, and the resulting colour of each pixel is based on the intensity of each channel.
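To make the height × width × channels layout concrete, here is a hand-written 2×2 RGB image as nested Python lists. The pixel values are invented; a library such as OpenCV would give the same kind of structure as a NumPy array (OpenCV’s `imread` returns the channels in BGR rather than RGB order).

```python
# A hypothetical 2x2 RGB image: image[row][col] = [R, G, B],
# each channel an intensity from 0 to 255.
image = [
    [[255, 0, 0], [0, 255, 0]],      # red pixel, green pixel
    [[0, 0, 255], [255, 255, 255]],  # blue pixel, white pixel
]

height = len(image)
width = len(image[0])
channels = len(image[0][0])
shape = (height, width, channels)
print(shape)  # (2, 2, 3)


def to_grayscale(img):
    """Average the three channels of each pixel (a simple, unweighted choice)."""
    return [[sum(pixel) // len(pixel) for pixel in row] for row in img]


gray = to_grayscale(image)
print(gray)  # [[85, 85], [85, 255]]
```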
But what if you don’t know anything about computer vision? In that scenario:
- Read image data, and look at the properties and data you have (OpenCV, etc).
- Review and understand neural network components: layers, activation functions, etc.
- Review and understand the logic behind popular NN architectures: ResNet, UNet, etc.
- Choose image classification as your first CV task.
- Write and train your first custom NN model on the ImageNet dataset.
- Use a pretrained model and apply a transfer learning approach.
What if we don’t have a dataset: the client doesn’t provide one and wants us to detect a person? What will you do here as a data scientist? Don’t say no to the client. You have three options:
- Try to find a suitable pre-trained model (if there is no data).
- Apply transfer learning (if you have a small amount of data).
- If possible, gather data, apply a custom model (new technologies), and give fruitful results to the client.
New Technologies in the world of AI
- LIDAR
- GANs
- Human Pose Estimation
LIDAR: It’s really interesting. Maybe some of you already have this camera on your mobile phone. This camera has a laser, and its output is not an ordinary picture like the ones we are used to: it is a picture with information about where the light is coming from, so a lidar gives plenty of information. This technology is highly developed right now.
GANs: It’s a really interesting neural network in computer vision. The main idea is that we have an input picture and can modify it: add a smile, or change race, hairstyle, or something else in the appearance. It can be useful in different areas, from generating new data samples for your data sets to photo editing and face animation.
Human Pose Estimation: The idea here is to detect key points. It can be useful for fitness apps, for example, to identify whether you are doing exercises properly. The use case can be studied further in the Human pose estimation guide.
DS Optimization for Big Data Solution
In this scenario, how do we meet this type of client expectation?
- Runtime is always important, even when the client doesn’t mention it.
- Think about speed during development; it is better to write optimized code from the start.
- Use GPU / CPU resources to the maximum.
- Multiprocessing is a great option for parallelization.
- Write the project with a pipeline approach; it will make life easier when your model is deployed.
A few things that you must consider when writing code for AI:
- AUXILIARY VARIABLES: Don’t create a lot of auxiliary variables, especially with heavy tables, because you will use a lot of RAM.
- ‘FOR’ LOOP: Avoid ‘for’ loops as much as possible. Try to use vectorized functions (apply(), etc.); where available, built-in column-wise operations are usually faster still.
- BIG INPUT TABLES: Read only the necessary columns. It will speed up reading and reduce RAM consumption.
- DATA TYPES: Use ‘lightweight’ data types as much as you can to speed up processing.
- SQL QUERIES: Write efficient SQL queries for getting data from the DB. It will dramatically speed up the runtime.
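A few of these tips can be illustrated with the standard library alone. The CSV content below is invented, and in a real project the same ideas would more likely be applied through pandas (for example, the `usecols` and `dtype` parameters of `pandas.read_csv`); this is only a sketch of the principles.

```python
import csv
import io
from array import array

# Hypothetical CSV export; in practice this would be a large file on disk.
raw = io.StringIO(
    "order_id,product,quantity,comment\n"
    "1,espresso,2,no sugar\n"
    "2,latte,1,\n"
    "3,espresso,3,to go\n"
)

# BIG INPUT TABLES: stream the rows and keep only the column we need.
# DATA TYPES: array("B") stores one unsigned byte per value (0..255)
# instead of a full Python int object per value.
quantities = array("B", (int(row["quantity"]) for row in csv.DictReader(raw)))

# 'FOR' LOOP: use a built-in aggregate instead of a hand-written loop.
total = sum(quantities)

print(total)                # 6
print(quantities.itemsize)  # 1 (bytes per stored value)
```

The same trade-off applies to table libraries: a narrow integer or categorical column takes a fraction of the memory of a generic object column.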
Conclusion on AI
I hope you have enjoyed the session and that you will stay relevant in the world of AI afterwards. The layman’s examples should also have complemented your learning. Good luck. Learn more, grow high.