Applying the Software Engineering Process for More Effective Data Science Projects
This article was published as a part of the Data Science Blogathon
An interesting title, isn’t it? I thought the same when the idea for this blog came to me.
If you are from a computer science background, you may already know about the software engineering process. If not, here are some basics.
What is Software Engineering?
IEEE, in its standard 610.12-1990, defines software engineering as:
Software engineering is the application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software.
In simple words,
Software Engineering is the process of analyzing user requirements and then designing, building, and testing a software application that satisfies those requirements.
And What is the Software Engineering process?
Software engineering does not prescribe one rigid process; it offers an approach to developing software. This process is divided into five main tasks: Communication, Planning, Modelling, Construction, and Deployment.
Communication: It is mainly about communicating with your customers to understand their requirements.
Planning: It is about planning the whole software development process.
Modelling: It is about creating models to better understand software requirements and the design that will achieve those requirements.
Construction: Here the actual code is written and tested.
Deployment: The completed software is delivered to the customer who evaluates the delivered product and provides feedback based on the evaluation.
These are the five basic tasks of the software engineering process. Some of these tasks may overlap.
Now you may be thinking
‘How does this help us in making more effective data science projects?’
I think professionals already apply this process, whether knowingly or instinctively.
So I am writing this mostly for my fellow students. But since learning never stops, we are all students for our entire lives, aren’t we?
But then, how can students use this process for their data science projects?
Well, let’s split the process into the same five tasks again, this time for a data science project.
Task 1. Communication:
As I stated above, this is mainly about gathering requirements. Here the requirements may come from customers, a supervisor, and so on. But say you simply have a dataset that you want to build a data science project on. Where are the requirements then?
I think the requirements are the questions you start the project with, such as: What do you gain from doing this project? How applicable is this project in the real world?
Then, what kind of challenges could you face during this task?
The most important and most difficult one (in my honest opinion) is understanding the business problem. It comes down to one simple question: have you really understood the business problem?
Solution for the challenge:
The most effective and the easiest solution for the above problem would be to just Speak!
- Speak with your supervisor, mentor, etc.
- Try and understand the business problems.
- Note down all the requirements.
- Ask questions and raise any doubts about the problem.
But doing just this isn’t enough!
Try to explain everything you have understood back to your mentor or supervisor. They can correct any misunderstandings, which helps you make a better data science project.
If you are doing the project solo and don’t have a mentor, speak with your friends. Ask for their inputs. You could also speak with your family and explain the problem to them. Take their help to gain a third-party perspective.
“Sometimes asking for help also means you are helping yourself.” – Renuka Pitre
So, now you have done Task 1. Let’s go to Task 2.
Task 2. Planning:
This step is mostly about planning your data science project. How much time will you need? What dataset do you require (if you don’t have one)? Will you use supervised, unsupervised, or reinforcement learning? Questions like these belong to this step.
So, what kind of challenges could you face during this task?
They mostly relate to the questions I stated above: how much time will you need, and what dataset do you require (if you don’t have one)?
Solution for the challenges:
For the first question (how much time will you need?), just plot a timeline chart. Decide how much time you are going to spend on data preprocessing, model evaluation, etc. Even a rough timeline chart helps because it gives you a rough deadline for completing your project.
This is most effective for the solo projects that students do (most of us take it easy, me included). It teaches us time management and how to use our time effectively. For job and internship seekers, it also shows recruiters that you can use your time effectively on projects.
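A rough timeline like the one described above can be sketched in a few lines of Python. This is only an illustration: the phase names follow the five tasks in this article, while the day estimates, the start date, and the function names `build_timeline` and `print_timeline` are made up for the example.

```python
from datetime import date, timedelta

# Hypothetical phases and day estimates; adjust them to your own project.
PHASES = [
    ("Communication", 2),
    ("Planning", 2),
    ("Modelling (preprocessing + EDA)", 7),
    ("Construction (model building)", 7),
    ("Deployment (results + write-up)", 3),
]


def build_timeline(start, phases):
    """Return a list of (phase, start_date, end_date) tuples, back to back."""
    timeline = []
    current = start
    for name, days in phases:
        end = current + timedelta(days=days - 1)  # inclusive end date
        timeline.append((name, current, end))
        current = end + timedelta(days=1)  # next phase starts the day after
    return timeline


def print_timeline(timeline):
    """Print a rough text chart with one '#' per planned day."""
    for name, begin, end in timeline:
        days = (end - begin).days + 1
        print(f"{name:<34} {begin} to {end} {'#' * days}")


timeline = build_timeline(date(2024, 1, 1), PHASES)
print_timeline(timeline)
```

The last end date in `timeline` is your rough deadline. If a phase overruns, change its estimate and rerun to see how the deadline moves.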
As for the second question, this is mostly for students who are doing solo projects. Data is freely available nowadays. You can probably find it on Kaggle or through Google, but take care to select the right dataset since there are many. If you cannot find one, I would suggest learning web scraping or asking a friend who knows web scraping for help. The second option also shows that you are willing to work in a team.
Now onto the next task.
Task 3. Modelling:
In this step, you prepare the data and gain insights from it. In short, data preprocessing and EDA (Exploratory Data Analysis) belong to this step.
Challenge and Solution?
There are really only two challenges: proper data preprocessing and EDA. For data preprocessing, be thorough, because the cleaner your data is, the better your model is likely to be.
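To make "thorough preprocessing" a bit more concrete, here is a minimal sketch of three common cleaning steps: tidying text, dropping exact duplicates, and filling missing numbers. The records and the `preprocess` function are invented for illustration, and the sketch uses only the Python standard library so it runs anywhere; in a real project you would more likely reach for a library such as pandas.

```python
from statistics import median

# Made-up raw records; None marks a missing value.
raw_rows = [
    {"age": 25, "city": " Mumbai ", "income": 30000},
    {"age": None, "city": "pune", "income": 42000},
    {"age": 25, "city": " Mumbai ", "income": 30000},  # exact duplicate
    {"age": 31, "city": "Pune", "income": None},
]


def preprocess(rows):
    """Tidy text, drop exact duplicates, and fill missing numbers with the median."""
    # 1. Copy each row and normalise the text column (whitespace and casing).
    tidy = [dict(row, city=row["city"].strip().title()) for row in rows]

    # 2. Remove exact duplicate rows while keeping the original order.
    seen, unique = set(), []
    for row in tidy:
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # 3. Fill missing numeric values with the median of the known values.
    for column in ("age", "income"):
        known = [row[column] for row in unique if row[column] is not None]
        fill = median(known)
        for row in unique:
            if row[column] is None:
                row[column] = fill
    return unique


clean_rows = preprocess(raw_rows)
for row in clean_rows:
    print(row)
```

Filling with the median is just one possible choice. Whether to fill, drop, or flag missing values depends on your data and on the business problem.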
EDA is less of a challenge, I guess. But as far as I have seen, a more in-depth EDA leads to a better data science project. Just a suggestion: you could do a basic, quick EDA in Excel, Tableau, or Power BI to understand some trends in the data, and a more in-depth one in Python or R.
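As a small taste of EDA in Python, the sketch below computes basic descriptive statistics for a column and the Pearson correlation between two columns. The data (hours studied versus exam score) and the helper names `summarize` and `pearson` are assumptions made up for this example, and only the standard library is used.

```python
from statistics import mean, median, stdev

# Made-up sample: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 74]


def summarize(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


print("hours:", summarize(hours))
print("scores:", summarize(scores))
print("correlation:", round(pearson(hours, scores), 3))
```

A correlation close to 1 or -1 hints at a strong linear relationship, which is useful to know before you pick a model in the next task.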
Now let us go to the main part of the project.
Task 4. Construction:
I am not going to say much about this one, since you may already have understood what it covers.
Your actual data science project happens in this step.
For example: which model are you using? How accurate should the model be?
What could be challenging in this step?
There are many challenges and errors in this step, but the most important one is choosing the right algorithm.
Most of the small challenges and errors can be resolved by googling them or searching Stack Overflow (which we all do when in doubt). The most important one, well, that depends. It mostly depends on the type of relationship between the features and the target variable in your data. It usually helps to try various models and find out which works best.
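Trying various models and keeping the one that works best can be sketched as follows. This is a toy illustration under stated assumptions: the data is made up, the two candidates (a mean baseline and a straight-line fit) are hand-written stand-ins for whatever models you would really compare, and the names `fit_mean`, `fit_line`, and `mean_absolute_error` are hypothetical helpers, not part of any library.

```python
from statistics import mean

# Made-up data with a roughly linear relationship plus a little noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 4.2, 5.9, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 19.9]

# Simple holdout split: first 7 points to fit, last 3 to evaluate.
train_x, test_x = xs[:7], xs[7:]
train_y, test_y = ys[:7], ys[7:]


def fit_mean(x, y):
    """Baseline model: always predict the mean of the training targets."""
    avg = mean(y)
    return lambda value: avg


def fit_line(x, y):
    """Least-squares straight line: y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    return lambda value: slope * value + intercept


def mean_absolute_error(model, x, y):
    """Average absolute difference between predictions and true values."""
    return mean([abs(model(a) - b) for a, b in zip(x, y)])


# Fit every candidate on the training data, score it on the held-out data.
candidates = {"mean baseline": fit_mean, "linear regression": fit_line}
errors = {
    name: mean_absolute_error(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best = min(errors, key=errors.get)

for name, error in errors.items():
    print(f"{name}: MAE = {error:.2f}")
print("best model:", best)
```

The important design choice is that every candidate is scored on data it did not see during fitting. With a library such as scikit-learn the same idea is usually expressed through cross-validation rather than a single split.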
Now, let’s go to the most important step in my opinion.
Task 5. Deployment:
You have now finished your project and want to hand it over to the client. This step mainly involves showing your client what you have done to improve their business.
Based on their feedback, you then decide whether or not to revise your model.
Then, what kind of challenges could you face during this task?
The only challenge I can think of is communicating the results.
Most clients and stakeholders don’t know much technical jargon, so a technical explanation won’t help them much. Explaining your results in simple, layman’s language helps. Most effective of all is a slide presentation (PPT) that shows your results graphically.
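One way to practise the layman’s-language habit is to turn raw evaluation numbers into a sentence a non-technical reader can follow. The sketch below is a made-up example: the function `plain_language_summary`, its parameters, and the counts passed to it are all assumptions for illustration, and it presumes a simple classification task.

```python
def plain_language_summary(correct, total, baseline_correct):
    """Describe classification results without technical jargon."""
    per_hundred = round(100 * correct / total)
    baseline_per_hundred = round(100 * baseline_correct / total)
    gain = per_hundred - baseline_per_hundred
    return (
        f"Out of every 100 cases, the model gets about {per_hundred} right. "
        f"Always guessing the most common outcome gets about "
        f"{baseline_per_hundred} right, so the model adds roughly {gain} "
        f"correct answers per 100 cases."
    )


# Hypothetical results: 172 of 200 test cases correct, baseline gets 130.
message = plain_language_summary(correct=172, total=200, baseline_correct=130)
print(message)
```

Comparing against a simple baseline gives the client a reference point, which usually says more to them than an accuracy figure on its own.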
For students doing solo projects, just make a web app using Flask or Streamlit to explain your findings, then deploy it on the web using a hosting platform such as Heroku.
We have now finished applying the software engineering approach to a data science project.
But don’t forget the most important thing about the software engineering approach: it is not the process but the documentation.
Documenting your thinking, and how you applied it through the process above, helps both you and your client.
We learned about software engineering, the process of software engineering, and how to apply this process for a more effective data science project.
That’s it from me.
About the Author
Currently, I am pursuing my Bachelor of Engineering (B.E) in Computer Engineering from Smt. Indira Gandhi College of Engineering, Mumbai. I am very enthusiastic about Data Science, Data Analytics, and Machine Learning.
You can connect with me on LinkedIn.
Your suggestions and doubts are welcome in the comment section. Thank you for reading my article!
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.