- Intela has launched Farrago, an online tool to deal with dirty data
- It claims to be able to deal with data of any kind and in any context
- A demo is available on the site to test the tool
It’s well documented that cleaning up dirty data takes up the majority of a data scientist’s time. It’s a cumbersome and often tiring process, but it has to be done before the exploration and model building stages. With the recent rise in the number of data collection sources, the amount of data is at an all-time high. And it’s only going to increase.
Wellington-based company Intela wants to help companies deal with this “dirty data”. It has launched an online tool called “Farrago” that uses machine learning to clean up messy data. The company claims that the tool can deal with data coming from any source and in any context.
For demo purposes, the online tool lets you either upload your own dataset or select a pre-uploaded one. Farrago then suggests meaningful fields (based on an initial analysis of the data) to use when searching for duplicates.
It learns from your responses to improve its accuracy on your data. The more training, the better: Intela recommends a minimum of ten of each type of response. It then organizes everything and allows you to analyse and download the cleaned dataset.
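To give a rough feel for what duplicate searching over a few chosen fields can look like, here is a minimal sketch using only Python's standard library. It is not Farrago's method, which Intela has not described in detail. The function names, the sample records and the fixed 0.85 threshold are all hypothetical. A tool that learns from user responses would presumably adjust its matching from those labels rather than use a hard-coded cutoff.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(str(value).lower().split())

def similarity(a, b, fields):
    """Average string similarity of two records over the chosen fields."""
    scores = [
        SequenceMatcher(None, normalize(a[f]), normalize(b[f])).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)

def candidate_duplicates(records, fields, threshold=0.85):
    """Return index pairs whose similarity meets the (assumed) threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a, b, fields) >= threshold
    ]

# Hypothetical sample data: the first two rows differ only in case and spacing.
records = [
    {"name": "Jane Smith", "city": "Wellington"},
    {"name": "jane  smith", "city": "Wellington "},
    {"name": "Rahul Mehta", "city": "Auckland"},
]
pairs = candidate_duplicates(records, ["name", "city"])
print(pairs)
```

Comparing every pair of records is fine for a small demo file but grows quadratically, so real deduplication tools usually narrow the candidates first.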
You can take a demo run of the tool here to get a general idea of how it works.
Our take on this
This could really help advance a lot of initiatives in the field of data science. Data scientists spend thousands of hours cleaning up messy data before it is ready for analysis and model building.
In fact, even with the advancements in technology, data cleaning has remained a cumbersome and tricky step. There are times when dirty data holds back the entire data science life-cycle process in an organization.
If you, as a data scientist, are unable to clean up the dirty data, or miss a few issues in your haste to get to the model building part, your results will be wrong. Dirty data in will inevitably lead to dirty data out.
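A tiny illustration of "dirty data in, dirty data out", using made-up order records: a single accidental duplicate row is enough to inflate a simple total. The data and helper names here are hypothetical and only meant to show the effect.

```python
# Hypothetical order data with one accidental duplicate row.
orders = [
    {"customer": "Jane Smith", "amount": 120.0},
    {"customer": "Jane Smith", "amount": 120.0},  # duplicate of the row above
    {"customer": "Rahul Mehta", "amount": 80.0},
]

def total_revenue(rows):
    """Sum the amount column."""
    return sum(row["amount"] for row in rows)

def drop_exact_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

dirty_total = total_revenue(orders)                         # 320.0
clean_total = total_revenue(drop_exact_duplicates(orders))  # 200.0
print(dirty_total, clean_total)
```

Exact-match removal only catches identical rows. Near-duplicates with typos or formatting differences need fuzzier matching.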
There have been previous attempts at automating the data cleaning process, including a popular Python library called “datacleaner”. But these tools have had reported bugs and have not caught on in the community. Here’s hoping Farrago goes a long way in solving this oft-mentioned issue!