DataHack Radio #22: Exploring Computer Vision and Data Engineering with Dat Tran
How do computer vision techniques work in an industry setting? How does an organization use data engineering to scale up its operations?
These are questions every aspiring data scientist must be aware of. Dat Tran, Head of Data Science at idealo internet GmbH, is the perfect person to shed light on these questions.
Dat has worked on a variety of data engineering projects before he came to idealo, and now leads a team of data scientists who work on really cool computer vision problems. This is one of my favorite episodes since we launched DataHack Radio – the depth and breadth of topics covered, plus Dat’s incredible knowledge, make this a must-listen.
In this episode of the DataHack Radio podcast, Kunal and Dat cover multiple topics, including:
- Dat’s not-so-straightforward journey into data science
- How his team uses computer vision at idealo
- His rich experience in data engineering
- Challenges faced with implementing models and building data pipelines
- Advice to aspiring data scientists, and much more!
I have penned down a few highlights from the podcast below. But I strongly recommend listening to the entire conversation! The energy Dat brings to this episode is incredible.
Dat Tran’s Background and Journey into Data Science
Dat’s journey into data science isn’t your run-of-the-mill story. He hadn’t even heard of ‘machine learning’ during his undergrad days, where his focus was on investment banking. But Dat quickly realized it wasn’t the field for him. So what next?
Back to the drawing board – a Master’s degree! During this time, a couple of his friends were starting out in machine learning and it wasn’t long before Dat was drawn into this wonderfully complex field.
He landed a job in the advanced analytics department at Accenture. This was back when ‘Big Data’ was starting to become the ultimate buzzword in the industry – a great time to enter this field. Dat moved to Pivotal Inc. a year later (joining as a data scientist), recognizing that this was a brilliant opportunity to get more hands-on experience in machine learning.
At Pivotal, Dat worked on a variety of projects spanning different industries, including automotive and airlines. He worked there for over two years and credits a lot of his current knowledge and experience to his time at Pivotal. He gave talks at multiple PyData conferences as well during this time – a truly impressive achievement.
Dat is now working as the Head of Data Science at idealo internet GmbH, a successful Berlin-based startup and one of the largest portals in the German e-commerce market.
Data Science at idealo – Focusing on Computer Vision
idealo is a price comparison site (for products as well as hotels) so you can imagine the numerous data science functions the team performs – price prediction, indexing, developing and using a recommendation engine, among other things. Dat’s team, however, focuses on applying computer vision.
A fair question to ask – what role does computer vision have in a price comparison site? Well, idealo has a ton of images of products and hotels.
Dat explained this using a really intuitive example. idealo lists approximately 2 million accommodations, with 130 images per accommodation on average. Now, there are all kinds of hotels – small ones, medium ones, and the big players (the luxurious 5-star ones).
The pictures of these hotel rooms vary depending on who took them. The non-luxury hotels typically have images taken by owners themselves while the 5-star hotels send images taken by professionals. There is quite a big gap in the image quality between these two categories.
Dat and his team use an array of computer vision concepts to analyze and make use of these images:
- Image tagging: The algorithm tags each image according to what it shows – bedroom, bathroom, reception, etc.
- Image ordering: A second algorithm then reorders the images in a visually pleasing way
- Image upscaling: Dat’s team also uses computer vision to upscale images from low to high resolution
Really interesting stuff! It’s a pleasure to see computer vision making inroads in the industry, isn’t it?
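To make the tagging and ordering steps a little more concrete, here is a minimal sketch. It is not idealo's actual pipeline: the `TaggedImage` class, the `aesthetic_score` field and the sample values are all hypothetical, and the models that would produce the tags and scores are left out. It only shows the last step, sorting already-scored images so the most appealing ones come first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedImage:
    filename: str           # hypothetical file name
    tag: str                # e.g. "bedroom", as a tagging model might output
    aesthetic_score: float  # hypothetical quality score, higher is better


def order_images(images):
    """Return images sorted from most to least visually appealing."""
    # Sort by score descending; the file name breaks ties deterministically.
    return sorted(images, key=lambda img: (-img.aesthetic_score, img.filename))


# Made-up scores, purely for illustration.
gallery = [
    TaggedImage("reception.jpg", "reception", 5.1),
    TaggedImage("bedroom.jpg", "bedroom", 7.8),
    TaggedImage("bathroom.jpg", "bathroom", 6.4),
]

for image in order_images(gallery):
    print(image.tag, image.aesthetic_score)
```

In a real system the hard part is the scoring model itself. Once each image has a tag and a score, the reordering is just a sort.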
Data Engineering Experience
I came across Dat’s talk at PyData on YouTube – it doesn’t take long to realize he is a data engineering expert. His talk is titled ‘How you really get your data science models into production the cool way!’
At idealo, there are a variety of tools being used for data engineering, such as AWS for training and Kubernetes for putting models into production.
I personally feel data engineering is a very overlooked aspect (by aspiring data scientists) of the overall data science project lifecycle. You will most certainly face questions on model deployment and other aspects of software engineering in your data scientist interview. This section of the podcast will provide you with a bird’s eye view of an industry-ready process.
Challenges Faced in Implementing Data Science and Data Engineering
Data science and data engineering are inextricably linked – for all intents and purposes, you cannot separate them. Dat explained this using the example of a neural network, specifically a convolutional neural network (CNN). There are quite a few CNN architectures to choose from, like ResNet, MobileNet, VGG, etc.
The challenge with these CNN models is that they have a huge number of parameters (learned weights), which makes them quite large and slow to run. This brings up the age-old debate of balancing accuracy and speed. You can get away with a big, slow model in research, but in a production environment it is a significant obstacle.
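A quick back-of-the-envelope calculation shows why model size matters. The sketch below assumes each weight is stored as a 32-bit float (4 bytes) and uses rounded, commonly cited parameter counts for the reference versions of these architectures. Treat the numbers as ballpark figures, not as measurements from idealo's models.

```python
BYTES_PER_FLOAT32 = 4

# Rounded parameter counts for the reference architectures (approximate).
APPROX_PARAMS = {
    "VGG16": 138_000_000,
    "ResNet50": 25_600_000,
    "MobileNet (v1)": 4_200_000,
}


def weights_size_mb(num_params, bytes_per_param=BYTES_PER_FLOAT32):
    """Approximate size of the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bytes_per_param / 1e6


for name, num_params in APPROX_PARAMS.items():
    print(f"{name}: ~{weights_size_mb(num_params):.0f} MB of weights")
```

This only counts the stored weights. Memory for activations and the compute cost per image come on top, which is why a smaller architecture can be attractive in production even if it gives up some accuracy.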
Dat mentioned quite a few common challenges from the data engineering specific side as well, including:
“How can we use a Keras trained model on a TensorFlow backend?”
“Do we need to transform our images into certain formats?”
“How do we benchmark our model results?”
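On the benchmarking question, one reading is measuring how long a model takes to answer a single request. Here is a minimal sketch of that idea using only the Python standard library. The `benchmark` helper and the `dummy_predict` function are hypothetical stand-ins; in practice you would pass in your real model's predict call and a realistic input.

```python
import statistics
import time


def benchmark(predict, sample, warmup=3, runs=20):
    """Time repeated calls to predict(sample) and summarize the latency."""
    # Warm-up calls are not timed, so one-off setup costs do not skew results.
    for _ in range(warmup):
        predict(sample)

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000)

    return {
        "runs": runs,
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }


def dummy_predict(pixels):
    # Placeholder for a real model call: just averages the input values.
    return sum(pixels) / len(pixels)


report = benchmark(dummy_predict, list(range(10_000)))
print(report)
```

Reporting the median alongside the worst case is a deliberate choice here: a single slow call can distort an average, while the maximum still tells you how bad things can get.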
Keeping Yourself Updated on the Latest Data Science Techniques
“You best learn about these things when you do them yourself.”
As we alluded to earlier, Dat has done most of his data science learning on the job. There is nothing like practical hands-on experience to indelibly ingrain concepts.
Outside of that, there are so many options to learn from these days (everything is a quick Google search away!):
- Blog posts
- Podcasts, etc.
A major challenge with these platforms is that we don’t get a structured path or answer to a specific problem. That, again, is why experience is king in data science.
Advice to Aspiring Data Science Professionals
Software engineering is a key facet of data science most aspiring professionals are unaware of. And you simply can’t get away from it in an industry role. So here’s Dat’s advice for you:
“You need kind of an engineering background. Learn the basics – how to write clean code, version control, testing, and move on to data science then.”
And this really, REALLY important point:
“The Machine Learning aspect is a small part of a big software project!”
Knowing mathematics, statistics, machine learning algorithms and even tools like R and Python is good, but these don’t differentiate you from the competition. Everyone else is learning the same thing. So what else is there? It comes down to that one thing again – software engineering.
Dat’s Data Science Hiring Process
Dat uses a straightforward set of pointers and rounds to judge a candidate’s ability:
- 10 basic machine learning questions: Most people drop off at this stage
- A machine learning assignment
- On-site interview: This includes working with a member of Dat’s data science team to solve a problem
- How well does the candidate write and document code?
- Ability to research and the thought process behind it
Future Trends in Machine Learning
Which machine learning functions will see a major improvement and focus in the coming years?
- AutoML will continue to gain market share and become an accepted member of the machine learning tool family
- Explainable AI: The ability to build interpretable deep learning models will take on far more importance
- There will be a far bigger focus on security and governance
One of my favorite DataHack Radio episodes so far! Dat brings a ton of enthusiasm and knowledge to the podcast, and it really shines through in the way he explains his role, the challenges his team faces from both a data science and a data engineering perspective, and his advice to aspiring data scientists.
It was a pleasure listening to him elaborate on relevant industry problems and how to overcome them. What was your favorite part of the episode? Let us know in the comments section below.