As datasets keep growing, the ability to build machine learning models that scale has become increasingly important. Traditional single-machine approaches often cannot handle the large datasets that many organizations work with. Apache Spark addresses this with a distributed computing framework that lets you build and train machine learning models at scale.

During this workshop, you will gain hands-on experience using Spark ML, Apache Spark's machine learning API, to build and test different machine learning models. You will learn about the challenges and opportunities that arise when working with big data, including data preparation, feature engineering, and model selection.

Workshop Highlights:

Module 0: Introduction to Spark

  • Why do we need distributed systems?
  • What is Apache Spark?
  • Understanding Spark Architecture
  • Installing and setting up PySpark

Module 1: Getting familiar with Spark

  • Understanding RDDs
  • Learn to create RDDs and get familiar with RDD operations
  • Handle structured data with Spark DataFrames

Module 2: Brushing up ML

  • What is Machine Learning?
  • Types of ML: Supervised, Unsupervised, Reinforcement
  • Types of ML problems: Regression, Classification

Module 3: Spark ML

  • Understanding the problem statement
  • Exploratory data analysis (EDA)
  • Encoding categorical variables
  • Understanding VectorAssembler
  • Model Building with Spark ML
  • Evaluating Models
  • Fine-tuning models

Module 4: Building ML Pipelines

  • Understand Transformers
  • Understand Estimators
  • Build Pipelines in Spark ML


Prerequisites:

  • Laptop with a minimum of 8 GB of RAM
  • Knowledge of Python
  • Basic understanding of Machine Learning

Note: These are tentative details and are subject to change.

