What Are the Best Practices for Deploying PySpark on AWS?

Prashant Malge | 17 Nov 2023 · 15 min read


In big data and advanced analytics, PySpark has emerged as a powerful tool for processing large datasets and analyzing distributed data. Deploying PySpark applications on the cloud can be a game-changer, offering scalability and flexibility for data-intensive tasks. Amazon Web Services (AWS) provides an ideal platform for such deployments, and when combined with Docker containers, it becomes a seamless and efficient solution.


However, deploying PySpark on a cloud infrastructure can be complex and daunting. The intricacies of setting up a distributed computing environment, configuring Spark clusters, and managing resources often deter many from harnessing its full potential.

Learning Objectives

  • Learn the fundamental concepts of PySpark, AWS, and Docker, ensuring a solid foundation for deploying PySpark clusters on the cloud.
  • Follow a comprehensive, step-by-step guide to set up PySpark on AWS using Docker, including configuring AWS, preparing Docker images, and managing Spark clusters.
  • Discover strategies for optimizing PySpark performance on AWS, including monitoring, scaling, and adhering to best practices to make the most of your data processing workflows.

This article was published as a part of the Data Science Blogathon.



Prerequisites

Before embarking on the journey to deploy PySpark on AWS using Docker, ensure that you have the following prerequisites in place:

πŸš€ Local PySpark Installation: To develop and test PySpark applications, it’s essential to have PySpark installed on your local machine. You can install PySpark by following the official documentation for your operating system. This local installation will serve as your development environment, allowing you to write and test PySpark code before deploying it on AWS.

🌐 AWS Account: You’ll need an active AWS (Amazon Web Services) account to access the cloud infrastructure and services required for PySpark deployment. You can sign up on the AWS website if you don’t have an AWS account. Be prepared to provide your payment information, although AWS offers a free tier with limited resources for new users.

🐳 Docker Installation: Docker is a pivotal component in this deployment process. Install Docker on your local machine by following the instructions below for your operating system. Docker containers allow you to encapsulate and deploy your PySpark applications consistently.

Windows

  1. Visit the Docker website and open the Docker Desktop download page.
  2. Download the Docker Desktop for Windows installer.
  3. Double-click the installer to run it.
  4. Follow the installation wizard’s instructions.
  5. Once installed, launch Docker Desktop from your applications.

macOS

  1. Head to the Docker website and open the Docker Desktop for Mac download page.
  2. Download the Docker Desktop for Mac installer.
  3. Double-click the installer to open it.
  4. Drag the Docker icon to your Applications folder.
  5. Launch Docker from your Applications.

Linux (Ubuntu)

1. Open your terminal and update your package manager:

sudo apt-get update

2. Install necessary dependencies:

sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common

3. Add Docker’s official GPG key:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

4. Set up the Docker repository:

echo "deb [signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

5. Update your package index again:

sudo apt-get update

6. Install Docker:

sudo apt-get install -y docker-ce docker-ce-cli containerd.io

7. Start and enable the Docker service:

sudo systemctl start docker
sudo systemctl enable docker

8. Verify the installation:

sudo docker --version



Setting Up AWS

Amazon Web Services (AWS) is the backbone of our PySpark deployment, and we’ll use two essential services, Elastic Container Registry (ECR) and Elastic Compute Cloud (EC2), to create a dynamic cloud environment.


AWS Account Registration

If you haven’t already, head to the AWS sign-up page to create an account. Please follow the registration process, provide the necessary information, and be ready with your payment details if you’d like to explore beyond the AWS Free Tier.

AWS Free Tier

For those new to AWS, take advantage of the AWS Free Tier, which offers limited resources and services at no cost for 12 months. This is an excellent way to explore AWS without incurring charges.

AWS Access Key and Secret Key

You’ll need an Access Key ID and Secret Access Key to interact with AWS programmatically. Follow these steps to generate them:

  • Log in to the AWS Management Console.
  • Navigate to the Identity & Access Management (IAM) service.
  • Click on “Users” in the left navigation pane.
  • Create a new user or select an existing one.
  • Under the “Security credentials” tab, generate an Access Key.
  • Note down the Access Key ID and Secret Access Key; you’ll use them later.

Elastic Container Registry (ECR)

ECR is a managed Docker container registry service provided by AWS. It will be our repository for storing Docker images. You can set up your ECR by following these steps:

  • In the AWS Management Console, navigate to the Amazon ECR service.
  • Create a new repository, give it a name, and configure the repository settings.
  • Note down the URI of your ECR repository; you’ll need it for Docker image pushes.

Elastic Compute Cloud (EC2)

EC2 provides scalable computing capacity in the cloud and will host your PySpark applications. To set up an EC2 instance:

  • In the AWS Management Console, navigate to the EC2 service.
  • Launch a new EC2 instance, choosing the instance type that suits your workload.
  • Configure the instance details and storage options.
  • Create or select an existing key pair to securely connect to your EC2 instance.



Storing Your AWS Setup Values for Future Use

AWS_ECR_LOGIN_URI: 123456789012.dkr.ecr.region.amazonaws.com
AWS_REGION: us-east-1
ECR_REPOSITORY_NAME: your-ecr-repository-name
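As a quick sanity check, these values can be combined into the full image URI that your pipeline will tag and push to ECR. A minimal Python sketch (the account ID, region, and repository name are illustrative placeholders, not real values):

```python
# Illustrative placeholder values; substitute your real AWS setup values.
AWS_ECR_LOGIN_URI = "123456789012.dkr.ecr.us-east-1.amazonaws.com"
ECR_REPOSITORY_NAME = "your-ecr-repository-name"

def ecr_image_uri(tag: str = "latest") -> str:
    """Compose the full image URI used by docker tag/push."""
    return f"{AWS_ECR_LOGIN_URI}/{ECR_REPOSITORY_NAME}:{tag}"

print(ecr_image_uri())
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/your-ecr-repository-name:latest
```

The same string is what the CI/CD workflow later assembles from GitHub secrets.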

Setting Up GitHub Secrets and Variables

Now that you have your AWS setup values ready, it’s time to securely configure them in your GitHub repository using GitHub secrets and variables. This adds an extra layer of security and convenience to your PySpark deployment process.

Follow these steps to set up your AWS values:

Access Your GitHub Repository

  • Navigate to the GitHub repository hosting your PySpark project.

Access Repository Settings

  • Inside your repository, click on the “Settings” tab.

Secrets Management

  • In the left sidebar, click “Secrets and variables,” then “Actions,” to access the GitHub secrets management interface.

Add a New Secret

  • Here, you can add your AWS setup values as secrets.
  • Click on “New Repository Secret” to create a new secret.
  • For each AWS value, create a secret with a name that corresponds to the value’s purpose (e.g., “AWS_ACCESS_KEY_ID,” “AWS_SECRET_ACCESS_KEY,” “AWS_REGION,” etc.).
  • Enter the actual value in the “Value” field.

Save Your Secrets

  • Click the “Add secret” button for each value to save it as a GitHub secret.

With your AWS secrets securely stored in GitHub, you can easily reference them in your GitHub Actions workflows and securely access AWS services during deployment.
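For example, a workflow step can reference the stored secrets directly through the `secrets` context, so no credential ever appears in the repository itself:

```yaml
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v1
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: ${{ secrets.AWS_REGION }}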

Best Practice

  • GitHub secrets are encrypted and can only be accessed by authorized users with the necessary permissions. This ensures the security of your sensitive AWS values.
  • Using GitHub secrets, you avoid exposing sensitive information directly in your code or configuration files, enhancing your project’s security.

Your AWS setup values are now safely configured in your GitHub repository, making them readily available for your PySpark deployment workflow.

Understanding the Code Structure

To effectively deploy PySpark on AWS using Docker, it’s essential to grasp the structure of your project’s code. Let’s break down the components that make up the codebase:

β”œβ”€β”€ .github
β”‚   β”œβ”€β”€ workflows
β”‚   β”‚   β”œβ”€β”€ build.yml
β”œβ”€β”€ airflow
β”œβ”€β”€ configs
β”œβ”€β”€ consumerComplaint
β”‚   β”œβ”€β”€ cloud_storage
β”‚   β”œβ”€β”€ components
β”‚   β”œβ”€β”€ config
β”‚   β”‚   β”œβ”€β”€ py_sparkmanager.py
β”‚   β”œβ”€β”€ constants
β”‚   β”œβ”€β”€ data_access
β”‚   β”œβ”€β”€ entity
β”‚   β”œβ”€β”€ exceptions
β”‚   β”œβ”€β”€ logger
β”‚   β”œβ”€β”€ ml
β”‚   β”œβ”€β”€ pipeline
β”‚   β”œβ”€β”€ utils
β”œβ”€β”€ output
β”‚   β”œβ”€β”€ .png
β”œβ”€β”€ prediction_data
β”œβ”€β”€ research
β”‚   β”œβ”€β”€ jupyter_notebooks
β”œβ”€β”€ saved_models
β”‚   β”œβ”€β”€ model.pkl
β”œβ”€β”€ tests
β”œβ”€β”€ venv
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ app.py
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ .gitignore
β”œβ”€β”€ .dockerignore

Application Code (app.py)

  • app.py is your main Python script responsible for running the PySpark application.
  • It’s the entry point for your PySpark jobs and serves as the core of your application.
  • You can customize this script to define your data processing pipelines, job scheduling, and more.
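As a rough illustration (the function names and import path below are hypothetical, not taken from the repository), app.py might pair a pure data-shaping helper with a Spark entry point:

```python
# Hypothetical sketch of app.py; names and paths are illustrative.

def top_products(rows, n=3):
    """Pure helper: count complaints per product and return the n most frequent."""
    counts = {}
    for row in rows:
        counts[row["product"]] = counts.get(row["product"], 0) + 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def main():
    # Imported inside main() so the helper above can be tested without Spark.
    from consumerComplaint.config.py_sparkmanager import spark_session
    df = spark_session.read.json("s3a://your-bucket/complaints/*.json")
    df.groupBy("product").count().orderBy("count", ascending=False).show(3)

if __name__ == "__main__":
    main()
```

Keeping data-shaping logic in plain functions like `top_products` makes the script unit-testable outside the cluster.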

Dockerfile

  • The Dockerfile contains instructions for building a Docker image for your PySpark application.
  • It specifies the base image, adds necessary dependencies, copies application code into the container, and sets up the runtime environment.
  • This file plays a crucial role in containerizing your application for seamless deployment.

Requirements (requirements.txt)

  • requirements.txt lists the Python packages and dependencies required for your PySpark application.
  • These packages are installed within the Docker container to ensure your application runs smoothly.
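A plausible requirements.txt for a project like this (the package choices are illustrative; pin versions that match your Spark and Python setup):

```text
pyspark
python-dotenv
boto3
pandas
scikit-learn
```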

GitHub Actions Workflows

  • GitHub Actions workflows are defined in .github/workflows/ within your project repository.
  • They automate the build, testing, and deployment processes.
  • Workflow files, such as build.yml, outline the steps to execute when specific events occur, such as code pushes or pull requests.

Building py_sparkmanager.py

import os
from dotenv import load_dotenv
from pyspark.sql import SparkSession

# Load environment variables from .env
load_dotenv()

access_key_id = os.getenv("AWS_ACCESS_KEY_ID")
secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY")

# Initialize SparkSession
spark_session = SparkSession.builder.master('local[*]').appName('consumer_complaint') \
    .config("spark.executor.instances", "1") \
    .config("spark.executor.memory", "6g") \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memoryOverhead", "8g") \
    .config('spark.jars.packages',
            "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3") \
    .getOrCreate()

# Configure SparkSession for AWS S3 access
spark_session._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", access_key_id)
spark_session._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", secret_access_key)
spark_session._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark_session._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")
spark_session._jsc.hadoopConfiguration().set("fs.s3.buffer.dir", "tmp")

This code sets up your SparkSession, configures it for AWS S3 access, and loads AWS credentials from environment variables, allowing you to work with AWS services seamlessly in your PySpark application.
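Once the session is configured this way, reads and writes simply use s3a:// URIs. A tiny helper makes the paths explicit (the bucket and key below are illustrative):

```python
def s3a_uri(bucket: str, key: str) -> str:
    """Build an s3a:// URI for Spark reads and writes."""
    return f"s3a://{bucket}/{key.lstrip('/')}"

# With live AWS credentials, the session above could then read:
# df = spark_session.read.csv(s3a_uri("my-data-bucket", "complaints/2023.csv"), header=True)
print(s3a_uri("my-data-bucket", "complaints/2023.csv"))
# s3a://my-data-bucket/complaints/2023.csv
```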

Preparing PySpark Docker Images

This section will explore how to create Docker images that encapsulate your PySpark application, making it portable, scalable, and ready for deployment on AWS. Docker containers provide a consistent environment for your PySpark applications, ensuring seamless execution in various settings.


The key to building Docker images for PySpark is a well-defined Dockerfile. This file specifies the instructions for setting up the container environment, including Python and PySpark dependencies.

# Use an Ubuntu base image
FROM ubuntu:20.04

# Set JAVA_HOME and install OpenJDK 8
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
RUN apt-get update -y \
    && apt-get install -y openjdk-8-jdk \
    && apt-get install python3-pip -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Set environment variables for your application
ENV AIRFLOW_HOME="/app/airflow"
ENV PYSPARK_PYTHON=/usr/bin/python3

# Create a directory for your application and set it as the working directory
WORKDIR /app

# Copy the contents of the current directory to the working directory in the container
COPY . /app

# Install Python dependencies from requirements.txt
RUN pip3 install -r requirements.txt

# Set the entry point to run your app.py script
CMD ["python3", "app.py"]
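Because COPY . /app copies the entire build context into the image, the .dockerignore file shown in the project tree keeps the image lean. A plausible version (entries are illustrative):

```text
venv/
.git/
__pycache__/
output/
research/
```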

Building the Docker Image

Once you have your Dockerfile ready, you can build the Docker image using the following command:

docker build -t your-image-name .

Replace your-image-name with the desired name (and optional :tag) for your Docker image. The trailing dot sets the build context to the current directory.

Verifying the Local Image

After building the image, you can inspect your local Docker state with the following commands:

docker images       # list local images

docker ps -a        # list all containers, running and stopped

docker system df    # show Docker disk usage

Running PySpark in Docker

With your Docker image prepared, you can go ahead and run your PySpark application in a Docker container. Use the following command:

docker run your-image-name

To expose the application, map a host port to the container port with -p (host:container):

docker run -p 80:8080 your-image-name

docker run -p 8080:8080 your-image-name

Deploying PySpark on AWS

This section will walk through deploying your PySpark application on AWS using Docker containers. This deployment will involve launching Amazon Elastic Compute Cloud (EC2) instances for creating a PySpark cluster.

Launch EC2 Instances

  • In the EC2 Dashboard, click “Launch Instances.”
  • You can select an Amazon Machine Image (AMI) that suits your needs, often Linux-based.
  • Depending on your workload, choose the instance type (e.g., m5.large, c5.xlarge).
  • Configure instance details, including the number of instances in your cluster.
  • Add storage, tags, and security groups as needed.

These options were covered in the AWS setup section above.

Connect to EC2 Instances

  • Once the instances are running, SSH into them to manage your PySpark cluster.

Install Docker on the EC2 Instance

Download the Docker installation script

curl -fsSL https://get.docker.com -o get-docker.sh

Run the Docker installation script with root privileges

sudo sh get-docker.sh

Add the current user to the docker group (replace ‘ubuntu’ with your username)

sudo usermod -aG docker ubuntu

Activate the changes by running a new shell session or using ‘newgrp’

newgrp docker

Building a GitHub Self-Hosted Runner

We’ll set up a self-hosted runner for GitHub Actions, responsible for executing your CI/CD workflows. A self-hosted runner runs on your infrastructure and is a good choice for running workflows that require specific configurations or access to local resources.

Setting Up the Self-Hosted Runner

  • Click “Settings.”
  • Click “Actions” -> “Runners.”
  • Click “New self-hosted runner.”

Run the following commands on the EC2 machine:

  • Create a folder: This command creates a directory named actions-runner and changes the current directory to this newly created folder.
$ mkdir actions-runner && cd actions-runner
  • Download the latest runner package: This command downloads the GitHub Actions runner package for Linux x64 (GitHub displays the exact download URL on the “New self-hosted runner” page) and saves it as actions-runner-linux-x64-2.309.0.tar.gz.
$ curl -o actions-runner-linux-x64-2.309.0.tar.gz -L
  • Optional: Validate the hash: This command checks the integrity of the downloaded package by validating its hash. It computes the SHA-256 hash of the downloaded package and compares it to a known, expected hash. If they match, the package is considered valid.
$ echo "2974243bab2a282349ac833475d241d5273605d3628f0685bd07fb5530f9bb1a  actions-runner-linux-x64-2.309.0.tar.gz" | shasum -a 256 -c
  • Extract the installer: This command extracts the contents of the downloaded package, which is a tarball (compressed archive).
$ tar xzf ./actions-runner-linux-x64-2.309.0.tar.gz
  • Configure and run it: The “New self-hosted runner” page also shows a ./config.sh command with your repository URL and a registration token; run that first to register the runner. The run.sh command then starts the runner so it can execute GitHub Actions workflows for the specified repository.
$ ./config.sh --url https://github.com/<owner>/<repo> --token <TOKEN>
$ ./run.sh

Continuous Integration and Continuous Delivery (CICD) Workflow Configuration

In a CI/CD pipeline, the build.yaml file is crucial in defining the steps required to build and deploy your application. This configuration file specifies the workflow for your CI/CD process, including how code is built, tested, and deployed. Let’s dive into the critical aspects of the build.yaml configuration and its importance:

Workflow Overview

The build.yaml file outlines the tasks executed during the CI/CD pipeline. It defines the steps for continuous integration, which involves building and testing your application and continuous delivery, where the application is deployed to various environments.

Continuous Integration (CI)

This phase typically includes tasks like code compilation, unit testing, and code quality checks. The build.yaml file specifies the tools, scripts, and commands required to perform these tasks. For example, it might trigger the execution of unit tests to ensure code quality.

Continuous Delivery (CD)

After successful CI, the CD phase involves deploying the application to different environments, such as staging or production. The build.yaml file specifies how the deployment should happen, including where and when to deploy and which configurations to use.


Dependency Management

The build.yaml file often includes details about project dependencies. It defines where to fetch external libraries or dependencies from, which can be crucial for the successful build and deployment of the application.

Environment Variables

CI/CD workflows often require environment-specific configurations, such as API keys or connection strings. The build.yaml file may define how these environment variables are set for each pipeline stage.

Notifications and Alerts

In case of failures or issues during the CI/CD process, notifications and alerts are essential. The build.yaml file can configure how and to whom these alerts are sent, ensuring that problems are addressed promptly.

Artifacts and Outputs

Depending on the CI/CD workflow, the build.yaml file may specify what artifacts or build outputs should be generated and where they should be stored. These artifacts can be used for deployments or further testing.

By understanding the build.yaml file and its components, you can effectively manage and customize your CI/CD workflow to meet the needs of your project. It is the blueprint for the entire automation process, from code changes to production deployments.

CI/CD Pipeline

You can customize the content further based on the specific details of your build.yaml configuration and how it fits into your CI/CD pipeline.

name: workflow

on:
  push:
    branches:
      - main
    paths-ignore:
      - 'README.md'

permissions:
  id-token: write
  contents: read

jobs:
  integration:
    name: Continuous Integration
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3

      - name: Lint code
        run: echo "Linting repository"

      - name: Run unit tests
        run: echo "Running unit tests"

  build-and-push-ecr-image:
    name: Continuous Delivery
    needs: integration
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3

      - name: Install Utilities
        run: |
          sudo apt-get update
          sudo apt-get install -y jq unzip

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Build, tag, and push image to Amazon ECR
        id: build-image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          ECR_REPOSITORY: ${{ secrets.ECR_REPOSITORY_NAME }}
          IMAGE_TAG: latest
        run: |
          # Build a docker container and push it to ECR
          # so that it can be deployed on EC2.
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          echo "::set-output name=image::$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG"

  continuous-deployment:
    needs: build-and-push-ecr-image
    runs-on: self-hosted
    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Pull latest image
        run: |
          docker pull ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:latest

      - name: Stop and remove sensor container if running
        run: |
          docker ps -q --filter "name=sensor" | grep -q . && docker stop sensor && docker rm -fv sensor || true

      - name: Run Docker Image to serve users
        run: |
          docker run -d -p 80:8080 --name=sensor -e 'AWS_ACCESS_KEY_ID=${{ secrets.AWS_ACCESS_KEY_ID }}' -e 'AWS_SECRET_ACCESS_KEY=${{ secrets.AWS_SECRET_ACCESS_KEY }}' -e 'AWS_REGION=${{ secrets.AWS_REGION }}' ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:latest

      - name: Clean previous images and containers
        run: |
          docker system prune -f
If any issues occur, refer to the GitHub repository linked at the end of this article.

Continuous-Deployment Job:

  • This job depends on the “Build-and-Push-ECR-Image Job” and is configured to run on a self-hosted runner.
  • It checks out the code and configures AWS credentials.
  • It logs in to Amazon ECR.
  • It pulls the latest Docker image from the specified ECR repository.
  • It stops and removes a Docker container named “sensor” if it’s running.
  • It runs a Docker container named “sensor” with the specified settings, environment variables, and the Docker image pulled earlier.
  • Finally, it cleans up previous Docker images and containers using docker system prune.

Automate Workflow Execution on Code Changes

To make the entire CI/CD process seamless and responsive to code changes, you can configure your repository to trigger the workflow upon code commits automatically or pushes. Every time you save and push changes to your repository, the CI/CD pipeline will start working its magic.

By automating the workflow execution, you ensure that your application remains up-to-date with the latest changes without manual intervention. This automation can significantly improve development efficiency and provide rapid feedback on code changes, making it easier to catch and resolve issues early in the development cycle.

To set up automated workflow execution on code changes, follow these steps:

git add .

git commit -m "message"

git push origin main


Conclusion

In this comprehensive guide, we’ve walked you through the intricate process of deploying PySpark on AWS using EC2 and ECR. Utilizing containerization and continuous integration and delivery, this approach provides a robust and adaptable solution for managing large-scale data analytics and processing tasks. By following the steps outlined in this blog, you can harness the full power of PySpark in a cloud environment, taking advantage of the scalability and flexibility AWS offers.

It’s important to note that AWS presents many deployment options, from EC2 and ECR to specialized services like EMR. The choice of method ultimately depends on the unique requirements of your project. Whether you prefer the containerization approach demonstrated here or opt for a different AWS service, the key is to leverage the capabilities of PySpark effectively in your data-driven applications. With AWS as your platform, you’re well-equipped to unlock the full potential of PySpark, ushering in a new era of data analytics and processing. Explore services like EMR if they align better with your specific use cases and preferences, as AWS provides a diverse toolkit for deploying PySpark to meet the unique needs of your projects.

Key Takeaways

  • Deploying PySpark on AWS with Docker streamlines big data processing, offering scalability and automation.
  • GitHub Actions simplify the CI/CD pipeline, enabling seamless code deployment.
  • Leveraging AWS services like EC2 and ECR ensures robust PySpark cluster management.
  • This tutorial equips you to harness the power of cloud computing for data-intensive tasks.

Frequently Asked Questions

Q1. What is PySpark, and why use it with AWS?

A. PySpark is the Python API for Apache Spark, a robust big data processing framework. Deploying PySpark on AWS offers scalable and flexible solutions for data-intensive tasks, making it an ideal choice for distributed data analysis.

Q2. Can I run PySpark locally, or is cloud deployment necessary?

A. While you can run PySpark locally, cloud deployment is recommended for handling large datasets efficiently. AWS provides the infrastructure and tools needed for scaling PySpark applications.

Q3. How do I secure sensitive AWS credentials in my CI/CD pipeline?

A. Use GitHub Secrets to store AWS credentials and securely access them in your workflow. This ensures your credentials remain protected and are not exposed in your code.

Q4. What are the key benefits of using Docker containers in the PySpark deployment?

A. Docker containers offer a consistent environment across different platforms, ensuring your PySpark application runs the same way in development, testing, and production. They also simplify the process of building and deploying PySpark applications.

Q5. What are the cost implications of running PySpark on AWS?

A. The cost of running PySpark on AWS depends on various factors, including the type and number of EC2 instances used, data storage, data transfer, and more. Monitoring your AWS usage and optimizing resources is essential to manage costs efficiently.

Resources for Further Learning

  • GitHub Repository: Access the complete source code and configurations used in this tutorial on the Consumer Complaint Dispute Prediction GitHub repository.
  • Docker Documentation: Dive deeper into Docker and containerization by exploring the official Docker documentation. You’ll find comprehensive guides, best practices, and tips to master Docker.
  • GitHub Actions Documentation: Unleash the full power of GitHub Actions by referring to the GitHub Actions documentation. This resource will help you create, customize, and automate your workflows.
  • PySpark Official Documentation: For in-depth knowledge of PySpark, explore the official PySpark documentation. Learn about APIs, functions, and libraries for big data processing.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
