Python is a great language for data analysis, primarily because of its ecosystem of data-centric packages. Pandas is one of those packages, and it makes importing and analysing data much easier. Its API builds on the familiar Pythonic dot syntax. Label-based slicing, fancy indexing, subsetting of large data sets, and intuitive merging and joining make pandas very powerful for data analysis. The pandas community is also very strong, so if you are a researcher it is easy to find pandas code on the web for almost any transformation you need.
The same advantages that make pandas such a great tool for data analysis can become a barrier when you want to run the same transformations in production. Production code often needs to distribute its computation, and Spark is a natural fit for that. The drawback of traditional Spark code was that it was RDD-based, so you needed to be familiar with functional paradigms. Translating pandas code to Spark was often difficult because it was not apparent how dataframe-based pandas code mapped onto the RDD-based Spark API.
This has changed with the advent of higher-level APIs in Spark which are called, guess what, DataFrames and Datasets. DataFrames arrived in Spark 1.3 and Datasets in Spark 1.6. In this hack session we will go through common pandas transformations on some popular datasets and see how similar the Spark dataframe transformations are to their pandas counterparts. You as a developer will need to remember less code and fewer ways of thinking.
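To give a flavour of that similarity, here is a minimal sketch rather than material from the session itself. The toy data and column names are made up for illustration. The pandas lines run as written; the PySpark equivalents are shown as comments and assume a Spark dataframe `sdf` created with `spark.createDataFrame(pdf)` from an existing `SparkSession` named `spark`.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
pdf = pd.DataFrame(
    {
        "city": ["Pune", "Delhi", "Pune", "Delhi"],
        "age": [25, 41, 37, 29],
    }
)

# Filtering rows.
# PySpark: sdf.filter(sdf["age"] > 30)
over_30 = pdf[pdf["age"] > 30]

# Grouping and aggregating.
# PySpark: sdf.groupBy("city").agg(F.avg("age").alias("mean_age"))
#          (assuming `from pyspark.sql import functions as F`)
mean_age = pdf.groupby("city", as_index=False).agg(mean_age=("age", "mean"))

# Sorting.
# PySpark: sdf.orderBy("age", ascending=False)
by_age = pdf.sort_values("age", ascending=False)

print(over_30)
print(mean_age)
print(by_age)
```

One difference worth keeping in mind: the pandas lines compute their results immediately, while the commented Spark calls only describe a computation that runs once an action such as `show()` or `collect()` is called.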
Apart from this, we will also look at some basic Spark concepts that will help us write better Spark transformations.
Lastly, for advanced users who have always wanted more from pandas, we will look at the map, flatMap, and lazily evaluated computations that are now possible in Scala using Spark Datasets.
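The idea of lazy map and flatMap pipelines can be sketched in plain Python. This is only an analogy built on the standard library, not Spark's API: the `split_words` helper and the sample lines are hypothetical, and in Spark the corresponding steps would be Dataset `map` and `flatMap` calls followed by an action such as `count` or `collect`.

```python
from itertools import chain

lines = ["spark makes pandas scale", "pandas is familiar"]

evaluated = []  # records which lines were actually processed


def split_words(line):
    """Split a line into words, remembering that this line was touched."""
    evaluated.append(line)
    return line.split()


# "Transformations": building the pipeline does not run anything yet.
words = chain.from_iterable(map(split_words, lines))  # flatMap analogue
lengths = map(len, words)                             # map analogue

nothing_ran_yet = len(evaluated) == 0

# "Action": consuming the pipeline is what triggers the work.
total_chars = sum(lengths)

print(nothing_ran_yet)
print(total_chars)
```

Nothing is split until `sum` pulls values through the pipeline, which mirrors how Spark defers work until an action is requested.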
Some of the concepts that we will discuss in this hack session are detailed below.
- A quick word on the Spark context.
- A look at Scala, sbt and Spark tools such as spark-submit, spark-shell and pyspark.
- Spark paradigms: transformations and actions.
- Difference: creating a pandas DF and a Spark DF.
- What happens under the hood of a Spark dataframe.
- Having a more distributed mindset.
- Common pandas transformations and their Spark counterparts.
- Understanding the shape of the matrix.
- Missing values imputation.
- Filtering data.
- Order-by and group-by operations.
- Merges and joins.
- Function application, transformations and mapping.
- The Spark dataframes, pipeline API.
- A look at the datasets.
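A few of the items above (shape, missing-value imputation, merges) might look like the following in pandas. This is a rough sketch with made-up tables, not the session's own material. The PySpark equivalents appear only as comments and assume Spark dataframes named `people_sdf` and `cities_sdf` holding the same data.

```python
import pandas as pd

# Hypothetical toy tables; names and values are illustrative only.
people = pd.DataFrame({"id": [1, 2, 3], "age": [25.0, None, 35.0]})
cities = pd.DataFrame({"id": [1, 2, 4], "city": ["Pune", "Delhi", "Goa"]})

# Shape of the data.
# PySpark has no .shape attribute; one option is
# (people_sdf.count(), len(people_sdf.columns))
n_rows, n_cols = people.shape

# Missing-value imputation with the column mean.
# PySpark: people_sdf.na.fill({"age": mean_age}), where mean_age is
#          computed first with an aggregation.
mean_age = people["age"].mean()
filled = people.fillna({"age": mean_age})

# Inner join on a shared key.
# PySpark: people_sdf.join(cities_sdf, on="id", how="inner")
joined = filled.merge(cities, on="id", how="inner")

print(n_rows, n_cols)
print(filled)
print(joined)
```

Note that counting rows in Spark is itself an action that triggers a distributed job, which is one reason a cheap pandas habit like checking `.shape` needs more thought on a cluster.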
Joydeep Bhattacharjee is a Principal Engineer who works for Nineleaps Technology Solutions. After graduating from the National Institute of Technology at Silchar, he started working in the software industry, where he stumbled upon Python. Through Python, he stumbled upon machine learning. Now he primarily develops intelligent systems that can parse and process data to solve challenging problems at work. He believes in sharing knowledge and loves mentoring in machine learning. He also likes writing about machine learning topics.
Duration of Hack-Session: 1 hour