Classification using Pyspark, DataBricks, and Koalas
This article was published as a part of the Data Science Blogathon.
The volume of data collected worldwide has drastically increased over the past decade. Nowadays, data is continuously generated if we open an app, perform a Google search, or simply move from place to place with our mobile devices. This data generation has boosted business decision-making and operational efficiencies whilst advancing the field of predictive analytics. As a result, businesses and organizations must manage, store, visualize, and analyze massive volumes of important data.
Several specialized big data software and architecture solutions have been developed to handle this level of complexity and volume. This article will use big data tools like PySpark MLlib and Koalas to solve a classification problem.
What is PySpark and Koalas?
Big data platforms are especially created to handle enormous volumes of data that enter the system at high velocities and in a broad variety. These big data platforms often include a variety of servers, databases, and business intelligence tools that enable data scientists to manipulate data to discover trends and patterns.
PySpark is one such tool to handle large-scale data processing. MLlib is Spark’s machine learning (ML) library and is a wrapper over PySpark. To store and process data, this library employs the data parallelism technique.
Data scientists use pandas to explore data. Although it is one of the best tools for handling and analyzing data, it cannot handle large amounts of data. When their data volume increases, they need to choose and learn a new system from scratch, such as Apache Spark, to adapt and transform their current workload. Koalas bridges the gap by providing pandas-like APIs that work with Apache Spark. Because Koalas supports many operations that are difficult to execute with PySpark, such as visualizing data directly from a PySpark DataFrame, it is valuable not only for pandas users but also for PySpark users.
How to use Koalas with Pyspark?
Carrying out exploratory data analysis, feature engineering, and model construction are the norm while developing machine learning models. I’ll try to touch on the aforementioned topic with Koalas. Pandas API is reportedly used by the majority of Data scientists for feature engineering and EDA tasks. I’ll show how to utilize Koalas to complete these tasks and PySpark to create the model.
I’ll use the Bank marketing campaigns dataset. It is a dataset that describes the outcomes of Portugal’s bank marketing activities. Campaigns were generally conducted through direct phone conversations, offering bank clients the opportunity to put a term deposit. If the client decides to put in a deposit after all marking efforts, the target variable is marked ‘yes,’ otherwise ‘no.’ We will create a model to predict the client’s potential to sign up for a bank term deposit.
Import the necessary libraries
To install PySpark, you can read the official documentation.
To install Koalas use:
! pip install koalas
PySpark is not present on the system path by default. You can add it to the system path at runtime to make it importable. Use findspark to do this task.
import findspark findspark.init() findspark.find()
Now you can import PySpark.
from pyspark.ml.evaluation import BinaryClassificationEvaluator from pyspark.ml.feature import VectorAssembler from pyspark.ml.classification import RandomForestClassifier , LogisticRegression
Loading the dataset using Koalas
data = ks.read_csv('bank-additional-full.csv',sep=";") data.head(10)
We can easily use Koalas with Seaborn to generate graphs by converting the Koalas dataframe to a NumPy array.
data['y'] = data['y'].apply(lambda x: 0 if x=='no' else 1)
sns.countplot(data['y'].to_numpy()) plt.title('Deposit Distributions n (0: No || 1: Yes)', fontsize=14)
Similar to examining the data with Pandas, we can do the same with Koalas.
Moving further, we will create a list of categorical variables by selecting the columns with an object data type. Once we have a set of categorical variables, we will use get_dummies() to turn them into indicator variables.
#get the list of categorical columns cat_col=[col for col in data.columns.values if data[col].dtype=='object']
# sepearting the numerical and categorical feature df_cat=data[cat_col] df_num= data.drop(cat_col,axis=1)
df_features.columns = [col.replace('.','_') for col in df_features.columns]
We will now select features with a correlation greater than a certain threshold.
correlations = df_features.corr() all_features = correlations['y'].index.to_list() corr_y = correlations['y'].values selected_features =  for i in range(len(corr_y)): if abs(corr_y[i]) > 0.01: selected_features.append(all_features[i]) print(all_features[i]) df_features = df_features[selected_features]
Now we will scale the features using a MinMax scaler.
from sklearn.preprocessing import MinMaxScaler scaler = MinMaxScaler() data_standardised = scaler.fit_transform(df_features.values) df_std = ks.DataFrame(data_standardised) df_std.columns = selected_features
Creating Feature Vectors
Now that we have created the features with the help of Koalas, we will convert the Koalas DataFrame into a Spark Data frame. Then, using PySpark’s VectorAssembler, we’ll build a feature vector to train the model.
df =df_std.to_spark() features = df_features[selected_features].columns.to_list() features.remove('y') vectorAssembler = VectorAssembler(inputCols = features, outputCol = "vectors") vpp_sdf = vectorAssembler.transform(df) #to see data in jupyter notebook from IPython.display import HTML v = vpp_sdf.select('vectors').toPandas().head() HTML(v.to_html())
Train Test Split
splits = vpp_sdf.randomSplit([0.7,0.3]) train_df = splits test_df = splits
Training PySpark Model
We will use MLlib’s Random forest classifier. It takes two arguments: featuresCol (the feature vector) and labelCol (the label to predict).
rfc = RandomForestClassifier(labelCol="y", featuresCol="vectors") rfc_model = rfc.fit(train_df) predictions = rfc_model.transform(test_df)
Evaluate the model
We have two options for evaluating the model: utilize PySpark’s Binary classification evaluator, convert the predictions to a Koalas dataframe and use sklearn to evaluate the model. One advantage of using the latter is that we can easily visualize the results.
Using PySpark, we can find the area under ROC with the help of this code:
evaluator = BinaryClassificationEvaluator(labelCol="y", rawPredictionCol="prediction", metricName='areaUnderROC') predictionAndTarget = predictions.select("y", "prediction") evaluator.evaluate(predictionAndTarget)
Convert a Spark DataFrame to a Koalas DataFrame and simply compute the confusion matrix.
#convert spark dataframe to koalas dataframe kdf_predictions = ks.DataFrame(predictions)
cm = confusion_matrix(kdf_predictions['y'].to_numpy(),kdf_predictions['prediction'].to_numpy())
sns.heatmap(cm, annot=True, cmap='viridis',fmt='g')
Koalas can help scale our Pandas code on Spark without learning a new framework from scratch by swapping out one package with another. In this article, we build a binary classifier using Pyspark. The following are some major takeaways from the article:
- We learned how to use Koalas to process the data.
- We created feature vectors using PySpark’s VectorAssembler.
- Finally, we used a Random Forest classifier to train our model and evaluated the model using different methods.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Leave a Reply Your email address will not be published. Required fields are marked *