How to Encrypt and Decrypt the Data in PySpark?

Kishan Yadav 10 Jan, 2023 • 6 min read

This article was published as a part of the Data Science Blogathon.

Introduction

Sharing data has become easy: a few clicks are enough to send personal details. To access services, we share essential details such as email IDs, phone numbers, and social security numbers. These details can leak if the service provider doesn’t follow a robust data protection methodology. Many data breaches result from negligent or accidental exposure and can affect users personally, professionally, or financially. Email IDs, phone numbers, and government-issued ID numbers are sensitive and confidential, so we must protect them from falling into the wrong hands.

In this article, we will work through two methods of encrypting such data so that unauthorized users can’t read it. We will see how to encrypt and decrypt sensitive data using PySpark.

Why is There a Need for Data Encryption?

Data encryption is essential in several contexts. Suppose an organization serves several clients and has to exchange data with them to provide its services. Clients share confidential details with the firm, such as their databases, customer information, and the products they sell or purchase.

All these details are sensitive and must be protected so they can’t get into the wrong hands. If unauthorized individuals access these data, it can lead to severe consequences such as financial loss, reputational damage, or even legal liabilities.

So data encryption helps us to protect sensitive and confidential information. It is a very crucial aspect of data security.

Data Frame Creation

To perform encryption and decryption, we need sample data with essential information like user email id, mobile number, and social security number. Before sending these details to anyone, they need to be encrypted. So, we will create a sample dataframe that has this information. The dataframe has four columns named ‘customer_name’, ‘mail_id’, ‘mobile_num’, and ‘social_security_number’. The column descriptions are as follows:-

  • customer_name:- This column contains the customers’ names.
  • mail_id:- This column has the customers’ email addresses.
  • mobile_num:- This column has the customers’ mobile numbers.
  • social_security_number:- This column has the customers’ government-issued social security numbers.
# import necessary libs
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType
# create a SparkSession
spark = SparkSession.builder.appName("demo").getOrCreate()
# define the schema for the DataFrame
schema = StructType([
        StructField("customer_name", StringType(), True),
        StructField("mail_id", StringType(), True),
        StructField("mobile_num", LongType(), True),
        StructField("social_security_number", StringType(), True)
])
# create the sample data
data = [ ("Max", '[email protected]', 9789457864, '7548-8546-4512'),
("Michael", '[email protected]', 9089848243, '7845-8745-8756'),
("Alex", '[email protected]', 9589848643, '3245-6547-9854'),
("Hector", '[email protected]', 9189648245, '6547-7845-2150')
]
# create the DataFrame
df = spark.createDataFrame(data, schema)
df.show()

The output of the above dataframe will be:-

+-------------+-------------------+----------+----------------------+
|customer_name|            mail_id|mobile_num|social_security_number|
+-------------+-------------------+----------+----------------------+
|          Max|      [email protected]|9789457864|        7548-8546-4512|
|      Michael|[email protected]|9089848243|        7845-8745-8756|
|         Alex|[email protected]|9589848643|        3245-6547-9854|
|       Hector|[email protected]|9189648245|        6547-7845-2150|
+-------------+-------------------+----------+----------------------+

The above data, like name, email, mobile number, and social security number, are the user’s personal information and can’t be shared directly with any other person or organization. To share these details, we must encrypt the data before sending it. On the other end, the receiver can decrypt the data with the key.

Using aes_encrypt and aes_decrypt Functions

We will first use Spark’s built-in functions to encrypt the above dataframe. We will use the aes_encrypt function to encrypt the ‘mail_id’, ‘mobile_num’, and ‘social_security_number’ columns. Later we will use the aes_decrypt function to decrypt the encrypted data. The decrypted values are then compared with the original values to confirm that decryption succeeded.

aes_encrypt() – This function encrypts plain text. We pass the column whose data needs to be encrypted as the expr argument, followed by the key, which is also needed later to decrypt the data. Then we pass the mode and, finally, the padding. The output of this function is the encrypted values.

This function will take the following arguments as input:-

  • ‘expr’ – The binary value to encrypt.
  • ‘key’ – The passphrase to use to encrypt the data.
  • ‘mode’ – The block cipher mode used to encrypt the messages. Valid modes are ECB and GCM.
  • ‘padding’ – How to pad messages whose length is not a multiple of the block size. Valid values are PKCS, NONE, and DEFAULT. The DEFAULT padding means PKCS for ECB and NONE for GCM.
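
To make the padding argument more concrete, here is a minimal pure-Python sketch of the general PKCS#7 scheme for a 16-byte block. The helper names pkcs7_pad and pkcs7_unpad are made up for this illustration, and it is an assumption that Spark's PKCS option behaves the same way internally; the sketch only shows the idea of filling the last block.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pkcs7_pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Append N bytes of value N so the length becomes a multiple of block_size."""
    pad_len = block_size - (len(data) % block_size)
    return data + bytes([pad_len]) * pad_len


def pkcs7_unpad(padded: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Validate and strip the padding added by pkcs7_pad."""
    if not padded or len(padded) % block_size != 0:
        raise ValueError("padded data must be a non-empty multiple of the block size")
    pad_len = padded[-1]
    if pad_len < 1 or pad_len > block_size:
        raise ValueError("invalid padding length")
    if padded[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("invalid padding bytes")
    return padded[:-pad_len]


# A 14-byte value is extended to one full 16-byte block.
padded = pkcs7_pad(b"7548-8546-4512")
print(len(padded), pkcs7_unpad(padded))
```

Note that a message that is already a multiple of the block size still receives a full extra block of padding, so that the padding can always be removed unambiguously.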

Syntax of this function is aes_encrypt(expr, key[, mode[, padding]]). The output of this function will be encrypted data values. This function supports key lengths of 16, 24, and 32 bytes. The default mode is GCM.

Now we will pass the column names to aes_encrypt inside the expr function to encrypt the data values. The columns whose data we will encrypt are ‘mail_id’, ‘mobile_num’, and ‘social_security_number’. We are going to store the encrypted data in a new dataframe.
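
Since AES keys must be exactly 16, 24, or 32 bytes long (128, 192, or 256 bits), it can help to check a key before handing it to Spark. The following is a small sketch of such a check; the helper name aes_variant is hypothetical and not part of PySpark.

```python
# Map each supported key length in bytes to the AES variant it selects.
AES_KEY_SIZES = {16: "AES-128", 24: "AES-192", 32: "AES-256"}


def aes_variant(key: str) -> str:
    """Return the AES variant for a key, or raise if its byte length is unsupported."""
    key_bytes = key.encode("utf-8")
    try:
        return AES_KEY_SIZES[len(key_bytes)]
    except KeyError:
        raise ValueError(
            f"key must be 16, 24, or 32 bytes long, got {len(key_bytes)}"
        ) from None


print(aes_variant("1234567890abcdef"))
```

The length is measured in bytes after UTF-8 encoding, so a key containing non-ASCII characters may be longer than its character count suggests.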

from pyspark.sql.functions import expr
# mobile_num is a LongType column, so it is cast to a string before encryption
enc_df = (
    df.withColumn('encrypted_mail', expr("base64(aes_encrypt(mail_id, '1234567890abcdef', 'ECB', 'PKCS'))"))
      .withColumn('encrypted_mobile_num', expr("base64(aes_encrypt(cast(mobile_num as string), '1234567890abcdgh', 'ECB', 'PKCS'))"))
      .withColumn('encrypted_ssn', expr("base64(aes_encrypt(social_security_number, '1234567890abcdij', 'ECB', 'PKCS'))"))
)
enc_df.show()

Here we create the new columns with the ‘withColumn’ function and pass the column to encrypt inside the expr function. We use ‘1234567890abcdef’ as the key to encrypt the ‘mail_id’ data, with ECB as the mode and PKCS as the padding. The same goes for the other two columns; only the keys are different. We also apply ‘base64’ to convert the encrypted bytes into a text string.

Now we get the encrypted data which looks like this.

+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|customer_name|            mail_id|mobile_num|social_security_number|      encrypted_mail|encrypted_mobile_num|       encrypted_ssn|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|          Max|      [email protected]|9789457864|        7548-8546-4512|sk33JvRxTV9PU11qw...|4DF70TSV5/k2f7XDy...|kruqxwUhDD582Q4mf...|
|      Michael|[email protected]|9089848243|        7845-8745-8756|RzIRtA7ihZG7YlRj9...|eaMgFEdzEkqz7b6+Q...|QvfthH7TQqL6aJNp6...|
|         Alex|[email protected]|9589848643|        3245-6547-9854|ZahqBXBlprhgNfTyU...|msPEyWULCkIhbtel0...|1Majk18XVhQIJ10J5...|
|       Hector|[email protected]|9189648245|        6547-7845-2150|O3JpFSx0DGqs+XSIO...|647cANlvcGS4rwwVU...|cMH3zNTAgq8RmHL5R...|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+

So, we have our encrypted data, and now we will see how to decrypt this data and get our original data back.

aes_decrypt() – We use this function to decrypt the data values. We pass the column whose data needs to be decrypted, together with the same key, mode, and padding that were used for encryption. It returns the decrypted values as the final output.

This function will take the following arguments as input:-

  • ‘expr’ – The binary value to decrypt.
  • ‘key’ – The passphrase to use to decrypt the data.
  • ‘mode’ – The block cipher mode used to decrypt the messages. Valid modes are ECB and GCM.
  • ‘padding’ – How messages whose length is not a multiple of the block size were padded. Valid values are PKCS, NONE, and DEFAULT. The DEFAULT padding means PKCS for ECB and NONE for GCM.

Syntax of this function is aes_decrypt(expr, key[, mode[, padding]]). The output of this function will be the decrypted original data values. This function supports key lengths of 16, 24, and 32 bytes.

Now we will pass the encrypted data columns in this function and compare the results with the original data.
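
The decryption call itself is not listed in this article, so the following is only a sketch of how it could look. It assumes Spark 3.3 or later, the enc_df dataframe with base64-encoded columns created earlier, and the same three keys. The helper names aes_decrypt_expr and decrypt_columns are hypothetical.

```python
def aes_decrypt_expr(column: str, key: str) -> str:
    """Build a SQL expression that undoes base64, decrypts, and casts bytes to text."""
    return f"cast(aes_decrypt(unbase64({column}), '{key}', 'ECB', 'PKCS') as string)"


def decrypt_columns(enc_df):
    """Return a dataframe with the three encrypted columns decrypted (needs PySpark)."""
    # Imported here so the expression helper above can be used without Spark.
    from pyspark.sql.functions import expr

    return (
        enc_df
        .withColumn("decrypted_mail",
                    expr(aes_decrypt_expr("encrypted_mail", "1234567890abcdef")))
        .withColumn("decrypted_mobile_num",
                    expr(aes_decrypt_expr("encrypted_mobile_num", "1234567890abcdgh")))
        .withColumn("decrypted_ssn",
                    expr(aes_decrypt_expr("encrypted_ssn", "1234567890abcdij")))
        .select("customer_name", "decrypted_mail", "decrypted_mobile_num", "decrypted_ssn")
    )


print(aes_decrypt_expr("encrypted_mail", "1234567890abcdef"))
```

With a running Spark session, decrypt_columns(enc_df).show() would display the decrypted columns. Each column has to be decrypted with the key it was encrypted with; aes_decrypt returns binary data, which is why the result is cast back to a string.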

# original data
+-------------+-------------------+----------+----------------------+
|customer_name|            mail_id|mobile_num|social_security_number|
+-------------+-------------------+----------+----------------------+
|          Max|      [email protected]|9789457864|        7548-8546-4512|
|      Michael|[email protected]|9089848243|        7845-8745-8756|
|         Alex|[email protected]|9589848643|        3245-6547-9854|
|       Hector|[email protected]|9189648245|        6547-7845-2150|
+-------------+-------------------+----------+----------------------+
# encrypted data
+-------------+--------------------+--------------------+--------------------+
|customer_name|      encrypted_mail|encrypted_mobile_num|       encrypted_ssn|
+-------------+--------------------+--------------------+--------------------+
|          Max|sk33JvRxTV9PU11qw...|4DF70TSV5/k2f7XDy...|kruqxwUhDD582Q4mf...|
|      Michael|RzIRtA7ihZG7YlRj9...|eaMgFEdzEkqz7b6+Q...|QvfthH7TQqL6aJNp6...|
|         Alex|ZahqBXBlprhgNfTyU...|msPEyWULCkIhbtel0...|1Majk18XVhQIJ10J5...|
|       Hector|O3JpFSx0DGqs+XSIO...|647cANlvcGS4rwwVU...|cMH3zNTAgq8RmHL5R...|
+-------------+--------------------+--------------------+--------------------+
# decrypted data
+-------------+-------------------+--------------------+--------------+
|customer_name|     decrypted_mail|decrypted_mobile_num| decrypted_ssn|
+-------------+-------------------+--------------------+--------------+
|          Max|      [email protected]|          9789457864|7548-8546-4512|
|      Michael|[email protected]|          9089848243|7845-8745-8756|
|         Alex|[email protected]|          9589848643|3245-6547-9854|
|       Hector|[email protected]|          9189648245|6547-7845-2150|
+-------------+-------------------+--------------------+--------------+

Using Cryptography Library

Now we will use the cryptography library to perform encryption and decryption. In this, we will create a user-defined function (udf) that will take data and complete the encryption and decryption.

Encrypting –

# import necessary libs
from pyspark.sql.functions import udf, lit, col
from pyspark.sql.types import StringType
from cryptography.fernet import Fernet
# encrypt func
def encrypt_data(plain_text, KEY):
    # the key arrives in the UDF as a binary value, so convert it to bytes
    f = Fernet(bytes(KEY))
    encrypted_text = f.encrypt(str(plain_text).encode()).decode()
    return encrypted_text
encrypt_udf = udf(encrypt_data, StringType())
# generate the encryption key
Key = Fernet.generate_key()
# encrypt the 'mail_id', 'mobile_num', and 'social_security_number' cols
enc_df = (
    df.withColumn("encrypted_mail_id", encrypt_udf(col('mail_id'), lit(Key)))
      .withColumn("encrypted_mobile_num", encrypt_udf(col('mobile_num'), lit(Key)))
      .withColumn("encrypted_ssn", encrypt_udf(col('social_security_number'), lit(Key)))
)
enc_df.show()

Here we first generate the encryption key using the cryptography library, then pass each column that we want to encrypt to the udf along with the key. Now we will see the encrypted results.
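
It may help to know what the generated key looks like. A Fernet key is 32 random bytes encoded as URL-safe base64, which gives a 44-character value. The sketch below uses only the standard library to build a stand-in key of that shape and to check a key's format. The helper name looks_like_fernet_key is hypothetical, and the check says nothing about whether a key is the right one for a given ciphertext.

```python
import base64
import os


def looks_like_fernet_key(key: bytes) -> bool:
    """Return True if key is URL-safe base64 that decodes to exactly 32 bytes."""
    try:
        return len(base64.urlsafe_b64decode(key)) == 32
    except (ValueError, TypeError):
        return False


# Stand-in with the same shape as a generated Fernet key:
# 32 random bytes, URL-safe base64-encoded.
sample_key = base64.urlsafe_b64encode(os.urandom(32))
print(len(sample_key), looks_like_fernet_key(sample_key))
```

Because the key is the only thing needed to decrypt the data, it has to be stored and shared securely; anyone who holds it can read every encrypted column.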

+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|customer_name|            mail_id|mobile_num|social_security_number|   encrypted_mail_id|encrypted_mobile_num|       encrypted_ssn|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|          Max|      [email protected]|9789457864|        7548-8546-4512|gAAAAABjpED66V3Xw...|gAAAAABjpED6oaixb...|gAAAAABjpED6TWeAg...|
|      Michael|[email protected]|9089848243|        7845-8745-8756|gAAAAABjpED7nVl6j...|gAAAAABjpED77xy8P...|gAAAAABjpED7D73yg...|
|         Alex|[email protected]|9589848643|        3245-6547-9854|gAAAAABjpED7Iuq5N...|gAAAAABjpED73BQYd...|gAAAAABjpED7OjE8W...|
|       Hector|[email protected]|9189648245|        6547-7845-2150|gAAAAABjpED7sT3Tz...|gAAAAABjpED7lH29J...|gAAAAABjpED7SXANT...|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+

So, the ‘mail_id’, ‘mobile_num’, and ‘social_security_number’ columns are now encrypted.

Now we will see how to decrypt these encrypted columns to get the original values back.

Decrypting-

def decrypt_data(cipher_text, KEY):
    # the key arrives in the UDF as a binary value, so convert it to bytes
    f = Fernet(bytes(KEY))
    decoded_val = f.decrypt(cipher_text.encode()).decode()
    return decoded_val
decrypt_udf = udf(decrypt_data, StringType())
# decrypt the data
dec_df = (
    enc_df.withColumn("decrypted_mail_id", decrypt_udf(col('encrypted_mail_id'), lit(Key)))
          .withColumn("decrypted_mobile_num", decrypt_udf(col('encrypted_mobile_num'), lit(Key)))
          .withColumn("decrypted_ssn", decrypt_udf(col('encrypted_ssn'), lit(Key)))
          .drop('mail_id', 'mobile_num', 'social_security_number')
)
dec_df.show()

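Here we successfully decrypted the data and got back our original values. We can now see the result and compare it with the actual data. The encrypted values differ between the two outputs because every Fernet token includes a timestamp and a random IV, and Spark re-runs the udf for each action unless the dataframe is cached.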

# original and encrypted data
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|customer_name|            mail_id|mobile_num|social_security_number|   encrypted_mail_id|encrypted_mobile_num|       encrypted_ssn|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|          Max|      [email protected]|9789457864|        7548-8546-4512|gAAAAABjpED66V3Xw...|gAAAAABjpED6oaixb...|gAAAAABjpED6TWeAg...|
|      Michael|[email protected]|9089848243|        7845-8745-8756|gAAAAABjpED7nVl6j...|gAAAAABjpED77xy8P...|gAAAAABjpED7D73yg...|
|         Alex|[email protected]|9589848643|        3245-6547-9854|gAAAAABjpED7Iuq5N...|gAAAAABjpED73BQYd...|gAAAAABjpED7OjE8W...|
|       Hector|[email protected]|9189648245|        6547-7845-2150|gAAAAABjpED7sT3Tz...|gAAAAABjpED7lH29J...|gAAAAABjpED7SXANT...|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
# decrypted data
+-------------+--------------------+--------------------+--------------------+-------------------+--------------------+--------------+
|customer_name|   encrypted_mail_id|encrypted_mobile_num|       encrypted_ssn|  decrypted_mail_id|decrypted_mobile_num| decrypted_ssn|
+-------------+--------------------+--------------------+--------------------+-------------------+--------------------+--------------+
|          Max|gAAAAABjpEE9TcrVL...|gAAAAABjpEE907red...|gAAAAABjpEE92mIuZ...|      [email protected]|          9789457864|7548-8546-4512|
|      Michael|gAAAAABjpEE9UXJF6...|gAAAAABjpEE9OlqYJ...|gAAAAABjpEE9TV8rm...|[email protected]|          9089848243|7845-8745-8756|
|         Alex|gAAAAABjpEE93b3z_...|gAAAAABjpEE9knvQ7...|gAAAAABjpEE9rXc4g...|[email protected]|          9589848643|3245-6547-9854|
|       Hector|gAAAAABjpEE9bbV1Z...|gAAAAABjpEE9DfOWj...|gAAAAABjpEE9Lvw6g...|[email protected]|          9189648245|6547-7845-2150|
+-------------+--------------------+--------------------+--------------------+-------------------+--------------------+--------------+

Note:- Encryption and hashing are different things. Hashing, once done, cannot be reverted to the original data. Encrypted values, on the other hand, can be decrypted later with the key to get the actual data back.
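
To illustrate the hashing side of this note, here is a short sketch using Python's standard hashlib module. The helper name sha256_hex is made up for this example. Unlike the encryption functions above, it takes no key and has no matching function that turns the digest back into the input.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a string as 64 hexadecimal characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


digest = sha256_hex("7548-8546-4512")
# The digest always has the same length, whatever the input length.
print(len(digest))
# The same input always produces the same digest.
print(digest == sha256_hex("7548-8546-4512"))
```

This makes hashing suitable for checking whether two values match without storing them in the clear, while encryption is the right choice when the receiver needs the original value back.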

Conclusion

In this article, we have covered two methods to encrypt and decrypt data while sharing. By doing so, we can ensure that our data is kept secure and protected from unauthorized access. In PySpark, we can achieve this by following the above two methods and efficiently safeguarding our data.

Key takeaways from this article are:-

  1. We have defined the dataframe and used the ‘aes_encrypt’ and ‘aes_decrypt’ functions to protect our data.
  2. Then we compare the results after decrypting the data to ensure we get the same original data.
  3. Then we use the cryptography library to encrypt and decrypt our data.
  4. In this, we have written a user-defined function (udf) and then used this function to perform the data encryption.

This article helps you to perform encryption and decryption in PySpark. If you have any opinions or questions, then comment down below. Connect with me on LinkedIn for further discussion.

Keep Learning!!!

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
