Storage Options and File Manipulation Commands in Azure Databricks

Vikas Verma 24 Jun, 2021 • 4 min read

This article was published as a part of the Data Science Blogathon

Databricks is a unified analytics platform built on top of Apache Spark for large-scale data processing, streaming, and machine learning applications. It also integrates with major cloud providers such as AWS and Azure to take advantage of the scale and performance of the cloud.

Azure Databricks
Image Source – Azure Databricks | Knoldus Blogs

 

Azure Databricks provides auto-scaling and auto-termination of clusters, automatic job scheduling, and simple job submission to clusters.

In this blog, we will discuss the storage options readily available in Azure Databricks, how they compare, and the different ways to interact with them.

Data in Azure Databricks can broadly be stored in three major storage types:

  1. Databricks File System (DBFS)
  2. Azure Blob Storage
  3. Azure Data Lake Storage Gen2 (ADLS Gen2)

In this post, we are going to discuss DBFS and Azure Blob Storage only.

 

DBFS (Databricks File System)

DBFS can be accessed in three main ways.

  1. File upload interface
  2. Databricks CLI
  3. dbutils

1. File upload interface

Files can be easily uploaded to DBFS using the Azure Databricks file upload interface, as shown below.

Azure Databricks DBFS

To upload a file, first click on the “Data” tab on the left (highlighted in red), then select “Upload File” and click on “browse” to select a file from the local file system. By default, files are uploaded to the “/FileStore/tables” folder (highlighted in yellow), but we can also upload to any other (or new) folder by specifying the folder name at upload time.

Downsides of the file upload interface

  • Folders cannot be uploaded directly; they need to be zipped before uploading.
  • Only the upload operation is supported; other file-level operations like copy, move, delete, and rename are not supported by this interface.
  • Similarly, it does not allow downloading files to the local file system.

2. Databricks CLI

The DBFS command-line interface (CLI) is a good alternative that overcomes the downsides of the file upload interface. Using it, we can interact with DBFS in a fashion similar to UNIX commands.

databricks-cli is a Python package that allows users to connect to and interact with DBFS.

Databricks CLI configuration steps

1. Install databricks-cli using –

pip install databricks-cli

2. Configure the CLI using –

databricks configure --token

3. The above command prompts for the Databricks host (workspace URL) and a personal access token. Provide these when prompted.
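Once configured, the CLI saves these values in a profile file (by default, ~/.databrickscfg on the local machine). The workspace URL and token shown below are placeholders for illustration only; yours will differ.

# placeholder values: replace with your own workspace URL and access token
[DEFAULT]
host = https://adb-1234567890123456.7.azuredatabricks.net
token = dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX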

Basic File-level operations using Databricks CLI

a. Listing files in DBFS

In the terminal, type:

dbfs ls

Similarly, to list the contents of a particular directory, specify the directory path (prefixed with dbfs:/) after ls. For example:

dbfs ls dbfs:/FileStore/tables
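For illustration only, assuming the directory already contains a file named databricks.jpg and a sub-folder temp_dir (both are used in the examples that follow), the output would look roughly like this:

databricks.jpg
temp_dir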

b. Making a new directory/folder

# mkdirs command 
dbfs mkdirs directory_path 
# For e.g. 
dbfs mkdirs dbfs:/FileStore/tables/temp_dir

c. Copying files/folder from local to DBFS and vice-versa

# To copy a file 
dbfs cp source_file_path destination_path 
# From local to DBFS 
dbfs cp /home/user1/Desktop/databricks.jpg dbfs:/FileStore/tables 
# From DBFS to local 
dbfs cp dbfs:/FileStore/tables/databricks.jpg /home/user1
# To copy a folder 
dbfs cp -r source_folder_path destination_folder_path 
# From local to DBFS 
dbfs cp -r /home/user1/Desktop/dummy_folder dbfs:/FileStore/tables/dummy_folder 
# From DBFS to local 
dbfs cp -r dbfs:/FileStore/tables/dummy_folder /home/user1/dummy_folder

d. Moving/Renaming files over DBFS

# Move command 
dbfs mv source_file_path destination_file_path 
# Moving file in a different folder 
dbfs mv dbfs:/FileStore/tables/databricks.jpg dbfs:/FileStore/tables/temp_dir/databricks.jpg
# Renaming file 
dbfs mv dbfs:/FileStore/tables/temp_dir/databricks.jpg dbfs:/FileStore/tables/temp_dir/databricks1.jpg

e. Deleting files & folder

# rm command 
dbfs rm [-r] file_or_folder_path 
# deleting a file 
dbfs rm dbfs:/FileStore/tables/temp_dir/databricks1.jpg
# deleting a folder 
dbfs rm -r dbfs:/FileStore/tables/dummy_folder

NOTE – Commands Source: https://docs.databricks.com/dev-tools/cli/dbfs-cli.html

3. dbutils

Programmatically (specifically using Python), DBFS can be easily accessed and manipulated using dbutils.fs commands.

# listing the contents of a directory
dbutils.fs.ls("/FileStore")
# making a new directory
dbutils.fs.mkdirs("/FileStore/tables/temp_dir2")
# copying a file
dbutils.fs.cp("/FileStore/tables/databricks.jpg", "/FileStore/tables/temp_dir2")
# copying a folder
dbutils.fs.cp("/FileStore/tables/temp_dir", "/FileStore/tables/temp_dir2/temp_dir", recurse=True)
# moving a file
dbutils.fs.mv("/FileStore/tables/temp_dir/databricks.jpg", "/FileStore/tables/temp_dir2/databricks.jpg")
# moving a folder
dbutils.fs.mv("/FileStore/tables/temp_dir", "/FileStore/tables/temp_dir2/temp_dir", recurse=True)
# deleting a file
dbutils.fs.rm("/FileStore/tables/temp_dir2/databricks.jpg")
# deleting a folder
dbutils.fs.rm("/FileStore/tables/temp_dir2/temp_dir", recurse=True)

NOTE – Commands Source: https://docs.databricks.com/_static/notebooks/dbutils.html

Azure Blob Storage

Data can also be stored in Azure Blob Storage, which is ideal for storing massive amounts of unstructured data.

Before storing data in Azure Blob Storage, we first need to create a storage account in the Azure portal. Within a storage account, we can have multiple containers. We have already created a storage account named “dummydatastorage” here for demo purposes (this is the account name that appears in the SAS URLs below).
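For readers who prefer a terminal over the portal, a minimal sketch using the Azure CLI (az) is shown below; this tool is not covered further in this article, and the resource group name and region used here are hypothetical placeholders.

# Hypothetical sketch: creating the storage account with the Azure CLI instead of the portal
# ("demo-rg" and "eastus" are placeholder values)
az storage account create --name dummydatastorage --resource-group demo-rg --location eastus --sku Standard_LRS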

A container can be created either using the portal interface or AzCopy (a command-line utility).

AzCopy

AzCopy is a command-line utility for transferring data between the local computer and a storage account. To download and install AzCopy, follow the steps in the official documentation.
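Once installed, a quick way to confirm that AzCopy is available (assuming it has been added to the PATH) is to check its version from the terminal:

# sanity check: prints the installed AzCopy version
azcopy --version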

In order to provide secure, delegated access to storage accounts, Azure offers Shared Access Signatures (SAS) for its resources.

A SAS for a storage account can be easily obtained from the account's home page, as shown below.

Azure Databricks AzCopy

Click on the “Shared access signature” tab on the left side (as highlighted), then check all the boxes under “Allowed resource types” (as highlighted).

Now click on the “Generate SAS and connection string” button as shown below.

Azure Databricks Generate SAS collection

Copy the SAS URL from the “Blob service SAS URL” field. This URL is required in the AzCopy commands.

Now, let’s discuss some common operations using AzCopy.

a. Creating Container

azcopy make "<SAS_URL>" 
# For e.g. 
sudo azcopy make "https://dummydatastorage.blob.core.windows.net/dummycontainer?sv=2020-02-10&ss=bfqt&srt=sco&sp=rwdlacuptfx&se=2021-06-22T21:53:01Z&st=2021-06-22T13:53:01Z&spr=https&sig=7Kv6vyhGN78700hjT%2FTeHx%2BeVfIdzazaSM6LnutuROM%3D"

NOTE – Add the container name to the SAS URL after dummydatastorage.blob.core.windows.net/. Here, the container name is “dummycontainer”.

b. Copying data from local to Azure Blob and vice-versa

# Copying a file 
azcopy copy '<local-file-path>' '<SAS_URL>' 
# For e.g. 
sudo azcopy copy "/home/user1/Desktop/databricks.jpg" "https://dummydatastorage.blob.core.windows.net/dummycontainer?sv=2019-12-12&ss=bfqt&srt=sco&sp=rwdlacupx&se=2020-10-29T16:58:42Z&st=2020-10-29T08:58:42Z&spr=https&sig=TJ2Ujv%2FHkm0x5NZkbQkcHhI4SshPKSUWsY%2BP2GkZ6kk%3D"
#Copying a folder 
azcopy copy '<local-directory-path>' '<SAS_URL>' [--recursive] 
# For e.g. 
sudo azcopy copy "/home/user1/Desktop/dummy_folder" "https://dummydatastorage.blob.core.windows.net/dummycontainer?sv=2019-12-12&ss=bfqt&srt=sco&sp=rwdlacupx&se=2020-10-29T16:58:42Z&st=2020-10-29T08:58:42Z&spr=https&sig=TJ2Ujv%2FHkm0x5NZkbQkcHhI4SshPKSUWsY%2BP2GkZ6kk%3D" --recursive

Similarly, data can be copied from Azure Blob Storage to the local machine by swapping the source and destination in the azcopy copy command, as sketched below.
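As a rough sketch (reusing the container and folder names from the examples above; replace <SAS-token> with your own token), downloading dummy_folder back to the local machine would look like this:

# Copying a folder from Azure Blob to local ("dummy_folder" and "<SAS-token>" are placeholders from the examples above)
sudo azcopy copy "https://dummydatastorage.blob.core.windows.net/dummycontainer/dummy_folder?<SAS-token>" "/home/user1/Desktop" --recursive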

NOTE – Commands Source: https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-files

 

End Notes

In this article, we discussed the storage options available in Azure Databricks and the commands to perform various file- and directory-level operations.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

 

