Top 32 Bash Commands for Data Scientists

guest_blog 02 Jun, 2023 • 13 min read

A command line is a valuable tool for productivity in daily Data Science activities. As Data Scientists, we are adept at using Jupyter Notebooks and RStudio to obtain, scrub, explore, model, and interpret data (OSEMN process). From Pandas to Tidyverse, messy data is handled very effectively and effortlessly to provide input to machine learning algorithms for modeling purposes. However, simple operations such as sorting dataframes and filtering rows are given a condition, or creating complex data pipelines for large datasets can be performed just as quickly using the Bash command-line interface. First released in 1989, Bash (a.k.a., Bourne Again Shell) is an essential part of a data scientist’s toolkit, but one that is not popularly taught in data science bootcamps, Master’s programs, and even online courses.

Source: Unsplash

This article will introduce you to the fantastic world of Bash, beyond the basic commands that are commonly used, such as printing a working directory using pwd, changing directories using cd, listing items in a folder using ls, copying things using cp, moving items using mv, deleting items using rm among others. After going through the article, you will have become familiar with in-built data wrangling commands available in Bash, ready at your disposal.

Bonus: A Cheatsheet has also been provided for quick reference to the commands and how they work! Some miscellaneous commands are included so make sure to check it out!

Basics of Bash
32 General Bash Commands
Data Processing Bash Commands
Other Useful Bash Commands
Bash Commands CheatSheet
Conclusion
Frequently Asked Questions

Basics of Bash

A Command-Line Interface (CLI) allows users to write commands to instruct the computer to perform specific tasks. “Shell” is a CLI and is so called because it restricts the outer layer of the system from the inner operating system, i.e., the kernel. Shell is tasked with reading the commands, interpreting them, and directing the operating system to perform those tasks.

Each command is preceded by a dollar sign ($) called the prompt. Only the $ sign is used in the examples in this article because the prompt is irrelevant to the actual commands and changes when you go to another directory, and it can also be customized. The command structure follows this sequence: command -options arguments.

32 General Bash Commands

Let’s look at some simple commands in Bash:

Which – outputs the full path of the command specified in the argument
rm – removes permanently any file (not folder)
rm -rf – recursively removes any file or folder permanently
ls – lists directories and files
mv – moves files and directories from one folder to another
cp – copies files and directories from one folder to another
mkdir – creates new directories
curl – downloads and uploads the data using FTP, HTTP, HTTPS and SFTP protocols
top – monitors running processes and memory being used by them
cd – changes the current directory
pwd – prints the path of the working directory
sudo – allows executing the command as a superuser
history – prints the list of past commands executed in Bash
clear – clears the screen
find – finds the files that have specific characteristics mentioned as an argument to the command
man – displays the user manual of any command
type – identifies whether the command is a built-in shell command, alias, keyword, or subroutine
pip – installs packages from PyPI and is most frequently used for installing Python packages
tar – archiving utility that can be used to compress files
whoami – displays the user ID (local account used to log in)
hostname -i – displays the hostname (name of the machine)
date – displays the date and time
cal – displays the calendar with the current date highlighted as per system settings
uname -r – displays the OS version
uptime – displays how long the system has been running
reboot – reboots the system
free – shows the amount of free and used-up memory space
df – shows the amount of disk space available
exit – used to exit the terminal
echo $0 – displays the current shell
lscpu – shows the CPU details
cat /etc/shells – displays all available shells in the system, such as Bourne shell (sh), Korn shell (ksh), Bash, C shell, etc.

Data Processing Bash Commands

Let us review the suite of commands in Bash that Data Scientists use. We will use two datasets, the first being Apple’s 40-year stock history from January 1, 1981, till December 31, 2020, that can be downloaded from Yahoo Finance here. The second dataset is a custom dataset as below.

1. wc command

wc command for word count returns the number of lines, words and characters in a file in this order.

Print number of lines, words, and characters of a file using wc command

You can use the input redirection operator “<” if you do not wish to see the result’s file path.

Print a file’s number of lines, words, and characters without showing the file path using the wc command.

Furthermore, use options -l to return the number of lines, -w to return the number of words, and -m to produce several characters.

Print the number of lines, words, and characters using wc command with specified options

2. head command

The head command returns top 10 lines in the file by default.

Print the first 10 lines of the file using head command

To see first n lines, pass in the -n option stating the number of lines.

Print the first 2 lines of the file using head command

Print the first 5 lines of the file using head command

3. tail command

The tail command returns bottom 10 lines in the file by default.

Print the last 10 lines of the file using tail command

Similar to head command, to see last n lines, pass in the -n option.

Print the last 3 lines of the file using tail command

4. cat command

cat for concatenate is a multi-purpose command used for creating files and viewing file content among its other uses. To view an existing file’s content, simply pass the file path as the argument. cat returns all the lines in the file.

Displaying first 5 lines of entire file output returned by cat command

To keep track of the line numbers, -n option can be used.

Displaying first 5 lines of entire file output returned by cat command, showing line numbers using -n option

To create a new file, use the output redirection operator “>” after cat and specify filename. Add your content in provided space, and press CTRL+D to exit editor.

Create a new file and add content using cat command

To append to an existing file, use the append operator “>>” after cat and specify filename. As before, add the content to be appended, and press CTRL+D to exit editor.

Append to an existing file using cat command

Display the contents of NewFile.csv after creation and appending using cat command

5. sort command

The sort command is used to sort contents of a file, by the ASCII order of blank first, then digits, then uppercase letters followed by lowercase letters.

Let us sort the custom dataset for understanding this command. By default, the sort command sorts in ascending order and acts lexicographically on the first character in each line in the dataset (I, 1, 7, 9, 2, 4). Lexicographic sorting means that “29” comes before “4,5”. However, since we have a comma-separated file, we want sort to act on columns, and by default, to act on the first column of (ID, 1312, 7891, 9112, 2236, 4561). Thus, we pass in the delimiter option -t and the comma delimiter.

Sorting the first column using default sort command with -t option

Notice that the header row is shifted to the end, since uppercase letters come after digits in the ASCII order. To sort numerically, we should use -n option. This ensures that sorting is only done numerically rather than lexicographically.

Sorting the first column numerically using sort command with -n option

Notice now that the header row is unaffected. To reverse sort the dataset, pass in the option -r. The output is the reverse of the numerical sorting output.

Reverse sorting the first column using sort command with -r option

To sort a particular column, pass in the -k option with the column number. Here, let us sort on age in ascending order.

Sorting age column using sort command with -k option

Let us also see an example of sorting non-numeric columns, such as the last column of “major”.

Sorting non-numeric column using sort command

The sort command can also be used to sort month columns using -M option, check if column is already sorted using -c option, remove duplicates and sort using -u option.

6. tr command

The tr command stands for “translate”, and is used for translating and deleting characters. It reads only from standard input and shows the output on standard output.

Here, we will introduce the pipe operator “|” that passes the standard output of one command as standard input into another command, like a pipeline. Let us again use the custom dataset for understanding this command.

To convert uppercase characters to lowercase, pass in the first argument as “[:upper:]” and second argument as “[:lower:]”, and vice-versa. Alternatively, first argument can be “[A-Z]” and second one will then be “[a-z]”.

Converting uppercase characters to lowercase using tr command

To translate the comma-separated file into a tab-separated file, use tr command with first argument as “,” and second argument as “\t”.

Converting csv file format to a tsv file format using tr command

To delete a character in a file, use the delete option -d with the tr command. The operation is case-sensitive.

Deleting the character “S” using tr command with -d option

Notice that the character “S” is deleted from the entire file. Similarly to remove all uppercase letters, use character string “[:upper:]”, to remove all digits, use character string “[:digit:]” and so on.

Deleting all digit characters using tr command with -d option

To delete everything except a character, use the complement option -c and the delete option -d with the tr command.

Delete all characters except uppercase letters using tr command with -c and -d options

To replace multiple continuous occurrences of character with single event, use the squeeze repeats option -s with the tr command giving only one argument as input.

Keeping a single character instance of “2” using tr command with -s option

To replace all single and multiple continuous occurrences of a character with another character, use the squeeze repeats option -s with the tr command giving two arguments as input. Note that all numerous endless events are also replaced with the single character.

Replacing all single and multiple occurrences of “2” with “h” using tr command with -s option

7. paste command

The paste command joins two files horizontally using a tab delimiter by default.

The -d option can be used to specify a custom delimiter. Let’s concatenate the two datasets with a comma delimiter and see the first 6 rows using the pipe operator “|”.

Concatenating two datasets horizontally using the paste command with -d option specified as a comm

8. uniq command

The uniq command detects and filters out duplicate rows in a file.

Let us use the custom dataset, and append two duplicate lines to the file first.

Appending duplicate rows to file using append operator “>>” with cat command

This is how the dataset looks like now.

View of data file after adding duplicate line items

Now, let’s see the number of lines with their count using -c option.

Find the count of each line item using uniq command with -c option

Notice that the detection of duplicate entries is case-sensitive. To ignore case, use the -i option.

Find the count of each line item ignoring case using uniq command with -c and -i options

Other valuable options with the uniq command include -u option that returns unique line items, and -d option that returns only duplicate line items.

9. grep command

grep stands for “global regular expression print” and is Bash’s in-built utility for searching line items matching a regular expression.

Let us search for all rows which have “John” using grep.

Searching for line items matching regular expression using grep command

Since grep is case-sensitive, we can use -i option to ignore the case for matching.

Searching for line items matching regular expression (case-insensitive) using grep command with -i option

The number of lines containing “John” can be returned using the count option -c.

Counting number of lines matching regular expression using grep command with -c option

To match whole words instead of a substring using grep command, use the word option -w. To demonstrate, let’s first append a new row using cat command.

Appending a new row in data file using append operator “>>” with cat command

Let’s see the default output of grep command.

Default search for regular expression using grep command

To search only for whole word of “ohn” instead of all substrings, let’s now use the word option -w.

Searching for whole words of regular expression using grep command

To keep track of the line numbers of line items returned by grep command, use the -n option.

Print line numbers for lines matched by regular expression using grep command with -n option

10. cut command

The cut command is used to cut and extract sections from each file line.

The field option -f must be used to return a particular column. The field counter for the option starts from 1 and not 0 for the first column onwards.

Return a field in a data file using cut command with the -f option

As we can see, Bash isn’t able to identify the columns, thus the delimiter option -d must used in conjunction with it.

Return a field in a data file using cut command with the -d and -f options

Let’s look at a more complex example of the cut command using the Apple stock prices dataset. Specifically, we want to see the columns – Date, High, Low, Volume – of the first 10 data rows in the file. To do this, we first get first 11 rows (including header) using head command as standard output, and pipe it into the cut command. Note that the field option -f gets multiple column numbers as input.

Returning a subset of rows and columns using head and cut commands with pipe operator

Other Useful Bash Commands

The “!$” unique character in Bash is used to designate the last argument of the preceding command. CTRL + R is used for reverse searching for commands through the Bash session history.

To understand it better, let’s see an example of “!$” character.

Returning standard output of data file using cat command

Now, say we want to look at only the first 3 rows in the file. Instead of repeatedly mentioning the entire file path in my new command, I can type “head -3 !$”. The “!$” unique character will automatically take in the path.

Printing the first 3 rows using “!$” unique character with head command

CTRL + R for reverse-i-search is beneficial for searching through any old and long command you’d written and want to bring up again. The command searches recursively starting at the last matched command, and moves up the history. Furthermore, the characters typed in get incrementally compared with the previous commands.

Reverse searching for grep commands in the session history

Reverse searching for grep command with -i option having “John” value in the session history

The entire command history of the session can be seen using history command if manual search needs to be done.

Bash Commands CheatSheet

Conclusion

Bash command line for data scientists is a very useful tool for some quick data analysis, without launching any integrated development environment. All the commands become more powerful tools when combined with Input/Output redirection (“<”, “>”, “>>”) and pipe (“|”) utilities of Bash. Experiment with these utilities and find efficient ways of wrangling with your data. Make sure to get your hands dirty to leverage Bash for your data needs!

Frequently Asked Questions

Q1. What are Bash commands?

A. Bash commands are executed in the Bash shell, a popular command-line interface and scripting language used in Unix-like operating systems. They allow users to interact with the operating system, execute programs, manipulate files, and perform various system-related tasks.

Q2. What is the list command for Bash?

A. The “ls” command is used in Bash to list the contents of a directory. It displays the files and directories in the current working directory or any specified directory, along with their details, such as permissions, size, and timestamps.

Q3. How do I start a Bash command?

A. To start a Bash command, you need to open a terminal or command prompt, depending on your operating system. Once the terminal is open, you can type and execute Bash commands directly by typing the command followed by pressing the Enter key.

Q4. What is Gitbash used for?

A. Git Bash is a command-line interface for Windows that provides a Unix-like environment, including a Bash shell and a collection of Git commands. It allows Windows users to use Git version control system and execute Bash commands, making it easier to work with Git repositories and perform command-line operations.

guest_blog 02 Jun 2023

Beginner Data Science Pandas

Top 32 Bash Commands for Data Scientists

Table of contents

Basics of Bash

32 General Bash Commands

Data Processing Bash Commands

1. wc command

2. head command

3. tail command

4. cat command

5. sort command

6. tr command

7. paste command

8. uniq command

9. grep command

10. cut command

Other Useful Bash Commands

Bash Commands CheatSheet

Conclusion

Frequently Asked Questions

Frequently Asked Questions

Responses From Readers

Related Courses

Top Data Science Projects for Analysts and Data Scientists

Free

Write for us