I am a huge fan of the NumPy library in Python. I have relied on it countless times during my data science journey to perform all sorts of tasks, from basic mathematical operations to using it for image classification!
In short – NumPy is one of the most fundamental libraries in Python and perhaps the most useful of them all. NumPy handles large datasets effectively and efficiently. I can see your eyes glinting at the prospect of mastering NumPy already. 🙂 As a data scientist or as an aspiring data science professional, we need to have a solid grasp on NumPy and how it works in Python.
In this article, I am going to start off by describing what the NumPy library is and why you should prefer it over the ubiquitous but cumbersome Python lists. Then, we will cover some of the most basic NumPy operations that will get you hooked on to this awesome library!
If you’re new to Python, don’t worry! You can take the comprehensive (and free) Python course to learn everything you need to get started with data science programming!
NumPy stands for Numerical Python and is one of the most useful scientific libraries in Python programming. It provides support for large multidimensional array objects and various tools to work with them. Various other libraries like Pandas, Matplotlib, and Scikit-learn are built on top of this amazing library.
An array is a collection of elements (values) that can have one or more dimensions. A one-dimensional array is called a vector, while a two-dimensional array is called a matrix.
NumPy arrays are called ndarrays, or N-dimensional arrays, and they store elements of the same type and size. NumPy is known for its high performance and provides efficient storage and data operations as arrays grow in size.
NumPy comes pre-installed when you download Anaconda. But if you want to install NumPy separately on your machine, just type the below command on your terminal:
pip install numpy
Now you need to import the library:
import numpy as np
np is the de facto abbreviation for NumPy used by the data science community.
If you’re familiar with Python, you might be wondering why use NumPy arrays when we already have Python lists? After all, these Python lists act as an array that can store elements of various types. This is a perfectly valid question and the answer to this is hidden in the way Python stores an object in memory.
In Python, a variable is actually a pointer to a memory location that stores all the details about the object, such as its type and its value. Although this extra information is what makes Python a dynamically typed language, it also comes at a cost, which becomes apparent when storing a large collection of objects, like in an array.
Python lists are essentially an array of pointers, each pointing to a location that contains the information related to the element. This adds a lot of overhead in terms of memory and computation. And most of this information is rendered redundant when all the objects stored in the list are of the same type!
To overcome this problem, we use NumPy arrays that contain only homogeneous elements, i.e. elements having the same data type. This makes NumPy more efficient at storing and manipulating the array. The difference becomes apparent when the array has a large number of elements, say thousands or millions. Also, with NumPy arrays, you can perform element-wise operations directly, something that Python lists cannot do without an explicit loop or list comprehension!
This is the reason why NumPy arrays are preferred over Python lists when performing mathematical operations on a large amount of data.
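A minimal sketch of the element-wise difference described above. The variable names and values are illustrative assumptions, not taken from the article:

```python
import numpy as np

prices_list = [10, 20, 30, 40]
prices_array = np.array(prices_list)

# Multiplying a list by 2 repeats the list; it does not scale the elements.
repeated = prices_list * 2

# Scaling a list element by element needs an explicit loop or comprehension.
doubled_list = [p * 2 for p in prices_list]

# The same arithmetic on an ndarray is applied to every element.
doubled_array = prices_array * 2

print(repeated)       # [10, 20, 30, 40, 10, 20, 30, 40]
print(doubled_list)   # [20, 40, 60, 80]
print(doubled_array)  # [20 40 60 80]
```

The list comprehension and the array expression give the same numbers, but the array version avoids the Python-level loop.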
NumPy arrays are very easy to create given the complex problems they solve. To create a very basic ndarray, you use the np.array() method. All you have to pass are the values of the array as a list:
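For instance, a minimal sketch with assumed values (the values are chosen here only for illustration):

```python
import numpy as np

# Build a 1-D ndarray from a plain Python list of integers.
arr = np.array([1, 2, 3, 4])

print(arr)                  # [1 2 3 4]
print(type(arr).__name__)   # ndarray
```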
This array contains integer values. You can specify the type of data in the dtype argument:
np.array([1,2,3,4],dtype=np.float32)
Output:
array([1., 2., 3., 4.], dtype=float32)
Since NumPy arrays can contain only homogeneous datatypes, values will be upcast if the types do not match:
np.array([1,2.0,3,4])
Output:
array([1., 2., 3., 4.])
Here, NumPy has upcast integer values to float values.
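One way to confirm which type NumPy chose is to inspect the dtype attribute of each array. This is a small sketch; the variable names are illustrative, and the default integer type depends on the platform:

```python
import numpy as np

ints = np.array([1, 2, 3, 4])
mixed = np.array([1, 2.0, 3, 4])
explicit = np.array([1, 2, 3, 4], dtype=np.float32)

print(ints.dtype)      # a platform-dependent integer type
print(mixed.dtype)     # float64
print(explicit.dtype)  # float32
```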
NumPy arrays can be multi-dimensional too.
np.array([[1,2,3,4],[5,6,7,8]])
Here, we created a 2-dimensional array of values.
Note: A matrix is just a rectangular array of numbers with shape N x M where N is the number of rows and M is the number of columns in the matrix. The one you just saw above is a 2 x 4 matrix.
NumPy lets you create an array of all zeros using the np.zeros() method. All you have to do is pass the shape of the desired array:
np.zeros(5)
The one above is a 1-D array while the one below is a 2-D array:
np.zeros((2,3))
Similarly, you can create an array of all ones using the np.ones() method:
np.ones(5,dtype=np.int32)
Another very commonly used method to create ndarrays is np.random.rand(). It creates an array of a given shape with random values drawn uniformly from [0,1):
# random
np.random.rand(2,3)
array([[0.95580785, 0.98378873, 0.65133872],
       [0.38330437, 0.16033608, 0.13826526]])
Or, in fact, you can create an array filled with any given value using the np.full() method. Just pass in the shape of the desired array and the value you want:
np.full((2,2),7)
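A short sketch that puts the constant-filled constructors side by side and checks the shape and dtype each one produces. The variable names are illustrative:

```python
import numpy as np

zeros_2d = np.zeros((2, 3))             # float64 unless a dtype is given
ones_int = np.ones(5, dtype=np.int32)   # the dtype can be overridden
sevens = np.full((2, 2), 7)             # dtype is inferred from the fill value

print(zeros_2d.shape, zeros_2d.dtype)   # (2, 3) float64
print(ones_int)                         # [1 1 1 1 1]
print(sevens)
# [[7 7]
#  [7 7]]
```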
Another great method is np.eye() that returns an array with 1s along its diagonal and 0s everywhere else.
An Identity matrix is a square matrix that has 1s along its main diagonal and 0s everywhere else. Below is an Identity matrix of shape 3 x 3.
Note: A square matrix has an N x N shape. This means it has the same number of rows and columns.
# identity matrix
np.eye(3)
However, NumPy gives you the flexibility to change the diagonal along which the values have to be 1s. You can either move it above the main diagonal:
# not an identity matrix
np.eye(3,k=1)
Or move it below the main diagonal:
np.eye(3,k=-2)
Note: A matrix is called the Identity matrix only when the 1s are along the main diagonal and not any other diagonal!
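To see where the 1s land for each value of k, here is a minimal sketch that prints the shifted diagonals. The variable names are illustrative:

```python
import numpy as np

identity = np.eye(3)
upper = np.eye(3, k=1)    # 1s one step above the main diagonal
lower = np.eye(3, k=-2)   # 1s two steps below the main diagonal

print(upper)
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [0. 0. 0.]]
print(lower)
# [[0. 0. 0.]
#  [0. 0. 0.]
#  [1. 0. 0.]]
```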
You can quickly get an evenly spaced array of numbers using the np.arange() method:
np.arange(5)
The start, end and step size of the interval of values can be explicitly defined by passing in three numbers as arguments for these values respectively. A point to be noted here is that the interval is defined as [start,end) where the last number will not be included in the array:
np.arange(2,10,2)
Alternate elements were printed because the step size was defined as 2. Notice that 10 was not printed because the end of the interval is excluded.
Another similar function is np.linspace(), but instead of step size, it takes in the number of samples that need to be retrieved from the interval. A point to note here is that the last number is included in the values returned unlike in the case of np.arange().
np.linspace(0,1,5)
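The two functions can be compared directly. This sketch reuses the calls from the text; the variable names are illustrative:

```python
import numpy as np

stepped = np.arange(2, 10, 2)    # start, end (excluded), step size
sampled = np.linspace(0, 1, 5)   # start, end (included), number of samples

print(stepped)  # [2 4 6 8]
print(sampled)  # [0.   0.25 0.5  0.75 1.  ]
```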
Great! Now you know how to create arrays using NumPy. But it's also important to know the shape of the array.
Once you have created your ndarray, the next thing you would want to do is check the number of axes, shape, and the size of the ndarray.
You can easily determine the number of dimensions or axes of a NumPy array using the ndim attribute:
# number of axis
a = np.array([[5,10,15],[20,25,20]])
print('Array :','\n',a)
print('Dimensions :','\n',a.ndim)
This array has two dimensions (axes): the first holds the 2 rows and the second holds the 3 columns.
The shape is an attribute of the NumPy array that shows how many elements there are along each dimension. You can further index the shape returned by the ndarray to get the value along each dimension:
a = np.array([[1,2,3],[4,5,6]])
print('Array :','\n',a)
print('Shape :','\n',a.shape)
print('Rows = ',a.shape[0])
print('Columns = ',a.shape[1])
You can determine how many values there are in the array using the size attribute. It is the product of the lengths along all dimensions, which for a 2-D ndarray is just the number of rows multiplied by the number of columns:
# size of array
a = np.array([[5,10,15],[20,25,20]])
print('Size of array :',a.size)
print('Manual determination of size of array :',a.shape[0]*a.shape[1])
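The same three attributes work for arrays with more than two dimensions. A small sketch, assuming an illustrative 3-D array named cube:

```python
import numpy as np

cube = np.zeros((2, 3, 4))

print(cube.ndim)    # 3
print(cube.shape)   # (2, 3, 4)
print(cube.size)    # 24

# size is the product of the lengths along every axis
print(int(np.prod(cube.shape)))  # 24
```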
You can change the shape of an ndarray without changing its data using the np.reshape() method:
# reshape
a = np.array([3,6,9,12])
np.reshape(a,(2,2))
Here, I reshaped the ndarray from a 1-D to a 2-D ndarray.
While reshaping, if you are unsure about the length of one of the axes, just pass -1 for it. NumPy automatically calculates that length from the size of the array and the other dimensions. Only one axis can be given as -1:
a = np.array([3,6,9,12,18,24])
print('Three rows :','\n',np.reshape(a,(3,-1)))
print('Three columns :','\n',np.reshape(a,(-1,3)))
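The inferred length has to divide the array evenly. The sketch below checks the shapes NumPy works out and shows the error raised when no valid length exists; the (4, -1) case is an added illustration, not part of the original example:

```python
import numpy as np

a = np.array([3, 6, 9, 12, 18, 24])

three_rows = np.reshape(a, (3, -1))
three_cols = np.reshape(a, (-1, 3))
print(three_rows.shape)  # (3, 2)
print(three_cols.shape)  # (2, 3)

# 6 elements cannot be split into 4 equal rows, so NumPy raises ValueError.
try:
    np.reshape(a, (4, -1))
except ValueError as err:
    print('Cannot reshape :', err)
```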
Sometimes when you have a multidimensional array and want to collapse it to a single-dimensional array, you can either use the flatten() method or the ravel() method:
a = np.ones((2,2))
b = a.flatten()
c = a.ravel()
print('Original shape :', a.shape)
print('Array :','\n', a)
print('Shape after flatten :',b.shape)
print('Array :','\n', b)
print('Shape after ravel :',c.shape)
print('Array :','\n', c)
Original shape : (2, 2)
Array :
 [[1. 1.]
 [1. 1.]]
Shape after flatten : (4,)
Array :
 [1. 1. 1. 1.]
Shape after ravel : (4,)
Array :
 [1. 1. 1. 1.]
But an important difference between flatten() and ravel() is that the former always returns a copy of the original array, while the latter returns a view of the original array whenever possible and makes a copy only when it has to. This means changes made to the array returned from ravel() will usually also be reflected in the original array, while this will never be the case with flatten().
b[0] = 0
print(a)
[[1. 1.]
 [1. 1.]]
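For comparison, here is a minimal sketch of the ravel() side of the same experiment. It assumes a freshly created contiguous array, for which ravel() returns a view:

```python
import numpy as np

a = np.ones((2, 2))
b = a.flatten()   # always a copy
c = a.ravel()     # a view here, because a is contiguous in memory

b[0] = 0
print(a[0, 0])    # 1.0, the original is untouched by the copy

c[0] = 0
print(a[0, 0])    # 0.0, the change made through the view shows up in a

print(np.shares_memory(a, b))  # False
print(np.shares_memory(a, c))  # True
```

If you need to be sure the original is never modified, flatten() (or an explicit copy()) is the safer choice.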