An Introduction to Machine Learning Libraries for C++
I love working with C++, even after I discovered the Python programming language for machine learning. C++ was the first programming language I ever learned and I’m delighted to use that in the machine learning space!
I wrote about building machine learning models in my previous article and the community loved the idea. I received an overwhelming response and one query stood out for me (from multiple folks) – are there any C++ libraries for machine learning?
It’s a fair question. Languages like Python and R have a plethora of packages and libraries that cater to different machine learning tasks. So does C++ have any such offering?
Yes, it does! I will highlight two such C++ libraries in this article, and we’ll also see both of them in action (with code). If you’re new to C++ for machine learning, I’ll again recommend going through the first article.
Table of Contents
- Why Should we use Machine Learning Libraries?
- Machine Learning Libraries in C++
- SHARK Library
- MLPACK Library
Why Should we use Machine Learning Libraries?
This is a question a lot of newcomers will have. What is the importance of libraries in machine learning? Let me try and explain that in this section.
Let’s say that experienced professionals and industry veterans have put in the hard graft and come up with a solution to a problem. Would you prefer using that or would you rather spend hours trying to re-create the same thing from scratch? That’s usually little point in going for the latter method, especially when you’re working or learning under deadlines.
The great thing about our machine learning community is that a lot of solutions already exist in the form of libraries and packages. Someone else, from experts to enthusiasts, has already done the hard work and put the solution together nicely packaged in a library.
These machine learning libraries are efficient and optimized, and they are tested thoroughly for multiple use cases. Relying on these libraries is what powers our learning and makes writing code, whether that’s in C++ or Python, so much easier and intuitive.
Machine Learning Libraries in C++
In this section, we’ll look at the two most popular machine learning libraries in C+:
- SHARK Library
- MLPACK Library
Let’s look at each one individually and see how the C++ code works.
1) SHARK C++ Library
Shark is a fast modular library and it has overwhelming support for supervised learning algorithms, such as linear regression, neural networks, clustering, k-means, etc. It also includes the functionality of linear algebra and numerical optimization. These are key mathematical functions or areas that are very important when performing machine learning tasks.
We’ll first see how to install Shark and set up an environment. Then we’ll implement linear regression with Shark.
Install Shark and setup environment (I’ll be doing this for Linux)
- Shark relies on Boost and cmake. Fortunately, all the dependencies can be installed with the following command:
sudo apt-get install cmake cmake-curses-gui libatlas-base-dev libboost-all-dev
- To install Shark, run the below commands line-by-line in your terminal:
- git clone https://github.com/Shark-ML/Shark.git (you can download the zip file and extract as well)
- cd Shark
- mkdir build
- cd build
- cmake ..
If you haven’t seen this before that’s not a problem. It’s pretty straightforward and there’s plenty of information online if you get into trouble. For Windows and other operating systems, you can do a quick Google search on how to install Shark. Here is the reference site Shark Installation guide.
Compiling programs with Shark
- Include the relevant header files. Suppose we want to implement linear regression, then the extra header files included are:
#include <shark/ObjectiveFunctions/Loss/SquaredLoss.h> #include <shark/Algorithms/Trainers/LinearRegression.h>
- To compile, we need to link with the following libraries:
-std=c++11 -lboost_serialization -lshark -lcblas
Implementing Linear Regression with Shark
My first article on this series had an introduction to linear regression. I’ll use the same idea in this article, but this time using the Shark C++ library.
We’ll start by including the libraries and header functions for linear regression:
Next comes the dataset. I have created two CSV files. The input.csv file contains the x values and the labels.csv file contains the y values. Below is a snapshot of the data:
You can find both the files here – Machine Learning with C++. First, we’ll make data containers for storing the values from CSV files:
Next, we need to import them. Shark comes with a nice import CSV function, and we specify the data container that we want to initialize, and also the location to path file of the CSV:
We then need to instantiate a regression dataset type. Now, this is just a general object for regression, and what we’ll do in the constructor is we pass in our inputs and also our labels for the data.
Next, we need to train the linear regression model. How do we do that? We need to instantiate a trainer and define a linear model:
Next comes the key step where we actually train the model. Here, the trainer has a member function called train. So this function trains this model and finds the parameters for the model, which is exactly what we want to do.
Finally, let’s output the model parameters:
Linear models have a member function called offset that outputs the intercept of the best fit line. Next, we’re outputting a matrix instead of a multiplier. This is because the model can be generalized (not just linear, it could be a polynomial).
We compute the line of best fit by minimizing the least-squares, aka minimizing the squared loss.
So, fortunately, the model allows us to output that information. The Shark library is very helpful in giving an indication as to how well the model(s) fit:
First, we need to initialize a squared loss object, and then we need to instantiate a data container. The prediction then is computed based on the inputs to the system, and then all we need to do is output the loss, which is computed via passing the labels and also the prediction value.
Finally, we need to compile. In the terminal, type the below command (make sure the directory is properly set):
g++ -o lr linear_regression.cpp -std=c++11 -lboost_serialization -lshark -lcblas
Once compiled, it would have created an lr object. Now simply run the program. The output we get is:
b : (-0.749091)
The value of b is a little far from 0 but that is because of the noise in labels. The value of the multiplier is close to 2 which is quite similar to the data. And that’s how you can use the Shark library in C++ to build a linear regression model!
2) MLPACK C++ Library
mlpack is a fast and flexible machine learning library written in C++. It aims to provide fast and extensible implementations of cutting-edge machine learning algorithms. mlpack provides these algorithms as simple command-line programs, Python bindings, Julia bindings, and C++ classes which can then be integrated into larger-scale machine learning solutions.
We’ll first see how to install mlpack and the setup environment. Then we’ll implement the k-means algorithm using mlpack.
Install mlpack and the setup environment (I’ll be doing this for Linux)
mlpack depends on the below libraries that need to be installed on the system and have headers present:
- Armadillo >= 8.400.0 (with LAPACK support)
- Boost (math_c99, program_options, serialization, unit_test_framework, heap, spirit) >= 1.49
- ensmallen >= 2.10.0
In Ubuntu and Debian, you can get all of these dependencies through apt:
sudo apt-get install libboost-math-dev libboost-program-options-dev libboost-test-dev libboost-serialization-dev binutils-dev python-pandas python-numpy cython python-setuptools
Now that all the dependencies are installed in your system, you can directly run the commands below to build and install mlpack:
- tar -xvzpf mlpack-3.2.2.tar.gz
- mkdir mlpack-3.2.2/build && cd mlpack-3.2.2/build
- cmake ../
- make -j4 # The -j is the number of cores you want to use for a build
- sudo make install
On many Linux systems, mlpack will be installed by default to /usr/local/lib and you may need to set the LD_LIBRARY_PATH environment variable:
The instructions above are the simplest way to get, build, and install mlpack. If your Linux distribution supports binaries, follow this site to install mlpack using a one-line command depending on your distribution.: MLPACK Installation Instructions. The above method works for all distributions.
Compiling programs with mlpack
- Include the relevant header files in your program (assuming implementing k-means):
- #include <mlpack/methods/kmeans/kmeans.hpp>
- #include <armadillo>
- To compile we need to link the following libraries:
- std=c++11 -larmadillo -lmlpack -lboost_serialization
Implementing K-Means with mlpack
K-means is a centroid-based algorithm, or a distance-based algorithm, where we calculate the distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.
The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their respective cluster centroid.
K-means is effectively an iterative process where we want to segment the data into certain clusters. First, we assign some initial centroids, so these can be completely random. Next, for each data point, we find the nearest centroid. We will then assign that data point to that centroid. So each centroid represents a class. And once we assign all the data points to each centroid, we will then compute the mean of those centroids.
For a detailed understanding of the K-means algorithm, read this tutorial: The Most Comprehensive Guide to K-Means Clustering You’ll Ever Need.
Here, we will implement k-means using the mlpack library in C++.
We’ll start by including the libraries and header functions for k-means:
Next, we’re going to create some basic variables to set the number of clusters, the dimensionality of the program, the number of samples, and the maximum amount of iterations we want to do. Why? Because K-means is an iterative process.
Next, we are going to create the data. So this is where we’re first going to use the Armadillo library. We’ll create a map class that is effectively a data container:
There you go! This mat class, the object data it has, we’ve given it a dimensionality of two, and it knows it’s going to have 50 samples, and it’s initialized all these data values to be 0.
Next, we will assign some random data to this data class and then effectively run K-means on it. I’m going to create 25 points around the position 1 1, and we can do this by effectively saying each data point is 1 1 or at the position X equals 1, y equals 1. Then we’re going to add some random noise for each of the 25 data points. Let’s see this in action:
Here, for i from 0 to 25, the ith column needs to be this arma type vector at the position 1 1, and then we’re going to add a certain amount of random noise of size 2. So, it’s going to be a 2-dimensional random noise vector times 0.25 to this position, and that’s going to be our data column. And then we’re going to do exactly the same for the point x equals 2 and y equals 3.
And our data is now ready! Time to jump to the training phase.
So first, let’s instantiate an arma mat row type to hold the clusters, and then instantiate an arma mat to hold the centroids:
Now, we need to instantiate the K-means class:
We’ve instantiated the K-means class, and we’ve specified the maximum amount of iterations that we passed through to the constructor. So now, we can go ahead and do the clustering.
We’ll call the Cluster member function of this K-means class. We need to pass in the data, the number of clusters, and then we also need to pass in the cluster’s object and the centroid’s object.
Now, this cluster function will run K-means on this data with a specified number of clusters, and it will then initialize these two objects – clusters and centroids.
We can simply display the results using the centroids.print function. This will give the location of the centroids:
Next, we need to compile. In the terminal, type the below command (again, make sure the directory is properly set):
g++ k_means.cpp -o kmeans_test -O3 -std=c++11 -larmadillo -lmlpack -lboost_serialization && ./kmeans_test
Once compiled, it would have created a kmeans object. Now simply run the program. The output we get is:
And that’s it!
In this article, we saw two popular machine learning libraries that help us implement machine learning models in c++. I love the extensive support available in the official documentation so do check that out. If you need any help, reach out to me below and I’ll be happy to connect with you!
In the next article, we’ll implement some interesting machine learning models like decision trees and random forest. So stay tuned!