An Introduction to Machine Learning Libraries for C++

Alakh Sethi 14 May, 2020

8 min read

Introduction

I love working with C++, even after I discovered the Python programming language for machine learning. C++ was the first programming language I ever learned and I’m delighted to use that in the machine learning space!

I wrote about building machine learning models in my previous article and the community loved the idea. I received an overwhelming response and one query stood out for me (from multiple folks) – are there any C++ libraries for machine learning?

It’s a fair question. Languages like Python and R have a plethora of packages and libraries that cater to different machine learning tasks. So does C++ have any such offering?

Yes, it does! I will highlight two such C++ libraries in this article, and we’ll also see both of them in action (with code). If you’re new to C++ for machine learning, I’ll again recommend going through the first article.

Why Should we use Machine Learning Libraries?

This is a question a lot of newcomers will have. What is the importance of libraries in machine learning? Let me try and explain that in this section.

Let’s say that experienced professionals and industry veterans have put in the hard graft and come up with a solution to a problem. Would you prefer using that or would you rather spend hours trying to re-create the same thing from scratch? That’s usually little point in going for the latter method, especially when you’re working or learning under deadlines.

The great thing about our machine learning community is that a lot of solutions already exist in the form of libraries and packages. Someone else, from experts to enthusiasts, has already done the hard work and put the solution together nicely packaged in a library.

These machine learning libraries are efficient and optimized, and they are tested thoroughly for multiple use cases. Relying on these libraries is what powers our learning and makes writing code, whether that’s in C++ or Python, so much easier and intuitive.

Machine Learning Libraries in C++

In this section, we’ll look at the two most popular machine learning libraries in C+:

SHARK Library
MLPACK Library

Let’s look at each one individually and see how the C++ code works.

1) SHARK C++ Library

Shark is a fast modular library and it has overwhelming support for supervised learning algorithms, such as linear regression, neural networks, clustering, k-means, etc. It also includes the functionality of linear algebra and numerical optimization. These are key mathematical functions or areas that are very important when performing machine learning tasks.

We’ll first see how to install Shark and set up an environment. Then we’ll implement linear regression with Shark.

Install Shark and setup environment (I’ll be doing this for Linux)

Shark relies on Boost and cmake. Fortunately, all the dependencies can be installed with the following command:

              sudo apt-get install cmake cmake-curses-gui libatlas-base-dev libboost-all-dev

To install Shark, run the below commands line-by-line in your terminal:

git clone https://github.com/Shark-ML/Shark.git (you can download the zip file and extract as well)
cd Shark
mkdir build
cd build
cmake ..
make

If you haven’t seen this before that’s not a problem. It’s pretty straightforward and there’s plenty of information online if you get into trouble. For Windows and other operating systems, you can do a quick Google search on how to install Shark. Here is the reference site Shark Installation guide.

Compiling programs with Shark

Include the relevant header files. Suppose we want to implement linear regression, then the extra header files included are:
```
#include <shark/ObjectiveFunctions/Loss/SquaredLoss.h>
#include <shark/Algorithms/Trainers/LinearRegression.h>
```

To compile, we need to link with the following libraries:
```
-std=c++11 -lboost_serialization -lshark -lcblas
```

Implementing Linear Regression with Shark

My first article on this series had an introduction to linear regression. I’ll use the same idea in this article, but this time using the Shark C++ library.

Initialization phase
We’ll start by including the libraries and header functions for linear regression:

Next comes the dataset. I have created two CSV files. The input.csv file contains the x values and the labels.csv file contains the y values. Below is a snapshot of the data:

You can find both the files here – Machine Learning with C++. First, we’ll make data containers for storing the values from CSV files:

Next, we need to import them. Shark comes with a nice import CSV function, and we specify the data container that we want to initialize, and also the location to path file of the CSV:

We then need to instantiate a regression dataset type. Now, this is just a general object for regression, and what we’ll do in the constructor is we pass in our inputs and also our labels for the data.

Next, we need to train the linear regression model. How do we do that? We need to instantiate a trainer and define a linear model:

Training Phase

Next comes the key step where we actually train the model. Here, the trainer has a member function called train. So this function trains this model and finds the parameters for the model, which is exactly what we want to do.

Prediction phase

Finally, let’s output the model parameters:

Linear models have a member function called offset that outputs the intercept of the best fit line. Next, we’re outputting a matrix instead of a multiplier. This is because the model can be generalized (not just linear, it could be a polynomial).

We compute the line of best fit by minimizing the least-squares, aka minimizing the squared loss.

So, fortunately, the model allows us to output that information. The Shark library is very helpful in giving an indication as to how well the model(s) fit:

First, we need to initialize a squared loss object, and then we need to instantiate a data container. The prediction then is computed based on the inputs to the system, and then all we need to do is output the loss, which is computed via passing the labels and also the prediction value.

Finally, we need to compile. In the terminal, type the below command (make sure the directory is properly set):

g++ -o lr linear_regression.cpp -std=c++11 -lboost_serialization -lshark -lcblas

Once compiled, it would have created an lr object. Now simply run the program. The output we get is:

b : [1](-0.749091)

A :[1,1]((2.00731))

Loss: 7.83109

The value of b is a little far from 0 but that is because of the noise in labels. The value of the multiplier is close to 2 which is quite similar to the data. And that’s how you can use the Shark library in C++ to build a linear regression model!

2) MLPACK C++ Library

mlpack is a fast and flexible machine learning library written in C++. It aims to provide fast and extensible implementations of cutting-edge machine learning algorithms. mlpack provides these algorithms as simple command-line programs, Python bindings, Julia bindings, and C++ classes which can then be integrated into larger-scale machine learning solutions.

We’ll first see how to install mlpack and the setup environment. Then we’ll implement the k-means algorithm using mlpack.

Install mlpack and the setup environment (I’ll be doing this for Linux)

mlpack depends on the below libraries that need to be installed on the system and have headers present:

Armadillo >= 8.400.0 (with LAPACK support)
Boost (math_c99, program_options, serialization, unit_test_framework, heap, spirit) >= 1.49
ensmallen >= 2.10.0

In Ubuntu and Debian, you can get all of these dependencies through apt:

sudo apt-get install libboost-math-dev libboost-program-options-dev libboost-test-dev libboost-serialization-dev binutils-dev python-pandas python-numpy cython python-setuptools

Now that all the dependencies are installed in your system, you can directly run the commands below to build and install mlpack:

wget
tar -xvzpf mlpack-3.2.2.tar.gz
mkdir mlpack-3.2.2/build && cd mlpack-3.2.2/build
cmake ../
make -j4 # The -j is the number of cores you want to use for a build
sudo make install

On many Linux systems, mlpack will be installed by default to /usr/local/lib and you may need to set the LD_LIBRARY_PATH environment variable:

export LD_LIBRARY_PATH=/usr/local/lib

The instructions above are the simplest way to get, build, and install mlpack. If your Linux distribution supports binaries, follow this site to install mlpack using a one-line command depending on your distribution.: MLPACK Installation Instructions. The above method works for all distributions.

Compiling programs with mlpack

Include the relevant header files in your program (assuming implementing k-means):

1. #include <mlpack/methods/kmeans/kmeans.hpp>
2. #include <armadillo>

To compile we need to link the following libraries:

1. std=c++11 -larmadillo -lmlpack -lboost_serialization

Implementing K-Means with mlpack

K-means is a centroid-based algorithm, or a distance-based algorithm, where we calculate the distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.

The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their respective cluster centroid.

K-means is effectively an iterative process where we want to segment the data into certain clusters. First, we assign some initial centroids, so these can be completely random. Next, for each data point, we find the nearest centroid. We will then assign that data point to that centroid. So each centroid represents a class. And once we assign all the data points to each centroid, we will then compute the mean of those centroids.

For a detailed understanding of the K-means algorithm, read this tutorial: The Most Comprehensive Guide to K-Means Clustering You’ll Ever Need.

Here, we will implement k-means using the mlpack library in C++.

Initialization phase

We’ll start by including the libraries and header functions for k-means:

Next, we’re going to create some basic variables to set the number of clusters, the dimensionality of the program, the number of samples, and the maximum amount of iterations we want to do. Why? Because K-means is an iterative process.

Next, we are going to create the data. So this is where we’re first going to use the Armadillo library. We’ll create a map class that is effectively a data container:

There you go! This mat class, the object data it has, we’ve given it a dimensionality of two, and it knows it’s going to have 50 samples, and it’s initialized all these data values to be 0.

Next, we will assign some random data to this data class and then effectively run K-means on it. I’m going to create 25 points around the position 1 1, and we can do this by effectively saying each data point is 1 1 or at the position X equals 1, y equals 1. Then we’re going to add some random noise for each of the 25 data points. Let’s see this in action:

Here, for i from 0 to 25, the ith column needs to be this arma type vector at the position 1 1, and then we’re going to add a certain amount of random noise of size 2. So, it’s going to be a 2-dimensional random noise vector times 0.25 to this position, and that’s going to be our data column. And then we’re going to do exactly the same for the point x equals 2 and y equals 3.

And our data is now ready! Time to jump to the training phase.

Training Phase

So first, let’s instantiate an arma mat row type to hold the clusters, and then instantiate an arma mat to hold the centroids:

Now, we need to instantiate the K-means class:

We’ve instantiated the K-means class, and we’ve specified the maximum amount of iterations that we passed through to the constructor. So now, we can go ahead and do the clustering.

We’ll call the Cluster member function of this K-means class. We need to pass in the data, the number of clusters, and then we also need to pass in the cluster’s object and the centroid’s object.

Now, this cluster function will run K-means on this data with a specified number of clusters, and it will then initialize these two objects – clusters and centroids.

Generating results

We can simply display the results using the centroids.print function. This will give the location of the centroids:

Next, we need to compile. In the terminal, type the below command (again, make sure the directory is properly set):

g++ k_means.cpp -o kmeans_test -O3 -std=c++11 -larmadillo -lmlpack -lboost_serialization && ./kmeans_test

Once compiled, it would have created a kmeans object. Now simply run the program. The output we get is:

Centroids:

0.9497 1.9625

0.9689 3.0652

And that’s it!

End Notes

In this article, we saw two popular machine learning libraries that help us implement machine learning models in c++. I love the extensive support available in the official documentation so do check that out. If you need any help, reach out to me below and I’ll be happy to connect with you!

In the next article, we’ll implement some interesting machine learning models like decision trees and random forest. So stay tuned!

Alakh Sethi 14 May, 2020

Aspiring Data Scientist with a passion to play and wrangle with data and get insights from it to help the community know the upcoming trends and products for their better future.With an ambition to develop product used by millions which makes their life easier and better.

Clustering Intermediate Libraries Linear Regression Machine Learning

Frequently Asked Questions

Responses From Readers

Theta 14 May, 2020

Awesome information! A really good read. Now to see what I can extrapolate from practice and experiments.

1

Show 1 reply

Alakh Sethi 15 May, 2020

Thank you so much! Happy to hear!

Catan++ 15 May, 2020

Please fix the inconsistent coding style of your snippets (comments, indentation etc).

Lori B Guidos 17 May, 2020

Hey..your registration page phone completion setup is a drag. I am a U S resisent and I couldn't get it to work with my phone number. Personally, I don't really want to ahare a phone number with anyone on a registration form...

Shilpa 18 May, 2020

This is truly helpful ! Thanks for sharing. Do you have this for Python?

1

Show 1 reply

Alakh Sethi 18 May, 2020

Thank you so much! These libraries are designed to work for machine learning problems in C++. For python related libraries you can see other related amazing articles on analytics vidhya blog.