Pearl Miglani — June 21, 2021

This article was published as a part of the Data Science Blogathon

## Introduction:

Every day on the internet, more than 2.5 quintillion bytes of data are created. This data is increasing in terms of variety, velocity and volume, hence called big data. To analyze this data, one has to collect this data, store it in a safe place, clean it and then perform analysis. One of the major problems faced by big data engineers is dealing with unuseful or redundant data. A lot of time and memory is used to store and analyze this extra data which turns out to be fruitless in the end. Thus, the removal of duplicate data becomes extremely essential to cut the analysis cost and reduce redundancy.

Data cleaning can be done using various techniques but before cleaning the data, it is necessary to know the amount of useful data present in the dataset. Therefore, before the removal of duplicate data from a data stream or database, it is necessary to have knowledge of distinct or unique data present. A way to do so is by hashing the elements of the universal set using the Flajolet Martin Algorithm. The FM algorithm is used in a database query, big data analytics, spectrum sensing in cognitive radio sensor networks, and many more areas. It shows superior performance as compared with many other methods to find distinct elements in a stream of data.

## Flajolet Martin Algorithm:

Flajolet Martin Algorithm, also known as FM algorithm, is used to approximate the number of unique elements in a data stream or database in one pass. The highlight of this algorithm is that it uses less memory space while executing.

Pseudo Code-Stepwise Solution:

1. Selecting a hash function h so each element in the set is mapped to a string to at least log2n bits.

2. For each element x, r(x)= length of trailing zeroes in h(x)

3. R= max(r(x))

=> Distinct elements= 2R

## Reasons for using Flajolet Martin algorithm:

Let us compare this algorithm with our conventional algorithm using python code. Assume we have an array(stream in code) of data of length 20 with 8 unique elements. Using the brute force approach to find the number of unique elements in the array, each element is taken into consideration. Another array(st_unique in code) is formed for unique elements. Initially, the new array is empty(st_unique length equals zero), so naturally, the first element is not present in it. The first element is considered to be unique as it does not exist in the new array and thus a copy of the first element is inserted into the new array(1 is appended to st_unique).

Similarly, all the elements are checked, if they are already present in the new array, they are not considered to be unique, else a copy of the element is inserted into the new array. Running the brute force algorithm for our array, we will get 8 elements in the new array. If each element takes 20 bytes of data, the new array will take 8*20= 160 bytes memory to run the algorithm.

```stream=[1,2,3,4,5,6,4,2,5,9,1,6,3,7,1,2,2,4,2,1]
print('Using conventional Algorithm:')
start_time = time.time()
st_unique=[]
for i in stream:
if i in st_unique:
continue
else:
st_unique.append(i)

print('distinct elements',len(st_unique))
print("--- %s seconds ---" % (time.time() - start_time))```

For the same array, if use the FM algorithm for the same array, we define a variable(maxnum in code) that stores the maximum number of zeroes at the end. For each value in the array(stream in code), we run a loop to covert its hash function in the form ax+b mod c,(a=1, b=6 and c=32 in this case) into binary(we place [2:] at the end because in python when converted to binary, the number starts with ‘0b’). We run another loop to find if the number of zeroes at the end exceeds the maximum number of zeroes.

In this case, if each variable occupies 2 bytes of data, the whole program takes 4*20= 80 bytes of data, i.e half of the memory used in the above case. Here, we have considered the variables maxnum, sum, val, and j. We have not considered i, time, and stream as they are common in both of the codes.

```stream=[1,2,3,4,5,6,4,2,5,9,1,6,3,7,1,2,2,4,2,1]
print('Using Flajolet Martin Algorithm:')
import time
start_time = time.time()

maxnum=0
for i in range(0,len(stream)):
val= bin((1*stream[i] + 6) % 32)[2:]

sum=0
for j in range(len(val)-1,0,-1):

if val[j]=='0':
sum+=1
else:
break
if sum>maxnum:
maxnum=sum

print('distict elements', 2**maxnum)
print("--- %s seconds ---" % (time.time() - start_time))```

For small values of m(where m is the number of unique elements), the brute force approach can work, but for large data sets or data streams, where m is very large, a lot of space is required. The compiler may not let us run the algorithm in some cases.

This is where the Flajolet Martin Algorithm can be used. Not only does it occupy less memory, but it also shows better results in terms of time in seconds when the python code is run which can be shown in our output as we calculated seconds taken by both algorithms by using time.time() in python. As shown in the output, it can clearly be said that the FM algorithm takes very little time as compared to the conventional algorithm.

## Drawbacks of the algorithm:

It is important to choose the hash parameters wisely while implementing this algorithm as it has been proven practically that the FM Algorithm is very sensitive to the hash function parameters. The hash function used in the above code is of the form ax+b mod c where x is an element in the stream and a,b,c are 1,6,32 respectively where a,b,c are the hash function parameters. For any other values of a,b,c the algorithm may not give the same result. Thus, it is important to observe the stream and then assign proper values to a,b and c.

## Conclusion:

This approach is used to maintain a count of distinct values seen so far, given a large number of values. For example, getting an approximation of the number of distinct URLs surfed by a person on the web. Many companies want to check how many unique users logged in to their website the previous day to check if their advertisement was successful or not. Here, the FM algorithm is an excellent solution for these companies.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion. 