Log Parsing using Regular Expressions and Scala in Spark

Uday Shaw 15 Mar, 2022

3 min read

This article was published as a part of the Data Science Blogathon.

Introduction

In this article, I am going to explain, how can we use log parsing with Spark and Scala to get meaningful data from unstructured data. In my experience, after parsing a lot of logs from different sources, I have found no data is unstructured. There is always some meaningful way to look at it and understand it. This is the way that we understand unstructured data. In the case of logs, there are configuration files, that contain in what position which field gets written. So to start with I will take a simple NGINX logline and explain how it can be parsed using regular expressions. But before that let’s go through a brief of the tools involved.

What is Regular Expression?

Regular Expression is a combination of strings that are used to search patterns in text. It is widely used with almost all the programming languages that are available, even NoSQL databases also allow using regular expressions while querying them.

What is Spark and Scala?

Spark is a data processing engine that is used due to its parallel processing abilities. It is written in a programming language called Scala. Spark can be used with Java, Python, R and Scala. But since it is written in Scala, I have seen in many projects, they prefer to build their data pipelines in Scala.

Log Parsing

Let’s take an example,

Say I am running NGINX, in one of my servers and there is a need to monitor its access log, so that failures, API usage and different meaningful data points can be derived from it.

A sample logline from the NGINX access log looks like this:

47.29.201.179 – – [28/Feb/2019:13:17:10 +0000] “GET /?e=1 HTTP/2.0” 200 5316 “https://sampleexample.com/?e=1” “Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36”

The configuration that defines the log format can be found in nginx.conf file, which looks like :

log_format main ‘$remote_addr – $remote_user [$time_local] “$request” $status $body_bytes_sent “$http_referer” “$http_user_agent”‘;

To human eyes the correlation is clear and looks like.

$remote_addr	47.29.201.179
–	–
$remote_user	–
[$time_local]	[28/Feb/2019:13:17:10 +0000]
$request	GET /?e=1 HTTP/2.0
$status	200
$body_bytes_sent	5316
$http_referer	https://sampleexample.com/?e=1
$http_user_agent	Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36

Implementation

But it is a little difficult to make this correlation programmatically.

To start with, we have to write a regular expression, that can recognise patterns from the log. I have written one, and the explanation follows. You can write your own regular expression or make changes to it as per your use case.

(d+.d+.d+.d+) – – ([.*?]) (“.*?”) (d+) (d+) (“.*?”) (“.*?”)

The correlation between regular expression and log format is given in below table:

$remote_addr	(d+.d+.d+.d+)
–	–
$remote_user	–
[$time_local]	([.*?])
“$request”	(“.*?”)
$status	(d+)
$body_bytes_sent	(d+)
“$http_referer”	(“.*?”)
“$http_user_agent”	(“.*?”)

Now, let’s initialize the logline into a data frame
as in most cases you will have a lot of it and data frames are the most convenient
for data processing. You can follow the below code for the same.

val log="""47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?e=1 HTTP/2.0" 200 5316 "https://sampleexample.com/?e=1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36""""
val df=Seq((log)).toDF
df.show(false)

Now, Let’s create a UDF to parse logs using regular expressions.

import org.apache.spark.sql.functions.{col,udf}
val regex_split=udf((value:String, regex:String)=>{regex.r.unapplySeq(value)})
val regularExpression="""(d+.d+.d+.d+) - - ([.*?]) (".*?") (d+) (d+) (".*?") (".*?")"""
val df_parsed=df.withColumn("parsed_fields",regex_split($"value",lit(regularExpression)))

In the above code block, “regex_split” UDF is created to parse the log data, Now we use the UDF to parse the log and see the data output. You can see that “df_parsed”, the output data frame has 2 fields, one of which is an array that is generated using the UDF.

Now we have to separate the elements of the fields in the array to different columns. The below code shows how that can be done using a map variable.

val colname_position_map=Map((0,"remote_addr"),(1,"time"),(2,"request"),(3,"status"),(4,"body_bytes_sent"),(5,"http_referer"),(6,"http_user_agent"))
df_parsed.select(colname_position_map.keys.toList.map(x=>col("parsed_fields")(x).as(colname_position_map(x))):_*).show

The final output looks like