Using the Power of Deep Learning for Cyber Security (Part 1)

guest_blog | 14 May, 2020

Introduction

The majority of deep learning applications we see in the community are geared towards fields like marketing, sales, and finance. We hardly ever read articles or find resources about deep learning being used to protect these products, and the businesses behind them, from malware and hacker attacks.

While the big technology companies like Google, Facebook, Microsoft, and Salesforce have already embedded deep learning into their products, the cybersecurity industry is still playing catch up. It’s a challenging field but one that needs our full attention.

In this article, we briefly introduce Deep Learning (DL) along with a few existing Information Security (hereafter referred to as InfoSec) applications it enables. We then dive deep into the interesting problem of anonymous TOR traffic detection and present a DL-based solution to detect TOR traffic.

The target audience for this article is data science professionals who are already working on machine learning projects. The content assumes that you have foundational knowledge of machine learning and are currently a beginner in, or exploring, deep learning and its use cases.

The below pre-reads are highly recommended to get the most out of this article:

 

Table of Contents

  1. The Current State of Deep Learning Systems in InfoSec
  2. A Brief Overview of Feed Forward Neural Network
  3. Case Study: Tor Traffic Detection using Deep Learning
  4. Data Experiments – Tor Traffic Detection

 

The Current State of Deep Learning Systems in InfoSec

Deep learning is not a silver bullet that can solve all InfoSec problems, not least because it needs extensive labeled datasets, and no such labeled datasets are readily available. However, there are several InfoSec use cases where deep learning networks are making significant improvements over existing solutions. Malware detection and network intrusion detection are two such areas, where deep learning has outperformed rule-based and classic machine learning-based solutions.

Network intrusion detection systems are typically rule- and signature-based controls deployed at the perimeter to detect known threats. Adversaries change malware signatures to easily evade these traditional systems. Quamar et al. [1], in their IEEE transactions paper, showed that DL-based systems using self-taught learning are promising in detecting unknown network intrusions. Traditional security use cases such as malware detection and spyware detection have also been tackled with deep neural network-based systems [2].

The generalization power of DL-based techniques is better than that of traditional ML-based approaches. Jung et al.'s [3] DL-based system can even detect zero-day malware. Daniel Gibert [2], a Ph.D. graduate from the University of Barcelona, has done extensive work on convolutional neural networks (CNNs, a type of DL architecture) for malware detection. In his Ph.D. thesis, he reports that CNNs can detect even polymorphic malware.

DL-based neural nets are now being used in User and Entity Behaviour Analytics (UEBA). Traditionally, UEBA employs anomaly detection and machine learning algorithms that distill security events to profile and baseline every user and network element in the enterprise IT environment. Any significant deviation from these baselines is flagged as an anomaly and raises an alert for security analysts to investigate. UEBA enhanced the detection of insider threats, albeit to a limited extent.

Now, deep learning-based systems are used to detect many other types of anomalies. Paweł Kobojek from Warsaw University, Poland [4] uses keystroke dynamics to verify users with an LSTM network. Jason Trost, director of security data engineering at Capital One, has published several blogs [5] listing technical papers and talks on applying deep learning in InfoSec.

 

A Brief Overview of Feed Forward Neural Network

The artificial neural network is inspired by the biological neural network. Neurons are the atomic units of a biological neural network. Each neuron consists of dendrites, a nucleus, and an axon. It receives signals through its dendrites, performs computations in the nucleus, and carries the output along its axon (Figure 1 below). The entire network is made up of a chain of such neurons.

AI researchers borrowed this idea to develop the artificial neural network (ANN). In this setting, each neuron performs three actions:

  1. it accumulates input from various other neurons, or from external inputs, in a weighted manner
  2. it sums up all the input signals
  3. based on the summed value, it applies an activation function

Each neuron can thus classify whether a set of inputs belongs to one class or another. This power is limited when only a single neuron is used. However, combining a set of neurons makes for powerful machinery for classification and sequence labelling tasks.
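To make these three actions concrete, here is a minimal single-neuron sketch in Python; the inputs, weights, and choice of a sigmoid activation are illustrative, not taken from the article:

import numpy as np

def sigmoid(z):
    # squashes the weighted sum into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    # 1. weight each incoming signal, 2. sum the signals, 3. apply the activation
    weighted_sum = np.dot(weights, inputs) + bias
    return sigmoid(weighted_sum)

x = np.array([0.5, -1.2, 3.0])     # signals from other neurons or raw inputs
w = np.array([0.4, 0.1, -0.6])     # illustrative weights
print(neuron(x, w, bias=0.1))      # output near 1 suggests one class, near 0 the other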

Figure 1: Some of our greatest inspiration comes from nature – the figure depicts a biological neuron and an artificial neuron.

A stack of neuron layers forms a neural network. The network architecture differs based on the objective it needs to achieve. A common architecture is the Feed Forward Neural Network (FFN), in which neurons are arranged in layers without any cycles. It is called feed forward because information travels only in the forward direction through the network: first through the input layer, then through the hidden layers, and finally through the output layer (Figure 2 below).

Figure 2: A feed forward network with two hidden layers

Like any supervised machine learning model, an FFN needs to be trained on labeled data. Training takes the form of optimizing the parameters to reduce the error between the predicted output and the true value. The most important parameters to optimize are the weights each neuron assigns to its input signals. For a single neuron, the weight update can easily be computed from the error.

However, when neurons are collated into multiple layers, it is challenging to optimize weights deep in the network based on an error computed only at the output layer. The backpropagation algorithm addresses this issue [6]. Backpropagation is a well-established technique based on automatic differentiation: the chain rule of calculus is applied repeatedly to calculate the gradient of the error with respect to every weight in the network.

In an FFN, the output is obtained from the activations of the linked neurons. The error is calculated by comparing the network's output with the true outcome, and is then propagated backwards, layer by layer, to correct the weights of the internal neurons. The parameters are optimized over multiple iterations through the training data.
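As a toy illustration of this update rule, for the single-neuron case described above, the sketch below performs gradient-descent steps on a sigmoid neuron with a squared-error loss; the data, learning rate, and iteration count are arbitrary choices, not the article's setup:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input signals for one data instance
w = np.array([0.4, 0.1, -0.6])   # initial weights
y_true = 1.0                     # true label for this instance
lr = 0.1                         # learning rate (arbitrary)

for _ in range(100):             # multiple iterations over the instance
    y_pred = sigmoid(np.dot(w, x))
    error = y_pred - y_true                    # error at the output
    grad = error * y_pred * (1 - y_pred) * x   # chain rule for the sigmoid neuron
    w -= lr * grad                             # move the weights against the gradient

Backpropagation generalizes exactly this chain-rule computation to every weight in a multi-layer network.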

 

Case Study: Tor Traffic Detection using Deep Learning

The primary goal of cyber-attacks is to steal enterprise customer data, sales data, intellectual property documents, source code, and software keys. Adversaries exfiltrate the stolen data to remote servers in encrypted traffic, hidden among the regular traffic.

Most often, adversaries use an anonymous network that makes it difficult for security defenders to trace the traffic. Moreover, the exfiltrated data is typically encrypted, rendering rule-based network intrusion tools and firewalls ineffective. Recently, anonymous networks have also been used for command and control (C&C) by specific variants of ransomware/malware. For instance, Onion Ransomware [7] uses the TOR network to communicate with its C&C server.

Figure 3: An illustration of TOR communication between Alice and a destination server. The communication starts with Alice requesting a path to the server. The TOR network returns a path, over which traffic is AES encrypted; path randomization happens inside the TOR network. The encrypted path of the packet is shown in red. Upon reaching the exit node, the periphery node of the TOR network, the plain packet is forwarded to the server.

Anonymous networking can be accomplished through various means, which can be broadly classified into several categories. Among them, TOR is one of the more popular choices. TOR is free software that enables anonymous communication over the internet through a specialized routing protocol known as the onion routing protocol [9]. The protocol relies on redirecting internet traffic over various freely hosted relays across the world. During each relay hop, like the layers of an onion peel, the HTTP packet is encrypted using the public key of the receiver.

At each receiving relay, the packet's outermost layer can be decrypted using that relay's private key. Upon decryption, the address of the next relay is revealed. This carries on until the packet reaches the exit node of the TOR network, where the final layer of decryption happens and a plain HTTP packet is forwarded to the original destination server. An example routing scheme between Alice and the server is depicted in Figure 3 above.
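To illustrate just the "onion peel" idea (this is not the actual TOR protocol, which negotiates per-hop session keys over a telescoping circuit), here is a toy sketch that wraps a packet in three layers of encryption, using symmetric Fernet keys as stand-ins for the per-relay keys:

from cryptography.fernet import Fernet

relay_keys = [Fernet(Fernet.generate_key()) for _ in range(3)]  # entry, middle, exit

packet = b"GET / HTTP/1.1"
for key in reversed(relay_keys):   # wrap: exit layer innermost, entry layer outermost
    packet = key.encrypt(packet)

for key in relay_keys:             # each relay peels off exactly one layer
    packet = key.decrypt(packet)

print(packet)                      # the plain packet leaves the exit node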

The original intent of launching TOR was to safeguard the privacy of users. However, adversaries have hijacked this good Samaritan objective for various nefarious means instead. As of 2016, around 20% of TOR traffic was associated with illegal activities. In an enterprise network, TOR traffic is typically curtailed by disallowing installation of the TOR client or by blocking Guard (entry) node IP addresses.

However, there are numerous means through which adversaries and malware can access the TOR network to transfer data and information. IP blocking is not a sound strategy, since adversaries can spawn different IPs to carry out the communication. A bad bot landscape report by Distil Networks [5] shows that 70% of automated attacks in 2015 used multiple IPs, and 20% of automated attacks used over 100 IPs.

TOR traffic can be detected by analyzing the traffic packets, either at a TOR node or between the client and the entry node. The analysis is done on a single network flow, where each flow constitutes a tuple of source address, source port, destination address, and destination port.
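As a rough sketch of what extracting a flow means in practice, the snippet below groups packets by this 4-tuple; the packet records and field names are hypothetical, not from the dataset used later in this article:

from collections import defaultdict

# hypothetical packet records: (src_ip, src_port, dst_ip, dst_port, timestamp, size)
packets = [
    ("10.0.0.5", 51234, "93.184.216.34", 443, 0.001, 517),
    ("10.0.0.5", 51234, "93.184.216.34", 443, 0.030, 1448),
    ("10.0.0.7", 40112, "198.51.100.9",  80,  0.045, 60),
]

flows = defaultdict(list)
for src, sport, dst, dport, ts, size in packets:
    flows[(src, sport, dst, dport)].append((ts, size))   # one bucket per flow tuple

for flow_id, pkts in flows.items():
    times = [t for t, _ in pkts]
    print(flow_id, "packets:", len(pkts), "duration:", max(times) - min(times))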

Network flows over different time intervals are extracted and analyzed. G. He et al., in their paper "Inferring Application Type Information from Tor Encrypted Traffic", extracted burst volumes and directions to create an HMM model to detect the TOR applications generating the traffic. Most popular works in this area leverage time-based features, along with other features like size and port information, to detect TOR traffic.

We take inspiration from Habibi Lashkari et al.'s paper "Characterization of Tor Traffic Using Time Based Features" [11] and follow a time-based approach over extracted network flows to detect TOR traffic in this article. However, our architecture also uses a plethora of other meta-information to classify the traffic, which the deep learning architecture chosen for this problem can readily accommodate.

 

Data Experiments – Tor Traffic Detection

We obtained the data from Habibi Lashkari et al. [11] at the University of New Brunswick for the data experiments in this article. Their data consists of features extracted from an analysis of the university's internet traffic. The meta-information extracted from the data is given in the table below:

Table 1: Meta-information parameters obtained from [11]

Meta-Information Parameter | Parameter Explanation
FIAT | Forward Inter-Arrival Time: the time between two packets sent in the forward direction (mean, min, max, std).
BIAT | Backward Inter-Arrival Time: the time between two packets sent in the backward direction (mean, min, max, std).
FLOWIAT | Flow Inter-Arrival Time: the time between two packets sent in either direction (mean, min, max, std).
ACTIVE | The amount of time a flow was active before going idle (mean, min, max, std).
IDLE | The amount of time a flow was idle before becoming active (mean, min, max, std).
FB PSEC | Flow bytes per second.
FP PSEC | Flow packets per second.
DURATION | The duration of the flow.

 

Apart from these parameters, other flow-based parameters are also included. A sample instance from the dataset is shown in Figure 4 below:

Figure 4: An instance of the dataset used for this article.

Please note that the source IP/port and destination IP/port, along with the protocol field, have been removed from each instance, as they cause the model to overfit. We process all the remaining features using a deep feed forward neural network with N hidden layers. The architecture of the neural network is shown in Figure 5 below.
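Before the flows reach the network, the identifying fields must be dropped and the remaining features scaled. A minimal preprocessing sketch along these lines is shown below; the file name and column names are hypothetical and would need to match the actual dataset headers:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("tor_traffic_flows.csv")   # hypothetical file name

# drop the identifying fields that cause the model to overfit
df = df.drop(columns=["Source IP", "Source Port",
                      "Destination IP", "Destination Port", "Protocol"])

y = (df["label"] == "TOR").astype(int).values       # 1 = Tor, 0 = Non-Tor
X = StandardScaler().fit_transform(df.drop(columns=["label"]).values)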

Figure 5: Deep learning network representation used for TOR traffic detection.

The number of hidden layers was varied between 2 and 10, and we found N=5 to be optimal. ReLU activation is used for all the hidden layers. Each hidden layer is dense (fully connected) with a dimension of 100.

from keras.models import Sequential
from keras.layers import Dense

# feature_dim is the number of input features; hidden_layers (N) and neurons_num
# are hyperparameters (N=5 hidden layers of 100 neurons worked best here)
model = Sequential()
model.add(Dense(feature_dim, input_dim=feature_dim, kernel_initializer='normal', activation='relu'))
for _ in range(hidden_layers - 1):
    model.add(Dense(neurons_num, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))  # binary output
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Figure 6: A Python Code Snippet of the FFN in Keras.

The output node is activated by a sigmoid function, since the output is a binary classification: Tor or Non-Tor.

We used Keras with TensorFlow as the backend to train the DL model. Binary cross-entropy loss was used to optimize the FFN. The model was trained for different numbers of epochs. Figure 7 below shows one training run, depicting increasing performance and a decreasing loss value as the number of epochs increases.

Figure 7: TensorBoard-generated statistics depicting the network training process
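A hedged sketch of the training call that produces such curves, assuming X and y prepared as in the preprocessing sketch above and split into X_train/y_train; the epoch count, batch size, and log directory are arbitrary choices:

from keras.callbacks import TensorBoard

tensorboard = TensorBoard(log_dir="./logs")   # writes the curves shown in Figure 7
model.fit(X_train, y_train,
          epochs=50,              # we trained for different epoch counts
          batch_size=64,
          validation_split=0.2,   # hold out part of the data to monitor performance
          callbacks=[tensorboard])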

The results of the deep learning system were compared with various other estimators. The standard classification metrics of recall, precision, and F-score were used to measure the efficacy of the estimators. Our DL-based system was able to detect the TOR class well. However, it is the Non-Tor class that we need to give more importance to, and it is here that a deep learning-based system reduces the false positive cases for Non-Tor samples. The results are shown in the table below:

Table 2: The output of ML and DL Models for the Tor Traffic Detection experiment

Classifier Used | Precision | Recall | F-Score
Logistic Regression | 0.87 | 0.87 | 0.87
SVM | 0.90 | 0.90 | 0.90
Naïve Bayes | 0.91 | 0.60 | 0.70
Random Forest | 0.96 | 0.96 | 0.96
Deep Learning | 0.95 | 0.95 | 0.95
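For reference, per-class metrics like these can be computed with scikit-learn; a minimal sketch, assuming a held-out X_test/y_test split and the trained model from Figure 6:

from sklearn.metrics import classification_report

y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()   # threshold the sigmoid output
print(classification_report(y_test, y_pred, target_names=["Non-Tor", "Tor"]))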

 

Among the various classifiers, the Random Forest and deep learning-based approaches perform better than the rest. The results shown are based on 55,000 training instances, a dataset that is comparatively smaller than those used by typical DL-based systems. As the training data increases, performance should improve further for both the DL-based and Random Forest classifiers.

However, for large datasets, a DL-based classifier typically outperforms other classifiers, and it can be generalised to similar types of applications. For example, if one needs to train a classifier to detect the application being used over TOR, only the output layer needs retraining; all the other layers can be kept the same. Other ML classifiers, by contrast, would need to be retrained on the entire dataset. Keep in mind that retraining a model on a large dataset may take significant computing resources.
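To make the retraining claim concrete, here is a hedged sketch continuing from the Keras snippet in Figure 6: the trained hidden layers are frozen and only a new output head is trained for the new task; num_app_classes is a hypothetical parameter for the number of TOR application types:

model.pop()                         # remove the Tor/Non-Tor sigmoid output layer
for layer in model.layers:
    layer.trainable = False         # keep the learned hidden representations fixed
model.add(Dense(num_app_classes, activation='softmax'))   # new task-specific head
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])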

 

End Notes

Anonymized traffic detection is a nuanced challenge that every enterprise faces. Adversaries use TOR channels to exfiltrate data in anonymous mode. Current approaches from TOR traffic detection vendors depend on blocking known entry nodes of the TOR network; this is not a scalable approach and can be easily bypassed. A more generic method is to use deep learning-based techniques.

In this article, we presented a deep learning-based system that detects TOR traffic with high recall and precision. Let us know your take on the current state of deep learning, or any alternate approaches you use, in the comments section below.

 

References

 

About the Authors

Dr. Satnam Singh, Chief Data Scientist – Acalvio Technologies

Dr. Satnam Singh is currently leading security data science development at Acalvio Technologies. He has more than a decade of experience in building data products from concept to production across multiple domains. In 2015, he was named one of the top 10 data scientists in India. To his credit, he has 25+ patents and 30+ journal and conference publications.

Apart from holding a PhD in ECE from the University of Connecticut, Satnam also holds a Master's in ECE from the University of Wyoming. He is a senior IEEE member and a regular speaker at various Big Data and Data Science conferences.

 

Balamurali A R, Member Technical Staff (Data Science) at Acalvio

Balamurali A R is a member of the data science team at Acalvio. He is a graduate of IIT Bombay, holds a Ph.D. in Computer Science, and has previously worked with companies like Samsung and IBM.

