When you search for security data science on the internet, it’s difficult to find resources with crisp and clear information about the use cases, methods and limitations in Information Security (hereafter referred to as InfoSec). There is usually some marketing material attached. So, I thought of summarising my knowledge and InfoSec experience in this article.
The intended audience for this article is:
- Budding security data scientists
- Security analysts
- Threat hunters
- InfoSec professionals
- Anyone who wants to explore a career path in InfoSec and data science.
Table of Contents
- Why are so many ransomware attacks and data breaches happening now?
- What are the challenges in InfoSec?
- Why does InfoSec need data science?
- What are the data science challenges for InfoSec?
- What are the key data sources and use cases for security data science?
- How did security data science evolve over time?
- How to model an InfoSec use case into a data science problem?
Why are so many ransomware attacks and data breaches happening now?
There are several reasons for this and a few major ones are listed below:
- The attack surface is increasing and the network perimeter has now been dissolved due to mobile, cloud, BYOD, etc.
- Attackers have found a highly efficient way to make quick money using ransomware. In fact, ransomware is now available as a service on the dark web. Due to this, novice attackers can also simply leverage the ransomware service and focus more on the ransom extortion.
- Attackers are also using more tools, like polymorphic malware and zero-day vulnerabilities, to evade the current InfoSec tools.
- The InfoSec defense team has only a limited number of sensors/cameras to watch adversary movements inside the enterprise network (the so-called “insider threats”). Adversaries are almost always at an advantage because they can move freely within the enterprise network once they have compromised a few users.
What are the challenges in InfoSec?
- Information Security is a highly skewed and asymmetric problem. The defense team may need to write nearly 10,000 lines of code to fix a vulnerability and secure the system. However, an adversary only needs to find another vulnerability and write 5-6 lines of code to get around the security patch.
- There are multiple ‘doors’ for an adversary to ‘walk’ into the enterprise network. It’s difficult to guard all of them because the security tools (like firewalls, network-based intrusion detection/prevention tools, host-based intrusion detection tools, anti-virus, etc.) at times cannot distinguish a genuine user from an adversary who has compromised a user’s account.
- The adversaries use the same commands, scripts, and tools that are used by the system administrators. Depending on their skill set, attackers use either existing tools, like Nmap, Metasploit, PowerSploit, etc., or home-grown scripts to execute their attack.
Why does InfoSec need data science?
When the attackers are within the enterprise network, they first need to figure out where they are. Once they accomplish this, they move towards their targets, and carry out the attack. During these reconnaissance queries and movements, they usually leave some traces or signals. These signals are present in the data, and their presence can be detected using data science to raise timely alerts.
Earlier, we used to bring all the data to a security data lake called SIEM (security information and event management). But now with the advancements in data science, correlations across multiple events can be performed in real-time. Using algorithms, we can connect the dots and find patterns that used to be difficult to find manually owing to the shortage of security analysts.
One of the key advantages of data science-based systems is that they learn from the decisions taken by the security analysts. Once trained extensively, these systems can also start taking the same preventive measures/actions as the security analysts.
What are the data science challenges for InfoSec?
The problems in InfoSec are multi-dimensional: there are thousands of features spread across a large number of data sources. We need to detect the adversary’s presence by mining petabytes of machine logs. This is a complex and difficult problem, because the signal-to-noise ratio is very low. Also, connecting the attack sequences among isolated and rare signal events is a significant challenge.
The majority of the security data has no labels, which makes it difficult to apply deep learning networks to a large number of InfoSec use cases. However, the industry is tackling this problem by generating class labels for a few use cases at a time.
For example, the detection of malware and the ranking of malicious websites and DNS domains are primarily done using machine learning techniques. Another successful use case of data science for security is building a baseline for each user/network device/entity within the network and comparing it with real-time data to find rare/abnormal behavior and raise alerts.
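The baselining idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production UEBA model: it assumes a single feature (daily login count per user), a mean/standard-deviation baseline, and a hypothetical threshold of 3 standard deviations. The user names and counts are made-up sample data.

```python
from statistics import mean, stdev

def build_baseline(history):
    """Summarise a user's historical daily counts as (mean, standard deviation)."""
    return mean(history), stdev(history)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly constant history: any change counts as a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical daily login counts per user (illustrative data, not real logs).
login_history = {
    "alice": [10, 12, 11, 9, 10, 13, 11],
    "bob": [3, 4, 2, 3, 5, 4, 3],
}
baselines = {user: build_baseline(counts) for user, counts in login_history.items()}

# Compare today's observations with each user's own baseline.
todays_counts = {"alice": 12, "bob": 40}
anomalies = [user for user, count in todays_counts.items()
             if is_anomalous(count, baselines[user])]
print(anomalies)
```

Here only `bob` is flagged, because 40 logins is far outside his usual range, while 12 logins is normal for `alice`. Real systems typically baseline many features per entity and use more robust statistics.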
These user-behavior-based anomalies are more than 100 times fewer than rule-based anomalies. However, their number is still quite high, and many of them end up being false positives. In short, security data science is not a silver bullet for InfoSec. We need to combine multiple technologies with it to improve the defense.
Figure 1: Data Sources and Use Cases for Security Data Science
What are the key data sources and use cases for security data science?
The InfoSec domain has a large number of logs. The data volume and variety depend on the organisation’s size and domain. Most of the big MNCs use 20-50 InfoSec tools and record the data into hot and cold storage. They use so-called “security data lakes” or Security Information and Event Management (SIEM) tools to store recent data (e.g. DNS logs, authentication logs, Windows security logs, etc.) for monitoring threats. Data older than a few months, or high-volume data (e.g. NetFlow, Bro (now Zeek) logs, etc.), is pushed to cold storage in Hadoop-based systems.
Here is a list of typical data sources in InfoSec:
- Endpoints: Processes, applications, host-based IDS alerts, file system changes, registry changes, operating system logs, anti-virus alerts.
- Network: Network packets and flows, network IDS/IPS alerts, network topology, firewall logs, HTTP proxy logs, DNS logs, NetFlow, Bro logs.
- Authentication: Windows/Mac/Linux authentication logs, Windows security logs, Active Directory logs, privileged user management logs.
- Threat Intelligence: Indicators of compromise, malicious domain names, IP addresses from peer organizations and open source communities, malware signatures.
- Asset management logs
- Vulnerability logs
All these logs provide a lot of visibility about the adversary’s presence and activities. The table below summarises various use cases according to the data source type. Figure 1 (above) shows that these use cases are typically solved using anomaly detection and ML techniques.
Table 1: Use Cases for Security Data Science
| # | Network logs – Use cases | Endpoint logs – Use cases | Authentication logs – Use cases |
|---|---|---|---|
| 1 | Unusual volume of network traffic from a host/network device | Anomalous New Listening Ports/Services/Processes | Excessive Failed Logins – Brute Force Attack |
| 2 | Network intrusion detection (Scanning, Spoofing detection etc.) | Host with Excessive No. of Listening Ports/Services/Processes | Default Account Usage |
| 3 | Application attack detection (Top 10 OWASP attacks) | Malware detection and classification | User Behavior Analytics |
| 4 | Reputation of DNS servers and CnC Detection | Spyware, Ransomware detection | Active directory and Privilege user monitoring |
| 5 | Substantial increase in Port activity/Events | Prohibited Process/Service creation | Geographically improbable authentication detection |
| 6 | Detection of unapproved port activity | Host with multiple infections | Brute force access behavior detection |
| 7 | DNS tunnel attack detection | Unusual registry changes | Spam Mitigation |
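To make one of the rows in Table 1 concrete, here is a minimal sketch of geographically improbable authentication detection. It assumes each login has already been geolocated to a latitude/longitude (for example from the source IP address), and it uses a hypothetical speed limit of 900 km/h, roughly the cruising speed of a commercial aircraft. The function names and the tuple layout are illustrative assumptions, not taken from any specific product.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def is_improbable_travel(login_a, login_b, max_speed_kmh=900.0):
    """Each login is (epoch_seconds, latitude, longitude) for the same user.

    Returns True if moving between the two locations would require a speed
    above `max_speed_kmh` (an assumed threshold).
    """
    (t1, lat1, lon1), (t2, lat2, lon2) = sorted((login_a, login_b))
    distance = haversine_km(lat1, lon1, lat2, lon2)
    hours = (t2 - t1) / 3600.0
    if hours == 0:
        # Simultaneous logins: improbable only if the locations differ.
        return distance > 0
    return distance / hours > max_speed_kmh

# Hypothetical example: Bangalore login, then New York login one hour later.
bangalore = (0, 12.97, 77.59)
new_york = (3600, 40.71, -74.01)
print(is_improbable_travel(bangalore, new_york))
```

In practice, IP geolocation is noisy and VPNs move users around legitimately, so a check like this is usually one signal among several rather than an alert on its own.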
How did security data science evolve over time?
Security data science has evolved in four phases, as shown in Figure 2 below.
Phase 1 – Rule-based and Anomaly Detection systems
Since the 1990s, data science has played an increasingly important role in information security. This started with rule-based approaches to finding anomalies in intrusion detection systems (IDS) and intrusion prevention systems (IPS). Most firewalls and network/host IDS/IPS are either rule-based or anomaly detection-based systems.
Rules are written by security experts, and the system raises alerts based on them. For instance, failed authentications beyond a specific count indicate a brute force attack. However, these rules don’t capture the dynamic nature of events or the context around them.
Anomaly detection systems are based on models of the normal behavior of hosts and networks. Whenever there is a significant deviation from the normal behavior, they raise alerts. Anomaly detection algorithms, such as Clustering, Robust-PCA, SVD, One-Class SVM, DBSCAN and KDE, are used to detect anomalous events.
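The brute force rule mentioned above can be sketched as follows. This is a minimal illustration assuming a hypothetical event format of `(timestamp_seconds, user, success)` sorted by time, and assumed thresholds of more than 5 failures within 60 seconds. Real products express such rules in their own rule languages.

```python
from collections import defaultdict, deque

def detect_brute_force(events, max_failures=5, window_seconds=60):
    """Return the set of users with more than `max_failures` failed logins
    inside any sliding window of `window_seconds`.

    `events` is an iterable of (timestamp_seconds, user, success) tuples,
    assumed to be sorted by timestamp.
    """
    recent_failures = defaultdict(deque)  # user -> timestamps of recent failures
    flagged = set()
    for timestamp, user, success in events:
        if success:
            continue
        window = recent_failures[user]
        window.append(timestamp)
        # Drop failures that have fallen out of the time window.
        while timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_failures:
            flagged.add(user)
    return flagged

# Hypothetical sample: six rapid failures for "admin", two for "alice".
sample_events = [(t, "admin", False) for t in range(6)]
sample_events += [(10, "alice", False), (20, "alice", False), (30, "alice", True)]
sample_events.sort()
print(detect_brute_force(sample_events))
```

The fixed count and window illustrate the limitation noted above: the rule knows nothing about who the user is, what is normal for them, or what else happened around the failures.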
Anomaly-based algorithms are used in networks to detect:
- anomalous ports
- unusual traffic from a host
- excessive DNS failures
- endpoints having unusual processes/applications/registry changes
- users/hosts having unusual behaviors
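As a simple illustration of the "unusual traffic from a host" case, the sketch below scores each host against its peers. Note that it swaps in a much simpler technique than the algorithms named above (One-Class SVM, Robust-PCA, etc.): a modified z-score based on the median and the median absolute deviation (MAD), with the commonly used cut-off of 3.5. The host names and byte counts are made-up sample data.

```python
from statistics import median

def modified_z_scores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD for each value."""
    values = list(values)
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median; this simple score cannot rank anything.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]

def unusual_hosts(traffic_by_host, cutoff=3.5):
    """Return hosts whose traffic volume deviates strongly from their peers."""
    hosts = list(traffic_by_host)
    scores = modified_z_scores([traffic_by_host[h] for h in hosts])
    return [h for h, s in zip(hosts, scores) if abs(s) > cutoff]

# Hypothetical megabytes sent per host in the last hour (illustrative data).
traffic_mb = {"host-a": 120, "host-b": 130, "host-c": 110, "host-d": 125, "host-e": 5000}
outliers = unusual_hosts(traffic_mb)
print(outliers)
```

Only `host-e` is flagged here. The cut-off controls the trade-off between missed detections and false alarms.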
Unfortunately, most anomaly detection systems raise a large number of false alarms and need a lot of security analysts to validate the alerts.
Figure 2: Evolution of the Security Data Science
Phase 2 – Security Data Lakes/SIEM
In the early 2000s, the second generation of security tools emerged. These facilitated triaging alerts by correlating multiple data sources in a security data lake, known as security information and event management (SIEM) tools. SIEM tools worked well while data volumes were manageable, but in the big data era they are slow and lack an intelligence layer.
Phase 3 – UEBA, Malware detection
With the advances in Big data frameworks, a new form of security data science has evolved. Now, it is possible to boil the ocean of raw logs in real time and raise alerts. This gave rise to user and entity behavior analytics (UEBA) that leverages Hadoop/Spark and anomaly detection techniques to raise real-time alerts whenever there is abnormal behavior of hosts/users within the enterprise network.
This has enabled enterprises to detect insider attacks. However, the anomaly-based solutions have a drawback of generating a large number of false-positive alerts. Each investigation of a false-positive alert adds a significant burden to an already overloaded security analyst.
Another emerging area that is rapidly gaining traction is endpoint security where deep learning is used to detect and classify malware in real time. Supervised ML algorithms such as Deep Learning Networks (ANN, RNN, CNN), Random Forest and XGBoost are used to classify malicious scripts vs benign scripts, detect DNS tunnels, detect C&C servers, detect malware, detect known network scans, application attacks, and many more known threats that have labels available for training the system.
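Supervised models like these need numeric features extracted from the raw logs. As a hedged illustration for the DNS tunnel use case, the sketch below computes a few features that are commonly used for this purpose: query length, longest label length, and the Shannon entropy of the subdomain. The feature set and function names are assumptions for illustration, not the features used in any specific product. The resulting dictionaries would then be fed, together with labels, to a classifier such as Random Forest or XGBoost.

```python
from collections import Counter
from math import log2

def shannon_entropy(text):
    """Shannon entropy (bits per character) of a string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def dns_features(query_name):
    """Extract simple numeric features from a DNS query name."""
    name = query_name.rstrip(".").lower()
    labels = name.split(".")
    # Naive assumption: the registered domain is the last two labels.
    subdomain = ".".join(labels[:-2])
    return {
        "query_length": len(name),
        "longest_label": max(len(label) for label in labels),
        "subdomain_entropy": shannon_entropy(subdomain),
    }

# Hypothetical queries: a normal lookup and a tunnel-like encoded payload.
benign = dns_features("www.example.com")
suspicious = dns_features("a9f3k2l8q0z7x1c5v6b4n2m8.t.example.com")
print(benign)
print(suspicious)
```

Tunnelled data tends to produce long, high-entropy subdomains, so these features separate the two hypothetical queries clearly. Real traffic (for example CDN host names) is less clear-cut, which is why labelled training data matters.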
Phase 4 – Deception-triggered data science
In this phase, we bring a paradigm shift to the InfoSec field. In this defense, we first deploy deceptions (a reincarnation of honeypots, honeynets, honeywords, etc.) in the enterprise network. Then, we leverage data science to profile the adversary’s behavior and movements within the network. We termed this research “deception-triggered data science.”
Deception-triggered data science is significantly different from conventional security data science. The latter primarily leverages anomaly detection techniques to identify anomalous behavior in network traffic, or in user/host/network element behavior. In contrast, deception-triggered data science starts from a real attack, i.e., an anomaly announced by a deception event, and hence does not require anomaly detection algorithms.
Deception alerts are high-fidelity alerts. Data science correlates other security event data with these high-fidelity alerts to generate insights about the adversary’s behavior. In this approach, we collect and describe the context around a deception alert instead of looking for anomalies like a needle in a haystack. This kind of data science can therefore focus on capturing how an attack begins and how it progresses.
To draw on a metaphor, comparing deception-triggered data science to brute-force security data science is like comparing boiling a cup of tea to boiling an entire ocean. The former is practical, clever and elegant; the latter is expensive, cumbersome and impractical. Deception-triggered data science significantly reduces false positives, thereby reducing the overall infrastructure and maintenance cost associated with security-related chores.
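The correlation step can be sketched as follows. This is a minimal illustration under stated assumptions: events from different log sources have already been normalised into dictionaries with hypothetical `timestamp`, `host`, `source` and `detail` fields, and the context of interest is everything the same host did within 5 minutes of the deception alert. The field names, window size and sample data are all made up for the example.

```python
def context_for_alert(alert, events, window_seconds=300):
    """Collect events from the alerting host within +/- `window_seconds`
    of the deception alert, ordered by time."""
    start = alert["timestamp"] - window_seconds
    end = alert["timestamp"] + window_seconds
    related = [
        event for event in events
        if event["host"] == alert["host"] and start <= event["timestamp"] <= end
    ]
    return sorted(related, key=lambda event: event["timestamp"])

# Hypothetical deception alert: a host touched a decoy database server.
deception_alert = {"timestamp": 1000, "host": "10.0.0.7", "decoy": "fake-db-server"}

# Hypothetical normalised events from other data sources.
events = [
    {"timestamp": 400, "host": "10.0.0.7", "source": "auth", "detail": "login as svc_backup"},
    {"timestamp": 1020, "host": "10.0.0.7", "source": "endpoint", "detail": "powershell.exe started"},
    {"timestamp": 950, "host": "10.0.0.9", "source": "proxy", "detail": "GET /index.html"},
    {"timestamp": 880, "host": "10.0.0.7", "source": "dns", "detail": "lookup fake-db-server"},
]

timeline = context_for_alert(deception_alert, events)
for event in timeline:
    print(event["timestamp"], event["source"], event["detail"])
```

Because the starting point is a high-fidelity alert, the search is limited to one host and a short time window instead of the full log volume.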
More details about this topic can be found in my talk at Splunk .conf 2016 (#4 in the references at the bottom of this article). To read more about deception, please refer to Almeshekah and Spafford’s paper listed in point #5.
Figure 3: Security Data Science Methods
How to model an InfoSec use case into a data science problem?
Most of the InfoSec problems can be modeled using anomaly detection and machine learning techniques, as shown with an example in Figure 3 above. I have shared the details of algorithms, feature engineering and data science pipeline for several InfoSec case studies during my webinars.
The video titles mentioned below, along with the timeline, contain the InfoSec use cases. The links to these videos are in the references section at the bottom of this article.
- Data exfiltration detection using anomaly detection [Webinar, timeline – 26:36-37:50]
- Detect Command and Control (C&C) Center [Webinar, timeline – 19:42-29:20]
- PowerShell Obfuscation and Detection [Webinar, timeline – 29:20-38:45]
I have put together links to the datasets, papers and talks related to Security Data Science. In the references below, use the GitHub links [6, 7, 8] to learn more. Enjoy learning 🙂
About the Author
Dr Satnam Singh is currently leading security data science development at Acalvio Technologies. He has more than a decade of work experience in successfully building data products from concept to production in multiple domains. In 2015, he was named one of the top 10 data scientists in India. To his credit, he has 25+ patents and 30+ journal and conference publications.
Apart from holding a PhD degree in ECE from the University of Connecticut, Satnam also holds a Masters in ECE from the University of Wyoming. Satnam is a senior IEEE member and a regular speaker at various Big Data and Data Science conferences.