Which way to go for machine learning in cybersecurity anomaly detection?

Introduction

Artificial intelligence (AI), and more specifically machine learning (including deep learning), is used to detect domain names produced by DGAs (domain generation algorithms, used by malware for C&C communication), malicious PowerShell scripts and phishing URLs. A big advantage of this approach is that we can integrate a pre-trained model into our detection infrastructure. However, the “classical” machine learning approach is only possible if we have a labelled and ideally balanced dataset (for example, in a binary classification task, the dataset is considered balanced if the numbers of benign and malicious samples are almost the same).

In the case of network attacks or intrusions, this approach is not always the most appropriate for detection, especially when analyzing reconstructed network flows. Here, an anomaly detection approach using machine learning models can be a valuable asset.

What is anomaly detection?

By definition, an anomaly has two main characteristics: it is a rare event, and it differs significantly from the norm. In the context of network anomaly detection, anomalies are events (DNS sessions, HTTP sessions, etc.) whose characteristics are quite different from those we usually observe in legitimate, common network traffic. It is important to note that an anomaly is not necessarily an intrusion or an attack. It can nevertheless be useful to report any abnormal behaviour on the network to the analyst. Conversely, the attacks we target are those that match the characteristics expected of an anomaly, and that will therefore be detected.

Several approaches are possible when building an anomaly detection model. The two main ones are:

The unsupervised approach (outlier detection): in this case, the dataset used for training includes a large proportion of samples that correspond to normal behaviour. It should be noted that, unlike “classic” machine learning, the dataset here is highly imbalanced.

The semi-supervised approach (novelty detection): the dataset used during training is built on the presumption that all data points in this unlabelled dataset are normal. Note that the term “semi-supervised” has a different meaning here than in “classical” machine learning.

One of the big advantages of anomaly detection is that it is possible to customize the training phase according to the user’s specific data.
In the first case (unsupervised approach), once trained, the model will be able to detect possible anomalies.
In the second case (semi-supervised approach), it is assumed that the part of the network traffic used for training is benign. In both approaches, and depending on the amount of traffic available, the model can be trained with a few hours or days of data. However, this customization presents difficulties and requires more integration effort: in addition to training on data extracted from each user’s network, it is sometimes necessary to re-train the model on a regular basis if there are significant changes in the presumed ‘normal’ network traffic (due to a change in network infrastructure, for example).

For both approaches, there are specific machine learning algorithms, some based on deep neural networks and others not. For the unsupervised approach, one popular example is the isolation forest algorithm, which builds random decision trees from the features provided in the dataset. This algorithm is based on the following principle: an observation corresponding to an anomaly requires a short path in a decision tree (i.e. a low number of splits) to be isolated. For the semi-supervised approach, the two best-known algorithms are the One-Class SVM and the autoencoder. An autoencoder is based on neural networks: an encoder part takes the initial data as input and generates a “compressed” version in a space (called the latent space) of lower dimension than the initial space. The decoder part takes the “compressed” version as input and tries to reconstruct the data in the original space. Autoencoders can be used for anomaly detection in the following way: an event is considered an anomaly if the decoder has “difficulty” reconstructing the original data from the latent space, that is, if the reconstruction error is high.
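The isolation principle can be illustrated with a minimal sketch, assuming purely random splits on numeric features. This is a toy written for illustration, not the model used in AIONIQ: the function names (`build_tree`, `path_length`, `mean_path_length`) and the synthetic data are hypothetical, there is no subsampling, and the path-length normalization of the full algorithm is omitted. In practice a library implementation would normally be used.

```python
import random

def build_tree(points, depth, max_depth, rng):
    """Recursively split the points on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}  # leaf: nothing left to separate
    feature = rng.randrange(len(points[0]))
    values = [p[feature] for p in points]
    low, high = min(values), max(values)
    if low == high:
        return {"size": len(points)}  # all values identical on this feature
    split = rng.uniform(low, high)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, point):
    """Number of splits traversed before the point reaches a leaf."""
    depth = 0
    while "feature" in tree:
        branch = "left" if point[tree["feature"]] < tree["split"] else "right"
        tree = tree[branch]
        depth += 1
    return depth

def mean_path_length(trees, point):
    """Average path length over all trees; shorter means easier to isolate."""
    return sum(path_length(t, point) for t in trees) / len(trees)

# Synthetic data (assumption): 200 "normal" points around the origin, one far outlier.
rng = random.Random(0)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (10.0, 10.0)
trees = [build_tree(normal + [outlier], 0, 12, rng) for _ in range(100)]

outlier_depth = mean_path_length(trees, outlier)
typical_depth = sum(mean_path_length(trees, p) for p in normal) / len(normal)
print(f"outlier: {outlier_depth:.1f} splits, typical point: {typical_depth:.1f} splits")
```

The outlier is separated from the rest after far fewer splits than a typical point, which is what the algorithm turns into an anomaly score.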

In all cases, an anomaly detection model provides an anomaly score as output (the higher the score, the more likely the event is suspicious). Once the training is completed, it is necessary to choose the score threshold above which a new sample will be considered an anomaly. This step represents an additional difficulty in the integration of this type of model, but allows a certain flexibility: in a cybersecurity context, we prefer a low number of false positives (false alerts), even if it means being a little less effective in detection, and therefore a high score threshold.
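One common way to pick such a threshold, sketched here under the assumption that a set of scores from presumed-normal traffic is available, is to take a high quantile of those scores so that the expected false positive rate stays below a chosen target. The function name `threshold_for_fpr` and the example scores are hypothetical.

```python
import math

def threshold_for_fpr(normal_scores, target_fpr):
    """Return an observed score such that at most `target_fpr` of the
    presumed-normal scores lie strictly above it."""
    ordered = sorted(normal_scores)
    allowed = math.floor(target_fpr * len(ordered))  # normal samples we accept to flag
    index = max(len(ordered) - allowed - 1, 0)
    return ordered[index]

def is_anomaly(score, threshold):
    """Higher score means more suspicious, so flag scores above the threshold."""
    return score > threshold

# Hypothetical anomaly scores measured on traffic presumed to be normal.
normal_scores = [i / 100 for i in range(100)]
threshold = threshold_for_fpr(normal_scores, target_fpr=0.05)
flagged = sum(is_anomaly(s, threshold) for s in normal_scores)
print(f"threshold={threshold}, false positives on normal data: {flagged}/{len(normal_scores)}")
```

Lowering `target_fpr` raises the threshold, which reduces false alerts at the cost of missing anomalies whose scores fall just below it.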

Examples of anomaly detection applications in a Network Detection and Response (NDR) approach

In AIONIQ, Gatewatcher’s NDR solution, the data processed by our network monitoring system is mainly reconstructed network flow metadata. We have developed several attack detection models with the anomaly detection approach. We present two of them below, both of which will be integrated in a future version of AIONIQ.

The first model focuses on ransomware detection. Most ransomware will not only encrypt files locally on an infected computer, but will also try to encrypt files available via network shares. In many cases, these network shares use the SMB protocol. SMB sessions from a ransomware infection are expected to behave very differently from normal SMB sessions (e.g. in terms of the number of commands to read and write files), which fits our definition of an anomaly well. We have thus developed a ransomware detection model based on this principle using a semi-supervised approach: the model “learns” the normal network traffic, and can then be used to make predictions, after carefully choosing a threshold value for the anomaly score in order to have a very low false positive ratio. The model performance has been evaluated by reproducing the SMB traffic generated by ransomware samples.
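The semi-supervised idea can be sketched with a deliberately simple stand-in: a per-feature z-score profile learned only from sessions presumed normal. This is not the model described above, and the features (read and write command counts per SMB session), the function names and all values are hypothetical.

```python
import statistics

def fit_normal_profile(sessions):
    """Learn per-feature mean and standard deviation from presumed-normal sessions."""
    columns = list(zip(*sessions))
    # Fall back to 1.0 when a feature is constant, to avoid dividing by zero.
    return [(statistics.mean(c), statistics.pstdev(c) or 1.0) for c in columns]

def anomaly_score(profile, session):
    """Largest absolute z-score across features; higher means more unusual."""
    return max(abs(x - mean) / std for x, (mean, std) in zip(session, profile))

# Hypothetical (read_commands, write_commands) counts for normal SMB sessions.
normal_sessions = [(12, 3), (20, 5), (15, 4), (18, 2), (10, 6), (25, 4)]
profile = fit_normal_profile(normal_sessions)

usual_score = anomaly_score(profile, (16, 4))         # resembles the training data
suspect_score = anomaly_score(profile, (900, 850))    # mass read/write activity
print(f"usual session: {usual_score:.2f}, suspect session: {suspect_score:.2f}")
```

Only normal sessions are used for fitting; a session with far more read and write commands than anything seen during training receives a much higher score.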

We have also developed a Kerberoasting detection model, this time using an unsupervised approach. This attack is characterised first and foremost by a large number of requests for Kerberos Service Tickets (TGS) associated with distinct services over a short period of time. We therefore expect to see a large number of TGS requests associated with a large number of distinct services, i.e. a behaviour that differs from that of a real user, who cannot request so many services “manually” in such a short period of time.
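The two quantities mentioned above (number of TGS requests and number of distinct services in a short period) can be computed per client and per time window as in the following sketch. The event format `(timestamp, client, service)`, the 60-second window, the function name `tgs_window_features` and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

def tgs_window_features(events, window_seconds=60):
    """Map (client, window index) to (TGS request count, distinct service count)."""
    requests = defaultdict(int)
    services = defaultdict(set)
    for timestamp, client, service in events:
        key = (client, int(timestamp // window_seconds))
        requests[key] += 1
        services[key].add(service)
    return {key: (requests[key], len(services[key])) for key in requests}

# Hypothetical events: an ordinary user, then a host requesting 50 distinct services.
user_events = [
    (3.0, "10.0.0.5", "cifs/fileserver"),
    (40.0, "10.0.0.5", "http/intranet"),
    (45.0, "10.0.0.5", "cifs/fileserver"),
]
burst_events = [(10.0 + i * 0.2, "10.0.0.66", f"svc{i}/host{i}") for i in range(50)]

features = tgs_window_features(user_events + burst_events)
for (client, window), (count, distinct) in sorted(features.items()):
    print(client, window, count, distinct)
```

Feature vectors of this kind could then be passed to an unsupervised model; the choice of window length is a tuning decision that depends on the traffic being monitored.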

However, in some environments, one may find machines configured to automatically request a large number of services as part of their normal activity. These machines will not differ from true Kerberoasting in terms of features, so we cannot be sure that the data on which our model will be trained does not contain any anomalies. Therefore, the unsupervised approach seems to be the most suitable for this detection problem. The model learns on “mixed” data. It is designed to assign higher scores to real Kerberoasting campaigns than to the similar-looking anomalies that can be found in an infrastructure.

Both detection models were tested on real network traffic: first for training and then for predictions. Finally, we tested the two models on traffic coming from attacks in order to verify their detection capability. For the first model, we used network captures (PCAPs) coming from the execution of ransomware; for the second model, we used Kerberoasting flows generated by tools that attackers use to carry out the attack.

Author: Gatewatcher Machine Learning Team