While the term AI (Artificial Intelligence) is now widely used, its usage sometimes remains unclear or inaccurate. Minsky and McCarthy, pioneers in the field, described AI as “any task performed by a machine that would require intelligence from a human if he were to perform it.” Today, AI has expanded to include machine learning and deep learning techniques. It is a complex field, especially when you consider that rule engines and older inferential statistical methods (linear regression, etc.) are also part of it. Now that AI is one of the components of cybersecurity solutions, it has become essential to understand its role and its specific characteristics. To get a clearer picture, we asked four engineers at Gatewatcher: Philippe, Aubin, Jérôme and Hugo.
The integration of artificial intelligence in cybersecurity
When was AI first integrated into cybersecurity solutions?
AI has been used for about ten years in specific areas such as fraud detection and spam identification. Over the past three to four years, its use in cybersecurity solutions has risen sharply. This can be explained by the development of machine learning and deep learning, and by the tools now available.
What are the main branches of AI, and how do they differ?
Machine learning is the fastest growing branch of AI, and it is also the one most often confused with AI itself. It is a fairly old field, with the first technical advances dating back to the 1950s. Over the last fifteen years, it has experienced a strong revival thanks to the increase in computing power and in the amount of data available: this is the advent of Big Data. From a technical point of view, machine learning is a technology based on statistics that allows systems to “learn” autonomously from data, without being explicitly programmed for the task. Instead of rules being formalized manually, the model infers them from the data.
The field itself is divided into several families, the main ones being:
- Supervised learning uses pre-classified data for training in order to predict the class of new data. In cybersecurity, this involves using “records” of past attacks to try to identify similar future attacks. The advantages of supervised learning are twofold: this branch is quite mature and the algorithms work well within reasonable computation times. But this type of learning requires labeled datasets for training, which are often time consuming or expensive to obtain. Moreover, it cannot, a priori, detect classes that are not present in the training set.
- Unsupervised learning uses unlabeled data or data whose classes are not known. The data is grouped into clusters, which are homogeneous groups of data according to the available attributes. This branch is used, for example, for anomaly detection. It has advantages: no data labeling is needed (which makes it more widely usable) and it makes it possible to identify new trends or classes. Nevertheless, the algorithms are often expensive in terms of processing time and the results are often complicated to interpret: they call for in-depth analysis and high computing power.
- Semi-supervised learning is used for partially labeled data or for data in which new classes are expected to appear. The methods aim to determine the underlying distribution of the training data and to detect variations when new data arrives. This family is particularly useful for anomaly detection or diagnostics. It is relatively easy to implement and does not require exhaustive labeling. Nevertheless, it does not tolerate errors in the initial labels and its fields of application are restricted.
- Deep learning is a family of machine learning algorithms, covering both supervised and unsupervised cases. It is based on deep artificial neural networks that extract successive levels of information from the original data and turn them into a meaningful output. This specific field of machine learning is able to handle complex tasks. The main algorithms for image recognition or NLP (natural language processing) are based on it, but they are extremely expensive in terms of computing power and of the data needed for training. For example, GPT-3, a language model published by OpenAI in 2020 and one of the most powerful at the time of its release, has 175 billion parameters and was trained on a corpus of nearly 500 billion tokens drawn largely from a filtered crawl of the web.
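The difference between the supervised and unsupervised families above can be sketched in a few lines of Python. This is only a toy illustration, not the method used in any product: the two numeric features, the labels `"benign"` and `"attack"`, and the function names are all assumptions made for the example. The supervised part is a nearest-centroid classifier; the unsupervised part is a plain k-means seeded with the first `k` points.

```python
from math import dist
from statistics import fmean


def centroid(points):
    """Mean point of a list of equal-length numeric tuples."""
    return tuple(fmean(axis) for axis in zip(*points))


def fit_supervised(labeled):
    """Supervised: learn one centroid per known label from (point, label) pairs."""
    by_label = {}
    for point, label in labeled:
        by_label.setdefault(label, []).append(point)
    return {label: centroid(pts) for label, pts in by_label.items()}


def predict(model, point):
    """Return the label whose learned centroid is closest to the point."""
    return min(model, key=lambda label: dist(model[label], point))


def fit_unsupervised(points, k=2, iterations=10):
    """Unsupervised: plain k-means, seeded with the first k points (deterministic)."""
    centers = list(points[:k])
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(centers[i], p))
            clusters[nearest].append(p)
        # Keep the previous center if a cluster ended up empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical two-feature observations, labeled only for the supervised case.
labeled = [
    ((1.0, 1.0), "benign"), ((1.2, 0.8), "benign"),
    ((8.0, 9.0), "attack"), ((9.0, 8.5), "attack"),
]
model = fit_supervised(labeled)
label = predict(model, (8.5, 9.2))

# Same kind of points without labels: the groups are found, but not named.
centers, clusters = fit_unsupervised([(1.0, 1.0), (8.0, 9.0), (1.2, 0.8), (9.0, 8.5)])
```

Note that `predict` always returns one of the labels seen during training, even for a point unlike anything in the training set, which is the limitation of supervised learning mentioned above. The k-means clusters, in turn, come without names and still have to be interpreted by an analyst.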
What are the specific use cases for each branch of AI?
Supervised models can be used to detect attacks or malware whose characteristics are already known and referenced, or to enhance existing detection systems.
Semi-supervised and unsupervised models can be used to detect anomalies, for example in user behavior or network flows.
Unsupervised models can also be used to identify new threats and improve the analysis of existing alerts.
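As a rough illustration of anomaly detection on network flows, the sketch below learns a baseline from traffic assumed to be normal and flags values that deviate strongly from it. It uses a simple z-score threshold as a stand-in for the more elaborate models discussed here; the feature (bytes per flow), the sample values and the threshold of 3 are assumptions chosen for the example.

```python
from statistics import fmean, pstdev


def fit_baseline(values):
    """Learn the mean and standard deviation of traffic assumed to be normal."""
    return fmean(values), pstdev(values)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value whose z-score against the baseline exceeds the threshold."""
    mean, std = baseline
    if std == 0:
        # Degenerate baseline: any deviation at all is unusual.
        return value != mean
    return abs(value - mean) / std > threshold


# Hypothetical bytes-per-flow measurements observed during normal operation.
normal_flows = [980, 1010, 1000, 990, 1020, 1005, 995]
baseline = fit_baseline(normal_flows)

for observed in (1015, 50_000):
    print(observed, "anomalous" if is_anomalous(observed, baseline) else "normal")
```

No labeled attacks are needed here: only a notion of what “normal” looks like. The price is that a flagged value says nothing about why it is unusual, so an analyst still has to interpret it.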
Artificial intelligence in Trackwatch
What is the AI part of the Trackwatch solution? How does it complement the engines?
In Trackwatch, the AI has a role in detecting anomalies in network flows and detecting malware.
Machine learning is applied to the detection of domain names produced by a DGA (Domain Generation Algorithm) and to the detection of malicious PowerShell scripts.
The presence of DGA-generated domain names is a strong indicator of compromise. Attackers often use HTTP requests with domain names generated by certain types of algorithms to connect malware to command and control servers. These domain names have different properties from legitimate domain names. Traditional detection approaches (blacklisting, etc.) are not sufficient because they do not generalize to other malware strains. The domain names present in the DNS events captured by the GCaps are analyzed by the machine learning engine, which returns, for each event, the probability that the domain name was generated by a DGA. The engine uses a pre-trained model whose architecture is based on an LSTM (Long Short-Term Memory) deep neural network. Supervised learning is performed on domain names only: no additional contextual information (NXDomain…) is used.
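To make the idea of learning “only from domain names” more concrete, here is a minimal sketch of two preprocessing steps that such a pipeline might use. It is not the Trackwatch model and does not include an LSTM: `encode_domain` shows one possible way to turn a domain name into the fixed-length integer sequence a character-level network would consume, and `shannon_entropy` is a classic hand-made feature illustrating how algorithmically generated names can differ from legitimate ones. The vocabulary, the padding scheme and `max_len=32` are assumptions.

```python
import math
import string
from collections import Counter

# Hypothetical vocabulary: id 0 is reserved for padding, id 1 for unknown characters.
ALPHABET = string.ascii_lowercase + string.digits + "-."
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_domain(domain, max_len=32):
    """Map a domain name to a fixed-length list of character ids (truncate, then pad)."""
    ids = [CHAR_TO_ID.get(ch, 1) for ch in domain.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def shannon_entropy(text):
    """Shannon entropy, in bits per character, of the character distribution of text."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


# A dictionary-like label versus a random-looking one (both made up for the example).
for label in ("google", "xkqzjvwpbt"):
    print(label, encode_domain(label, max_len=12), round(shannon_entropy(label), 2))
```

Entropy alone is a weak signal (it grows with length and misses dictionary-based DGAs), which is one reason a model trained on the raw character sequence can generalize better than a single hand-crafted feature.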
For malicious PowerShell scripts, detection is based on a supervised machine learning model and on the fact that these scripts generally use obfuscation or similar techniques (base64, concatenation, type conversion…).
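The obfuscation techniques listed above can be turned into numeric features for a supervised model. The sketch below is a hypothetical feature extractor, not the one used in Trackwatch: the three regular expressions, the 20-character minimum for a base64 run and the feature names are assumptions, and a real detector would rely on many more signals.

```python
import base64
import binascii
import re

# Assumed heuristics, chosen for illustration only.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")      # long base64-looking runs
CONCAT = re.compile(r"""['"]\s*\+\s*['"]""")               # 'abc' + 'def'
CHAR_CAST = re.compile(r"\[char\]\s*\d+", re.IGNORECASE)   # [char]73


def decodes_as_base64(candidate):
    """Return True if the candidate string is valid, correctly padded base64."""
    try:
        base64.b64decode(candidate, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


def extract_features(script):
    """Count simple obfuscation indicators in the text of a PowerShell script."""
    runs = [m for m in BASE64_RUN.findall(script) if decodes_as_base64(m)]
    return {
        "base64_runs": len(runs),
        "string_concats": len(CONCAT.findall(script)),
        "char_casts": len(CHAR_CAST.findall(script)),
    }


# Made-up script combining the three techniques.
encoded = base64.b64encode(b"Write-Host 'hello world'").decode()
sample = f"powershell -enc {encoded}; $c = 'Inv' + 'oke'; [char]73"
print(extract_features(sample))
```

Feature dictionaries like this one, computed over a labeled set of benign and malicious scripts, could then be fed to any supervised classifier. None of these indicators is malicious in itself, which is why they work better as inputs to a model than as standalone rules.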
What are the advantages and disadvantages of a detection system with AI compared to a detection system without AI?
On the plus side, AI can process the huge amount of data available and “understand” it, identify links between different data sources for better detection, save analysts time on time-consuming tasks, and enable zero-day detection of new attacks. On the downside, AI is more cumbersome to implement for simple, well-known cases.
How does the AI in Trackwatch work and what does it do?
Some behaviors (often referred to as deflection) can be modeled and are well suited to machine learning. This type of attack is often complex because it uses camouflage techniques that make it closely resemble legitimate traffic. Trackwatch’s AI has therefore been designed in a scalable way to study these anomalous behaviors. Because we are able to collect a lot of data, our algorithms can learn quickly and isolate threats.
What are the advantages of AI in Trackwatch?
We have a major advantage thanks to the premise we adopted from the start: instead of trying to detect everything with AI, we take a more pragmatic and realistic approach. We study the categories of attacks and check which technologies are suited to each threat. If AI is a good fit, we add AI-based detection algorithms. If it is not, we adapt our other engines to the threat. Going “full AI” seems to us to be a heresy in itself. The AI is therefore optimized to detect certain types of threats, and it is constantly evolving and learning.