Published on Connect (Diamond) by Florian Maury/Gatewatcher.
In April 2019, the ANSSI qualified the first sensors to provide network security supervision. Operators of vital importance (OIV), operators of essential services (OSE) and, in general, organizations operating sensitive functions thus have trusted French products at their disposal: Cybels Sensor by Thales and Trackwatch Full Edition by Gatewatcher. Unfortunately, the evaluation methodology for these sensors is not public. The security and network engineers who have to integrate these sensors thus have no guidelines for testing their effectiveness in production. This article provides feedback on the evaluation of probes, particularly from a performance perspective. This aspect is significant because the detection rate of a probe decreases if it is overwhelmed, even if it is equipped with the best signatures and analysis engines.
1. On the concept of budget
The various analysis engines that make up a probe consume machine resources. These are in finite quantity and depend on the hardware on which the detection and analysis mechanisms are installed. These resources include computing capacity (CPU, FPGA, ASIC) and memory (RAM, working memory of peripherals, etc.).
It is usually possible to make some computation/memory trade-offs to optimize processing. However, these trade-offs are ultimately bounded by the time available to process each incoming packet. The non-adjustable variable is, in fact, the traffic entering the probe. This throughput is expressed in two units: the number of packets per second, and the number of bytes per second.
The number of packets per second is significant, as it defines the time budget available for packet processing. One million packets per second means that the probe has, on average, one millionth of a second to process each packet. Furthermore, the number of packets per second influences the number of interrupts (hardware or software) that a CPU has to handle. Just as an engineer is less productive when constantly interrupted, a CPU loses time with each interrupt. As a result, like many pieces of network equipment, it is easy to drown a probe under a flood of packets that nevertheless represents extremely low byte-per-second traffic.
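This time-budget arithmetic can be sketched in a few lines of Python. The 10 Gb/s link speed and the 20 bytes of per-frame overhead (preamble plus inter-frame gap) are illustrative assumptions, not vendor figures:

```python
# Rough per-packet time budget on a saturated link.
# Figures are illustrative, not vendor specifications.

def packet_rate(link_bps: int, frame_bytes: int) -> float:
    """Packets per second on a link, counting 20 bytes of Ethernet
    overhead (preamble + inter-frame gap) per frame."""
    return link_bps / ((frame_bytes + 20) * 8)

def budget_ns(pps: float) -> float:
    """Average time budget per packet, in nanoseconds."""
    return 1e9 / pps

# 64-byte frames: the worst case in packets per second
pps_small = packet_rate(10_000_000_000, 64)    # ~14.88 Mpps
# 1514-byte frames: the same byte rate, far fewer packets
pps_large = packet_rate(10_000_000_000, 1514)  # ~0.81 Mpps

print(f"{pps_small:,.0f} pps -> {budget_ns(pps_small):.0f} ns/packet")
print(f"{pps_large:,.0f} pps -> {budget_ns(pps_large):.0f} ns/packet")
```

At the same byte rate, small frames leave the probe roughly twenty times less time per packet than full-size frames, which is exactly why a low byte-per-second flood of tiny packets can drown it.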
However, the number of bytes per second remains significant. It has an impact on packet transfer times between the various components of the probe. This impact is particularly noticeable when packets are copied between hardware or software components.
As soon as a probe exhausts its budget for processing one packet, the budget for subsequent packets may suffer. If it is only a spike in traffic, it is usually possible to absorb it by smoothing the cost over the following milliseconds/seconds. This is because some packets are processed faster than others, such as those that are immediately discarded because they are part of an encrypted stream for which the probe does not have the decryption key. If the spike in activity continues, the probe has no choice but to discard traffic. Where and how traffic is dropped is determined by the location of the bottleneck.
2. Where packet losses occur

Received packets can be dropped by many hardware or software components, as soon as a budget overrun occurs. This can happen as early as the network card, which receives the packets but cannot transfer them to the kernel, or later, somewhere in the kernel or in the detection software itself.
2.1 Losses through the network card
Network cards have a relatively small internal memory; this is a queue. Each packet that enters the card is inserted into this queue. The queue is then consumed by a process that transfers the packets to the server’s main memory. This is done by DMA (Direct Memory Access) writes or variants, such as IOAT/DMA (I/O Acceleration Technology DMA).
A network card can saturate its queue, if it cannot perform DMA writes as fast as packets arrive from the network. Besides possible hardware slowness on the communication buses themselves, the main cause of slowness is the filtering of memory accesses by the IOMMU (I/O Memory Management Unit) [PCIe]. This is a DMA write manager, capable of limiting the memory ranges that a server’s devices can write to, much like a firewall limits access to a network. Its function is crucial for server security, but totally counterproductive if it results in the inability of the probe to fulfill its role.
2.2 Losses in the kernel
When it starts, the detection software calls the kernel to configure a packet acquisition method. Several exist, among which the most popular are certainly AF_PACKET (native on Linux), PF_RING, DPDK, and Netmap.
Typically, the detection software configures the kernel to reserve a memory slot for receiving packets written by DMA.
The larger the memory range, the greater the quantity of packets that can be stored while waiting to be processed. If, however, this range fills up faster than it is consumed by the detection software, then packets are discarded.
The slow consumption of incoming packets is not always due to slow detection software. Indeed, the kernel performs transformations or adds metadata, which can be more or less costly. Losses can therefore occur even before the detection software is notified that packets have arrived! This case can be observed in particular when using particularly expensive XDP (eXpress Data Path) programs, or inefficient algorithms for distributing packets (load balancing) across the various processors.
Another cause of packet loss involving the kernel is file extraction. Detection software can store received packets (PCAP files) or suspicious files transiting the network (HTTP, SMB, SMTP…). If the storage devices are not fast enough, the system calls operating on the files (write(2), sync(2)…) may be slow to return control to the detection software, and thus exceed the budget. As a result, packet loss is possible when large amounts of files need to be extracted for quarantine and subsequent analysis.
2.3 Losses due to detection software
The organization of the components that make up the detection software and the complexity of the analysis rules can have a significant impact on performance, and result in packet loss.
The analysis engines are broken down internally into several sub-tasks:
– acquire new packets to process;
– put packets into context (e.g., “is this part of a TCP session?”, “is this a fragment of a previously seen packet?”);
– perform security analysis of packets or flows, to infer security events or extract suspicious files;
– produce logs recording the events generated.
These subtasks can be handled by one or more processes. Using multiple processes can be advantageous to parallelize processing, and not end up with a work overload that could saturate the processing capabilities of a single process. Unfortunately, using multiple processes is not a panacea either, as these sub-tasks may require shared states. It is then necessary to organize access to these states, through proxies (as Zeek/Bro does) or locks (futex/mutex, as Suricata does).
A fairly extensive performance analysis of the Suricata detection software has been conducted [SuriLock]; it shows that locks are one of the main bottlenecks that can cause packet loss. In this analysis, it is stated that Suricata performs a lot of concurrent access to TCP session states.
Thus, network traffic with a large number of simultaneous TCP sessions, or many new connections per second, can slow Suricata down. Indeed, all processes attempting to place a packet in its context will be frozen while waiting for the release of the locks protecting the shared states.
In addition, locks aside, a large number of TCP sessions can saturate the hash tables in which states are stored. Yet the algorithmic complexity of accessing the elements of a hash table degrades from O(1) (constant time) to O(n) (linear time, where n grows with the number of collisions in the table) as it saturates. The result is CPU overhead, and thus an overrun of the time budget for processing a packet.
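This degradation is easy to reproduce with a toy fixed-size flow table; the bucket and flow counts below are arbitrary illustrative values:

```python
# Toy fixed-size flow table: once the number of tracked flows greatly
# exceeds the number of buckets, a lookup must walk a long collision
# chain, and its cost is no longer constant.

class FlowTable:
    def __init__(self, n_buckets: int):
        self.buckets = [[] for _ in range(n_buckets)]

    def insert(self, flow_id: int) -> None:
        self.buckets[flow_id % len(self.buckets)].append(flow_id)

    def lookup_cost(self, flow_id: int) -> int:
        """Number of chain entries compared before finding flow_id."""
        bucket = self.buckets[flow_id % len(self.buckets)]
        return bucket.index(flow_id) + 1

table = FlowTable(n_buckets=1024)
for flow_id in range(100_000):        # ~100 flows per bucket on average
    table.insert(flow_id)

# Average comparisons per lookup, sampled over the stored flows:
avg = sum(table.lookup_cost(f) for f in range(0, 100_000, 1000)) / 100
print(f"~{avg:.0f} comparisons per lookup instead of 1")
```

With only as many flows as buckets, the same sampling stays close to one comparison per lookup; every excess flow paid for here is CPU time stolen from the per-packet budget.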
The software architecture of the analysis engines can also influence performance if the packets received are not properly distributed among the different processes in charge of processing them. Packet loss occurs when a CPU is saturated/flooded by the processing that an analysis process must undertake. This situation occurs very easily when the distribution of packets received is not random, but tends to concentrate all the packets relating to the same flow (e.g. TCP sessions) on the same analysis process. This per-flow distribution is the most common and the preferred one, as it limits access to shared resources and increases the locality of memory accesses [SEPTUN]. The problem is that this method does not correctly distribute traffic containing tunnels (IPsec, GRE, L2TP, TLS…). Indeed, unless the dispatching program performs deep packet inspection (DPI) of the traffic (which is not even always possible, e.g., with fragmented packets), all packets in a tunnel will be analyzed by the same analysis process! If the tunnel is very active, that analysis process will easily be overloaded, and packets will start to be lost.
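The tunnel problem can be illustrated with a toy per-flow dispatcher. The worker count and addresses are arbitrary, and real dispatchers hash the full 5-tuple, often in hardware; this sketch only models the outer-address pinning:

```python
from collections import Counter

# Toy per-flow dispatcher: a symmetric hash of the outer IP pair pins
# every packet of a flow to one worker. All packets of a tunnel share
# the same outer addresses, so the entire tunnel lands on a single
# worker, no matter how many inner flows it carries.

N_WORKERS = 8

def worker_for(src: str, dst: str) -> int:
    # frozenset makes the hash symmetric: requests and replies of the
    # same flow reach the same worker
    return hash(frozenset((src, dst))) % N_WORKERS

# 1000 ordinary flows spread across the workers...
plain = Counter(worker_for(f"10.0.{i % 256}.{i // 256}", "192.0.2.1")
                for i in range(1000))
# ...but 1000 packets of one GRE tunnel all hit the same worker.
tunnel = Counter(worker_for("198.51.100.1", "198.51.100.2")
                 for _ in range(1000))

print("plain flows per worker:   ", dict(plain))
print("tunnel packets per worker:", dict(tunnel))
```

The symmetric hash is what production dispatchers need; the concentration of the tunnel on one worker is the unavoidable side effect described above.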
Finally, the analysis engines (e.g., dissectors, content analyzers, analysis plugins) can be sources of performance issues, resulting in packet loss. Cloudflare’s application firewall caused their entire service to be unavailable in July 2019 [CloudflareWAF], due to a CPU-intensive regular expression. Similarly, a poorly designed detection rule will be able to significantly slow down packet analysis, and over-consume the budget.
In the author's experience, the problem is even worse with plugins written in Lua. Lua uses only co-routines to simulate parallelism. As a result, if a Lua instruction blocks the Lua interpreter, without yielding control back to it so that another co-routine can run in the meantime, all packet analysis processes that use a Lua plugin will freeze [LuaLock].
3. Biases introduced by evaluation methodologies
The conditions discussed in the previous section of this article can occur when capturing traffic on production networks. However, since probes are components that are typically connected to sensitive networks, many operators prefer to evaluate these devices by emulating such networks before selecting one and connecting it to their production. Similarly, probe manufacturers need to evaluate their product to ensure its performance.
However, emulating a network is not an easy task, and many biases can be introduced, positively or negatively distorting the perception of the real performance of the probes! This section details some common mistakes that the author of this article has observed or made in the course of his professional activity.
3.1 Tools for emulating networks
In addition to commercial platforms, whose effectiveness and relevance could certainly be the subject of formal studies, several free tools exist. Among these, it is worth mentioning:
– tcpdump, which allows traffic capture, filtering and storage in PCAP files [tcpdump];
– tcpreplay, which allows replaying PCAPs at varying speeds, and even editing them [tcpreplay];
– TRex, from Cisco, which is a complete traffic replay platform [TRex];
– scapy, a Python library dedicated to network packet manipulation. It can capture traffic, analyze, filter and edit it, before saving it to PCAP files or sending the packets back onto the network [Scapy];
– tc qdisc, a set of Linux tools allowing, in particular, the emulation of certain network conditions, such as rate limiting with the tbf module, or the creation of latency or instabilities (packet loss, duplication…) with the netem module [tc].
tcpdump is used to capture and record traffic on moderately active networks. When networks are too fast, tcpdump may not be able to collect all packets received; the recorded network flows are then corrupted. Regardless of the technology used for traffic replay, it is crucial that the network captures used are representative of the situation you are trying to emulate. In addition, it is normal to want the traffic sent to the probe to contain packet loss, duplicates and reordering. However, these should be desired and emulated on purpose, rather than the result of chance and an inadequate capture methodology.
It should also be noted that capturing traffic from a real network can generate unrepresentative PCAP files, despite a legitimate data source. Indeed, it is absolutely critical to clean up these captures, as they contain half-streams [CleanPCAP]. These half-streams are flows that started before the capture began or will end after the capture is complete.
Half-streams started before the capture begins are problematic if the probe is configured to ignore such flows. By replaying these half-streams, the novice evaluator may get the impression that a large amount of traffic is being sent to the probe and that the probe is behaving perfectly, without losing a single packet. In reality, the probe will not analyze the received packets at all and will silently drop them, without raising any alarm, even in the presence of malicious traffic.
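The cleaning logic can be sketched on pre-parsed packets; a real implementation would operate on PCAP files with scapy or the [CleanPCAP] script, and the flow identifiers below are simplified placeholders:

```python
# Minimal sketch of half-stream cleaning: a flow is kept only if the
# capture contains both its opening (SYN) and its closing (FIN or RST).

SYN, FIN, RST = 0x02, 0x01, 0x04      # standard TCP flag bits

def complete_flows(packets):
    """packets: iterable of (flow_id, tcp_flags) tuples. Returns the
    set of flow_ids seen opening AND closing inside the capture."""
    opened, closed = set(), set()
    for flow, flags in packets:
        if flags & SYN:
            opened.add(flow)
        if flags & (FIN | RST):
            closed.add(flow)
    return opened & closed

capture = [
    ("A", SYN), ("A", 0x18), ("A", FIN),   # complete session
    ("B", 0x18), ("B", FIN),               # started before the capture
    ("C", SYN), ("C", 0x18),               # still open at capture end
]
print(complete_flows(capture))  # {'A'}
```

Flows B and C are the two kinds of half-streams discussed here: only A survives the cleaning.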
Half-streams that do not terminate during the capture are also problematic if the capture is relatively short and replayed in a loop, e.g., with tcpreplay. Indeed, since these streams never end, each iteration creates new states that the probe must maintain, which can cause an explosion in memory cost and a saturation of various internal hash tables. Detection software is, however, optimized not to collapse in the face of SYN flood DDoS, and an explosion in the number of flows falls within that same spectrum of logical attacks. The probes then have no choice but to discard flows arbitrarily, including potentially malicious ones, to avoid crashing. To do this, aggressive timeouts are used.
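A toy state table with an aggressive idle timeout shows the eviction mechanism; the 30-second timeout and flow names are arbitrary examples:

```python
# Toy state table sketching how a probe sheds flows that never close,
# such as half-streams replayed in a loop.

def evict_idle(flows: dict, now: float, timeout: float = 30.0) -> int:
    """flows maps flow_id -> last_seen timestamp. Drops entries idle
    longer than `timeout`; returns how many flows were evicted."""
    stale = [f for f, last_seen in flows.items() if now - last_seen > timeout]
    for f in stale:
        del flows[f]                  # state discarded, alerts and all
    return len(stale)

flows = {"loop-1": 0.0, "loop-2": 5.0, "active": 95.0}
evicted = evict_idle(flows, now=100.0)
print(evicted, sorted(flows))  # 2 ['active']
```

The eviction is blind: a half-stream and a slow, genuinely malicious session look the same to this timeout.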
tcpreplay can replay PCAPs containing any kind of stream and can be an effective tool for evaluating a probe. However, biases can be introduced when the PCAP is replayed in a loop using the --loop option.
The first bias observed by the author of this article occurs if the period of the loop iterations is shorter than the TIME_WAIT timeout configured in the probe. Strictly speaking, TIME_WAIT is a state of the TCP state machine that is independent of received packets, and evolves only after a certain timeout. Its purpose is to prevent a host from believing that a new TCP connection is being established because of duplicate network packets finding their way in after the original session has closed. Since probes must emulate the TCP stacks of the servers they protect, they must apply a grace period for the TIME_WAIT state representative of those servers. However, if tcpreplay loops the same PCAP “too fast” (relative to the grace period), then the traffic will be partially ignored by the probe, just as it would have been by the server to which the original traffic was addressed, according to the TCP protocol specifications! The result is a probe that looks like it is processing a lot of packets, when in fact it is ignoring them.
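A back-of-the-envelope model shows how much replayed traffic the probe ignores when the loop period is shorter than its post-close grace period; the 10-second PCAP and 60-second grace period are illustrative values:

```python
# Iteration i of a looped PCAP replays the same 4-tuple at
# t = i * pcap_seconds. The probe only accepts a new session once the
# grace period since the previous accepted session has elapsed; every
# other iteration is silently ignored.

def ignored_iterations(pcap_seconds: float, grace_seconds: float,
                       loops: int) -> int:
    ignored, last_accepted = 0, 0.0
    for i in range(1, loops):
        t = i * pcap_seconds
        if t - last_accepted < grace_seconds:
            ignored += 1              # still inside the grace period
        else:
            last_accepted = t         # session accepted, then closed
    return ignored

# A 10 s capture looped for two minutes against a 60 s grace period:
print(ignored_iterations(10, 60, 13), "of 12 replays ignored")
```

In this configuration, five out of every six replays never reach the analysis engines, even though the link looks busy.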
To avoid the previous bias, it is possible to use the --unique-ip option of tcpreplay, which varies the IP addresses at each new iteration of a PCAP. Unfortunately, this option leads to the second bias!
This second bias is a funny and unlikely coincidence. There is, in fact, an interaction between the algorithm used by --unique-ip and some of the packet hashing methods used to distribute packets among the analysis processes!
Packet hashing methods used to distribute flows require a rather unusual property: both the requests and the responses of a given flow must hash to the same analysis process. It is therefore necessary to use a so-called symmetric algorithm, which hashes the IP addresses identically even when the source and destination addresses are swapped. However, some allocation mechanisms, such as the PF_RING packet capture method, use a simple modular addition of the IP address bytes. Thus, a packet going from 192.168.0.1 to 192.168.0.2 will give a hash equal to 192+168+0+1+192+168+0+2 = 723 modulo N.
The implementation of --unique-ip in tcpreplay simply subtracts the number of PCAP iterations from one IP address and adds the same number to the other IP address. This is a zero-sum operation with respect to the probe's additive allocation algorithm, which will therefore always send the packets to the same analysis processes. If the PCAP is short enough, some of the probe's analysis processes will be artificially flooded, while all the others receive almost no flows!
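The interaction can be checked numerically. This is a simplified model: the additive hash mirrors the byte-sum scheme described above, byte carries are ignored by keeping the iteration count small, and the worker count is arbitrary:

```python
# --unique-ip adds the iteration count to one IP address and subtracts
# it from the other. A hash that merely sums address bytes is blind to
# this zero-sum rewrite: every iteration lands on the same worker.

N_WORKERS = 8

def additive_hash(src: int, dst: int) -> int:
    """Byte-sum of both 32-bit addresses, modulo the worker count."""
    def byte_sum(addr: int) -> int:
        return sum((addr >> shift) & 0xFF for shift in (0, 8, 16, 24))
    return (byte_sum(src) + byte_sum(dst)) % N_WORKERS

src, dst = 0xC0A80001, 0xC0A800C8     # 192.168.0.1 -> 192.168.0.200
workers = {additive_hash(src + i, dst - i) for i in range(100)}
print("workers hit over 100 --unique-ip iterations:", workers)
```

One hundred iterations, one worker: the short-PCAP flooding effect described above follows directly.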
TRex is a platform developed by Cisco for replaying a collection of PCAPs, each of which must contain only one flow. The platform's configuration then makes it possible to specify the relative frequency of each PCAP compared to the others. It is thus possible to send more or fewer flows, with the quantity and nature of the flows sent being fully controlled.
TRex avoids the biases introduced by tcpreplay by randomly varying the IP addresses each time a stream is resent. In addition, for bidirectional streams, it sends the packets from an address A to an address B on one network interface, and the return packets from B to A on another network interface. The result is a more realistic acquisition setup for the probe, as it is closer to the way flows are acquired on an optical fiber.
The only downside to TRex is its complexity; it is easy to send traffic to a probe that exceeds one of its specifications, either in number of packets per second, new streams per second, total number of streams, files to be extracted, or other.
Scapy is probably an indispensable tool for probe evaluators. It allows them to rework a PCAP file, including cleaning up half-streams, altering streams to duplicate or remove packets, and deliberately corrupting checksums. Its only real flaw is its relative slowness, mainly due to its packet abstraction model, and the Python language. This makes it impractical to operate on multi-gigabyte PCAPs.
tc qdisc (traffic control — queue discipline) is a framework based on the Linux kernel and userland tools to alter the flow of packets on specific network interfaces. This tool is particularly useful when an evaluator is looking to create PCAPs for TRex. Indeed, it is possible to create a controlled environment, for example with a pair of virtual interfaces (veth). One of the interfaces then runs the server and the other hosts the client. tc allows this test environment to be altered to introduce deliberate disturbances (e.g. delays, losses, reduced throughput, etc.).
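A hypothetical lab setup along these lines could look as follows. Interface names, rates and netem figures are arbitrary examples; the commands assume Linux and root privileges:

```shell
# Create a pair of virtual interfaces: the client runs on veth-cli,
# the server on veth-srv
ip link add veth-cli type veth peer name veth-srv
ip link set veth-cli up
ip link set veth-srv up

# Limit the client side to 100 Mb/s with a token bucket filter...
tc qdisc add dev veth-cli root handle 1: tbf rate 100mbit burst 32kbit latency 50ms
# ...and chain netem behind it: 20 ms +/- 5 ms of delay, 0.3% packet
# loss, 1% duplicated packets
tc qdisc add dev veth-cli parent 1:1 handle 10: netem delay 20ms 5ms loss 0.3% duplicate 1%

# Record the disturbed traffic for later replay, e.g. with TRex
tcpdump -i veth-srv -w disturbed-flow.pcap
```

The disturbances are thus deliberate and reproducible, rather than an accident of the capture methodology.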
3.2 Biases introduced by the injection of non-representative flows
The behavior of a probe can vary significantly depending on the traffic received. Its detection software must be able to adapt to any type of traffic likely to reach the equipment it protects. However, processing a large number of small packets and, conversely, processing jumbo frames (frames whose size exceeds the traditional 1514 bytes) call for contradictory memory allocation strategies, if only to store the packets while they are being processed. Similarly, many separate flows are not managed in the same way as a few well-known, but massive, flows.
However, a probe, by default, must be able to handle all these situations. Its configuration must therefore be generic enough to give the software the flexibility to handle them. The result is a potentially inadequate allocation of resources for extreme cases, such as processing very high-speed network traffic. Worse, these generic configurations tend to “overbook” the available resources, on the optimistic assumption that several extreme situations will not occur simultaneously. Since the machine's resources are finite, this can result in a denial of service of the detection software (e.g., triggering the Linux OOM killer, which kills memory-intensive processes).
Finally, the nature of the injected streams can also influence the behavior of the probe. A large quantity of encrypted flows is, in general, easy to manage: the probe does not have access to the decryption keys, so these flows can usually be ignored in favour of analysing plaintext packets. On the other hand, a large number of distinct flows (e.g., TCP sessions or UDP request/response exchanges…) can cause a state explosion in the probe, as detailed earlier in this article.
In the course of this article, we have discussed various:
– bottlenecks in a probe;
– critical resources;
– sources of slowness;
– biases that may be unintentionally introduced during an evaluation.
Unless you have developed significant expertise in the integration of this type of equipment, it is therefore necessary to refer to deployment guides and proven and generic testing methodologies. Unfortunately, such documents do not yet exist, and engineers must systematically reinvent the wheel, even if it means producing square ones from time to time.
In the world of industrial systems, often regarded by the IT security community as the ugly duckling, this problem has already been solved! The ANSI/ISA-62443 standard and the ISASecure [ISASec] program specify a very precise methodology for the evaluation of industrial systems, as well as the equipment that can automate these tests. The requirements are therefore clearly established, the test methodologies documented, and the evaluation products certified. These tests cover compliance, robustness to abnormal traffic, and robustness under load (scalability).
These elements are sorely lacking in the world of network detection probes, whose behavior can vary significantly, as described in this article, depending on the traffic received, the nature of its flows, and their intensity.
I would like to thank my reviewers: Erwan Abgrall, Baptiste Bone, Piotr Chmielnicky, Sebastien Larinier, as well as those who wished to remain anonymous, and the Gatewatcher employees. The opinions expressed in this article are the author's own.
[CleanPCAP] Script to clean the half streams: https://frama.link/RfNczV0d
[CloudflareWAF] Cloudflare incident report involving their web application firewall: https://blog.cloudflare.com/cloudflare-outage
[ISASec] ISA Secure certification site: https://www.isasecure.org/en-US/
[LuaLock] Examples breaking the Lua collaborative thread model: https://stackoverflow.com/a/18964444
[PCIe] Neugebauer et al., “Understanding PCIe performance for end host networking”, August 2018
[Scapy] Scapy tool repository: https://github.com/secdev/scapy
[SEPTUN] Document on Suricata’s acquisition performance improvement: https://github.com/pevma/SEPTun
[SuriLock] Study on the incidence of locks in Suricata: https://xbu.me/article/performance-characterization-of-suricata-thread-models/
[tc] Documentation on network incident emulation with Traffic Control: https://wiki.linuxfoundation.org/networking/netem#packet_duplication
[tcpdump] Tcpdump tool site: https://www.tcpdump.org/
[tcpreplay] Tcpreplay tool website: https://tcpreplay.appneta.com/
[TRex] TRex tool site: https://trex-tgn.cisco.com/