Our RASP product
At Imperva our team builds a product called RASP, which stands for Runtime Application Self-Protection. As the name indicates, it is a security product that plugs directly into the runtime of an application in order to provide a set of capabilities similar and complementary to a WAF1. As of the RASP 4.0 release, our architecture consists of a core analysis engine written in Rust and plugins which interface with the application runtime (think Node.js, Python, etc.).
Why logging matters
One important responsibility of the core engine is providing reports for our customers to consume. For instance, if a client’s application is under attack and RASP is rejecting a large number of malicious requests, instead of paging ops, the dashboard will simply reflect this information along with the specific attack vectors. We consider this observability a critical feature and have built a robust and flexible system around it. In order to understand this system more deeply, and to discover strategies for increasing its performance, reliability, and efficiency, we did a deep dive on logging and are presenting our results here.
Logging, in some ways, has a strange set of design parameters:
- There are no hard deadlines for logs to appear.
- Logging is generally a non-critical part of the application.
- Small quantities of data are processed with relatively high per-item overhead.
- There is a diverse set of possible outputs.
In order to explore solutions to this problem space, we created a library covering various logging scenarios and included benchmarks. You can find the source on our GitHub.
The first step in writing structured log data is serialization. This means taking an in-memory representation and translating it into a byte-string which can be sent to Input/Output (I/O) facilities. We’ve designed some benchmarks to understand our performance costs for various sizes of log data. Additionally, by exploring the cost of serialization separately we can establish a performance baseline independent of I/O.
We start with our control benchmarks, which show that a disabled logger and a logger that constructs a message from borrowed data (and does not serialize it) are equally performant from 1kb to 500kb of randomly generated characters. (Likely the compiler has optimized out the useless instructions we gave it.) Consider the message below that these tests construct. If you’re unfamiliar with Rust, simply put:
- `&` indicates “borrowed”, or uncopied, data, and
- `'a` is a lifetime specifier for said references, telling the compiler how long the borrowed data must remain valid relative to the struct that holds it.
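A minimal sketch of such a borrowed message (the field names here are illustrative, not the exact struct from our benchmarks):

```rust
// A log message that borrows its payload rather than copying it.
// The lifetime 'a ties the struct to the data it references.
#[derive(Debug)]
struct LogMessage<'a> {
    level: &'a str,
    body: &'a str,
}

fn main() {
    let payload = String::from("user login accepted");
    // Constructing the message copies nothing; both fields are references.
    let msg = LogMessage { level: "INFO", body: &payload };
    println!("{:?}", msg);
}
```

Because the struct holds only references, constructing it is essentially free, which is consistent with the control benchmarks above.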
The above log message “plus” includes thread information and the current time, which are often included with log messages in the real world. We see a substantial decrease in performance here; however, as we’ll see below, this is orders of magnitude less costly than actually serializing the data. If you’re thinking this must be due to a syscall for the current time, as I did, think again: unix provides vdso(7)10 11, which lets the current time be read without a full context switch. Retrieving the thread id with gettid(2), however, is a genuine syscall12 13.
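Sketched with the standard library, a “plus” message might capture its metadata like this (names are illustrative; note that Rust’s own `ThreadId` is a cheap process-internal identifier, whereas fetching the OS thread id via gettid(2) would cost a real syscall):

```rust
use std::thread;
use std::time::SystemTime;

// A "plus" message owns its metadata but still borrows the payload.
struct LogMessagePlus<'a> {
    body: &'a str,
    // On Linux, SystemTime::now() uses clock_gettime, typically serviced
    // by the vDSO without a context switch.
    timestamp: SystemTime,
    // Rust's ThreadId is process-internal and cheap to read; it is not
    // the OS tid that gettid(2) would return.
    thread_id: thread::ThreadId,
}

fn make_plus<'a>(body: &'a str) -> LogMessagePlus<'a> {
    LogMessagePlus {
        body,
        timestamp: SystemTime::now(),
        thread_id: thread::current().id(),
    }
}

fn main() {
    let m = make_plus("request accepted");
    println!("{} at {:?} on {:?}", m.body, m.timestamp, m.thread_id);
}
```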
Contrived and real-world log serialization
Once we actually start serializing, we see several orders of magnitude difference in the time per operation. We also see a roughly linear increase in time as message size grows.
As stated above, the following “plus” message includes time and thread information, but its impact on performance is minimal.
Finally, out of curiosity, we tested Rust’s built-in debug serialization. Hilariously, it does much better than the common serde serialization. Our guess is that the cost can be attributed either to overhead from serde being a very general serialization framework, or to a penalty for JSON serialization vs. debug serialization.
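The two paths can be sketched as follows; to keep the example dependency-free, a hand-rolled JSON encoder stands in for serde_json:

```rust
#[derive(Debug)]
struct Event<'a> {
    kind: &'a str,
    size: usize,
}

// Debug "serialization": a single formatting pass driven by the derived impl.
fn to_debug(e: &Event) -> String {
    format!("{:?}", e)
}

// A hand-rolled JSON encoding, standing in for serde_json::to_string
// (serde is omitted here so the sketch has no external dependencies).
fn to_json(e: &Event) -> String {
    format!("{{\"kind\":\"{}\",\"size\":{}}}", e.kind, e.size)
}

fn main() {
    let e = Event { kind: "attack", size: 42 };
    println!("{}", to_debug(&e)); // Event { kind: "attack", size: 42 }
    println!("{}", to_json(&e));  // {"kind":"attack","size":42}
}
```

Both produce a byte-string ready for I/O; the difference the benchmarks measure is the cost of the machinery behind each formatting pass.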
With these baselines established we can move on to the holistic view of logging.
I/O is generally thought of as the more expensive operation involved in logging. In this section we’ll consider the possible outputs for our log data and the consequences of using each of them. Finally, we’ll test our theory with benchmarks and discuss results.
Background and underpinnings
Let’s start at the bottom where we hand data to the kernel to be written.
write(2)4,5 is the system call that allows us to actually perform I/O with our log messages. Since write(2) is a syscall, it passes control to the kernel (in effect blocking the current thread) until its operation is complete. Unfortunately, the time required for that operation varies depending on the stream type. Below are some possibilities. In considering them, we’ll make some assumptions2.
sockets3 depend on a reader being present and consuming data on the other side in order to avoid blocking, so we’ll set them aside for the moment.
The standard streams (stdout and stderr) are excellent for debugging purposes and are commonly used to collect primary application logs from Docker containers. Since the standard streams are in-memory outputs, write operations happen immediately and are blocking. Typically language runtimes (libc and Rust, for example) provide wrappers around stdout which buffer line-wise when connected to a TTY; stderr, however, is always un-buffered. If you’ve ever printed a lot of output to the console while your program is running, you’ll recall that it slows things down substantially.
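A small illustration of the two streams as Rust exposes them (Rust’s stdout handle goes through a line-buffering writer, while stderr writes pass straight through to the kernel):

```rust
use std::io::{self, Write};

fn demo() -> io::Result<()> {
    // stdout: Rust wraps it in a line-buffering writer, so output is
    // batched in userspace until a newline (or flush).
    let stdout = io::stdout();
    let mut out = stdout.lock();
    writeln!(out, "goes through a userspace buffer")?;

    // stderr: never buffered; every write call reaches write(2) immediately.
    let stderr = io::stderr();
    let mut err = stderr.lock();
    writeln!(err, "hits the kernel on every call")?;
    Ok(())
}

fn main() -> io::Result<()> {
    demo()
}
```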
file output is the most attractive: durability, and the non-reliance on a third-party process to keep the stream moving, make it the object of our interest. When write(2) is called on a regular file, some of the data in the given buffer is copied to the page cache and enqueued to be written. In practice this does not concern us, as we are guaranteed to be able to read that data immediately from any other process; the fact that the data is not actually on disk is only a very temporary durability concern. From the write(2) man page:

"A successful return from `write()` does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data."
Eventually this page writeback is handled by a gang of kernel threads6. For reference, here are the results of a benchmark similar to the control group, but with naked writes to stderr and to a regular file.

Regular file output:
Stderr redirected to a file:
We see that when stderr is redirected to a file, its performance is nearly identical to that of the stream it is redirected to. Now that we have an understanding of what happens after we pass control to the kernel, and a baseline for performance, we can start to consider the options available to us.
Userspace performance enhancements
The most typical way to speed up syscalls is to avoid making them, or more precisely, to make them less often. We can accomplish this by buffering data in userspace, thereby amortizing the cost of the syscall and corresponding context switch over many calls. In the following results we consider line-wise and fixed-size buffering for a regular file output.
Fixed size buffer:
As expected, line-buffered output is worse, since we’re already writing one line at a time and now need to copy that data around twice. Fixed-size buffering, however, looks promising, which makes sense since fixed-size buffered writes are standard practice when writing files.
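A fixed-size buffering sketch using the standard library’s BufWriter (the path and buffer size below are arbitrary choices for illustration):

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

// Write `n` log lines through a fixed-size userspace buffer. write(2) fires
// only when the 8 KiB buffer fills (or on flush), amortizing the syscall
// and context-switch cost over many log lines.
fn write_buffered(path: &Path, n: usize) -> io::Result<()> {
    let file = File::create(path)?;
    let mut writer = BufWriter::with_capacity(8 * 1024, file);
    for i in 0..n {
        writeln!(writer, "log line {i}")?;
    }
    writer.flush() // push any remainder before dropping
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("rasp_buffered_demo.log");
    write_buffered(&path, 1000)
}
```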
The next possibility to consider, when you have access to multiple threads, is shipping data to a worker thread to be processed off the critical path. The rub with Rust is that, since there is no garbage collector, the lifetime of data must be known at compile time. This means that when writing a log message that contains borrowed data, as most should, that data cannot be passed between threads safely, since the worker thread could hold the reference for any amount of time. If we controlled the full pipeline we could obviously wrap borrowed data in an Arc, but alas we do not. Therefore, in order to pass log data to a worker thread, it must be fully owned. In practice this could mean performing serialization up front or simply copying borrowed data into a new struct. Below we’ve built some benchmarks which explore a very simple queuing scheme: a single worker thread connected to a concurrent queue, in this case an mpsc channel.
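A minimal sketch of such a scheme, using a bounded sync_channel as the concurrent queue (names and capacity are illustrative, not our production code):

```rust
use std::sync::mpsc;
use std::thread;

// An owned record: borrowed message data is copied in so it can cross threads.
struct Record {
    body: String,
}

// Ship `n` records through a bounded channel to a single worker thread,
// which serializes (and in a real logger would write) them off the hot
// path. Returns how many records the worker processed.
fn run_pipeline(n: usize) -> usize {
    // Bounded capacity applies backpressure if the worker falls behind.
    let (tx, rx) = mpsc::sync_channel::<Record>(1024);

    let worker = thread::spawn(move || {
        let mut processed = 0;
        for rec in rx {
            // Serialization and file I/O would happen here, off-thread.
            let _line = format!("{{\"body\":\"{}\"}}", rec.body);
            processed += 1;
        }
        processed
    });

    for i in 0..n {
        let borrowed: &str = "request rejected"; // imagine borrowed request data
        // Copy the borrowed data into an owned Record before sending.
        tx.send(Record { body: format!("{borrowed} #{i}") }).unwrap();
    }
    drop(tx); // closing the channel ends the worker's loop
    worker.join().unwrap()
}

fn main() {
    assert_eq!(run_pipeline(100), 100);
}
```

Copying into an owned struct on the hot path, then serializing on the worker, is the variant the benchmarks below call “queued copied data”.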
Queued pre-serialized data, buffered file output:
Queued copied data, buffered file output:
Bounded queued copied data, buffered file output:
These results indicate that copied data is universally preferable to pre-serialized data when queued for writing off-thread. As for bounded vs. unbounded queues: in our experience, queues with unbounded size tend to slow things down substantially and provide unreliable performance when allowed to grow continuously, and the results above show exactly that.
The two most promising solutions available to us are the standard synchronous buffered writer and the bounded queue of copied data which is serialized and written on a worker thread. It’s clear from the chart below that between 10k and 100k, there is an inflection point where the best strategy changes.
Considering the fine-grained view of this data below, it’s clear that for log messages of roughly 10k characters and below, we’re better off queuing logs for off-thread writes. For very large log writes, 50k characters and above, a standard buffered writer is ideal.
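Put together, a logger could pick its strategy from message size; the cutoff below is illustrative only, drawn loosely from the inflection point in our results:

```rust
// Illustrative cutoff only; the real inflection point sits somewhere
// between 10k and 100k characters in our benchmarks.
const CUTOFF_BYTES: usize = 10 * 1024;

#[derive(Debug, PartialEq)]
enum WriteStrategy {
    QueueToWorker, // small messages: copy + bounded queue + worker thread
    BufferedSync,  // very large messages: synchronous buffered writer
}

fn choose_strategy(message_len: usize) -> WriteStrategy {
    if message_len <= CUTOFF_BYTES {
        WriteStrategy::QueueToWorker
    } else {
        WriteStrategy::BufferedSync
    }
}

fn main() {
    assert_eq!(choose_strategy(1_024), WriteStrategy::QueueToWorker);
    assert_eq!(choose_strategy(64 * 1_024), WriteStrategy::BufferedSync);
}
```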
Hopefully you enjoyed this deep dive into application logging and it answered some questions you didn’t even know you had.
- Web Application Firewall: https://www.imperva.com/products/web-application-firewall-waf/
- Link to the Linux man page
- Link to Apple’s man page
- Link to the Linux man page
- Link to Apple’s man page
- Let’s assume we’re on unix, I/O is unsaturated, and fds are not in async mode.
- UDP sockets are actually quite common as sinks for metrics data, but metrics have a slightly different design goal in that they do not necessarily need to be durable.
- Link to Apple’s man page
- Link to the Linux man page
- Page Cache and Page Writeback: http://sylab-srv.cs.fiu.edu/lib/exe/fetch.php?media=paperclub:lkd3ch16.pdf