Scaling Data by 240x: How We Wrote the New Time Series Data Storage Architecture for Incapsula

Data scaling is always a challenge, and an even harder one if you want near-real-time processing and storage. Recently, we set out to scale up our relatively low-resolution time series data with the goal of storing up to 240x more data per day.

To meet these challenges, we decided to reinvent our time series data model.

So Why Did We Need to Change Our Existing Model?

We introduced our new Infrastructure Protection dashboard that displayed high-resolution data points at intervals of 15 seconds. Prior to the new dashboard, all time series data was stored at lower resolutions, such as 5-, 10- and 60-minute buckets, as described in our post Optimizing Our Data Flow.

Scaling our data by a factor of 240 presented several challenges and opportunities to design a new time series data model. The trick was to create a model that would best fit both our current and future needs.

To display near-real-time dashboards, our in-house time series database processes and stores data from hundreds of our edge servers in a matter of minutes. Fast processing and data fetching meant that we needed a data storage architecture that was as lean as possible: memory efficient for the processing servers and light in size for data fetching.

Our old data model was built to resolve very specific use cases. For example, the scheme for storing a site’s bandwidth data looked like this:

message LowResBucket {
        required uint32 secondsInDay
        required uint64 requestSize
        optional uint64 responseSize
        …
        message HighResBucket {
            required uint32 secondsInDay
            required uint64 requestSize
            optional uint64 responseSize
            …
        }
}

This example describes a low-resolution bucket of 10 minutes of data, which holds an inner 5-minute bucket (HighResBucket).

As you can see, this scheme is data specific. We have dedicated code that manages it, creating the desired number of buckets for each resolution and storing the relevant data. In addition, all the data that is stored in the 5-minute buckets is also stored in the low-resolution 10-minute bucket. This means that we store an additional 50 percent of redundant data, since the lower resolution can always be calculated from the higher-resolution data, as the sketch below illustrates.
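
Here is a minimal sketch of that roll-up, assuming plain Java arrays instead of our actual generated Protobuf classes; it only shows why the duplicated low-resolution values are redundant, since they can be recomputed by summing the high-resolution ones.

	// Illustrative only: a 10-minute total can be derived by summing its 5-minute buckets,
	// so persisting it alongside the high-resolution values duplicates the data.
	final class RollupSketch {
	    static long lowResRequestSize(long[] fiveMinuteRequestSizes) {
	        long total = 0;
	        for (long requestSize : fiveMinuteRequestSizes) {
	            total += requestSize;
	        }
	        return total;
	    }

	    public static void main(String[] args) {
	        long[] fiveMinuteRequestSizes = { 1_200_000L, 950_000L };       // two 5-minute buckets
	        System.out.println(lowResRequestSize(fiveMinuteRequestSizes));  // 2150000 -> the 10-minute value
	    }
	}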

Another pitfall of this scheme is that if we only want to update the high-resolution data, we still need to fetch and load the entire low-resolution object into memory.

You can see the many feature-driven changes made over the years. This is a normal process in the lifecycle of a fast-growing technology company, but in the long run it produced a complex, unscalable model that consumes memory and storage and is hard to maintain and work with.

How Do We Design a Better Data Model?

When we approached the design of storing large amounts of time series data, we thought about two data structures: dense time series and sparse time series.

The dense time series model is used to store numeric data. The assumption behind this model is that the stored data is continuous and that the number of filled buckets is close to the maximum number of buckets. That is why this scheme and its implementation create all of the buckets in advance, when the series is created.

Such data includes metrics like request rate, bandwidth, cache ratio and other variables.

To meet our dense time series goals, we came up with the following Google Protobuf scheme:

message DenseNumericTimeSeries {
	optional TimeRange timerange
	optional sint64 intervalMillis
	optional AggregatingFunction aggFunction
	repeated sint64 data
}

As you can see, this scheme is completely generic and self-contained, signifying that each time series contains all the relevant data we need to work with it, such as start and end time, aggregation function, bucket size and series data.
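
To make that concrete, here is a minimal sketch of the bucket arithmetic a dense series needs, assuming plain Java fields in place of our actual generated Protobuf classes and utility library; the start time and interval stored with the series are enough to resolve any timestamp to its bucket.

	// Illustrative only: resolving timestamps to pre-allocated buckets in a dense series.
	final class DenseSeriesSketch {
	    private final long startMillis;     // from the series' TimeRange
	    private final long intervalMillis;  // e.g. 15,000 for 15-second buckets
	    private final long[] data;          // pre-allocated, one slot per bucket

	    DenseSeriesSketch(long startMillis, long endMillis, long intervalMillis) {
	        this.startMillis = startMillis;
	        this.intervalMillis = intervalMillis;
	        // A full day at 15-second resolution pre-allocates 5,760 buckets.
	        this.data = new long[(int) ((endMillis - startMillis) / intervalMillis)];
	    }

	    private int bucketIndex(long timestampMillis) {
	        return (int) ((timestampMillis - startMillis) / intervalMillis);
	    }

	    void update(long timestampMillis, long value) {
	        data[bucketIndex(timestampMillis)] += value;  // SUM-style aggregation
	    }

	    long get(long timestampMillis) {
	        return data[bucketIndex(timestampMillis)];
	    }
	}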

In contrast to the dense time series, we also collect sparse data, which may exist at only a handful of data points. One example is the number of requests from a specific country in a given time frame. For sparse data we used a different scheme, which looks more like a hash map:

message SparseTimeBasedMap {
	optional TimeRange    timerange
	optional sint64       intervalMillis
	repeated TimeSlice    slices
}

message TimeSlice {
	repeated MapEntry    entries
}

message MapEntry {
	optional string    key
	repeated sint64    counter
}
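
As a rough illustration of how this shape is used, the sketch below keeps one lazily created map per time slice and counts requests per country; the class and method names are hypothetical and are not our actual utility classes.

	// Illustrative only: sparse, per-slice counters keyed by string (e.g. country code),
	// mirroring the SparseTimeBasedMap / TimeSlice / MapEntry shape above.
	import java.util.HashMap;
	import java.util.Map;

	final class SparseCountersSketch {
	    private final long startMillis;
	    private final long intervalMillis;
	    // Only slices that actually saw data get a map -- that is the "sparse" part.
	    private final Map<Integer, Map<String, Long>> slices = new HashMap<>();

	    SparseCountersSketch(long startMillis, long intervalMillis) {
	        this.startMillis = startMillis;
	        this.intervalMillis = intervalMillis;
	    }

	    private int sliceIndex(long timestampMillis) {
	        return (int) ((timestampMillis - startMillis) / intervalMillis);
	    }

	    void increment(long timestampMillis, String key, long count) {
	        slices.computeIfAbsent(sliceIndex(timestampMillis), s -> new HashMap<>())
	              .merge(key, count, Long::sum);
	    }

	    long get(long timestampMillis, String key) {
	        return slices.getOrDefault(sliceIndex(timestampMillis), Map.of())
	                     .getOrDefault(key, 0L);
	    }
	}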

With the new schemes, a dense time series object covering a single day at 15-second resolution (5,760 buckets) consumes only 58 KB of RAM. Compare this to the 155 KB that would have been consumed if we had stored the same data in the old scheme.

By using the new scheme, we managed to gain a 62 percent memory usage reduction, which is pretty cool! 😉

This means that we can process far more data than we used to, and we are doing so without changing any system architecture or hardware.

With the new scheme we also gain a lot from Google Protobuf's space efficiency when the data is stored to disk. For example, a data file containing 50 daily time series with 15-second buckets, which takes 2.9 MB of RAM, is stored in just 283 KB on disk, compared to the 781 KB it would have consumed with the old scheme.
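
The compactness comes from Protobuf's binary wire format, which encodes the repeated sint64 values as variable-length integers. As a minimal sketch, persisting a generated message with the standard Protobuf Java API could look like the following, where DenseNumericTimeSeries stands in for the generated class and the file handling is only illustrative.

	// Illustrative only: saving and loading a generated Protobuf message on disk.
	import java.io.FileInputStream;
	import java.io.FileOutputStream;
	import java.io.IOException;

	final class SeriesFileStore {
	    static void save(DenseNumericTimeSeries series, String path) throws IOException {
	        try (FileOutputStream out = new FileOutputStream(path)) {
	            series.writeTo(out);  // compact binary wire format
	        }
	    }

	    static DenseNumericTimeSeries load(String path) throws IOException {
	        try (FileInputStream in = new FileInputStream(path)) {
	            return DenseNumericTimeSeries.parseFrom(in);
	        }
	    }
	}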

How Can We Use It Easily?

The generic scheme allowed us to develop a set of utilities which handles the series’ inner logic such as bucket creation and aggregation.

For instance, this is how we create a day-long time series with 15-second buckets:

DenseNumericTimeSeries bandwidthSeries = DenseNumericTimeSeries.newSeries( oneDayTimeRange, fifteenSecInterval, AggregatingFunction.SUM );

Updating and retrieving data is super easy as well:

bandwidthSeries.update( timestamp, value );
bandwidthSeries.get( timestamp );

The time series utility library also supports aggregating several time series, so we can keep the data at any granularity we choose and later aggregate it to fit our needs, as in the example below.
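
For illustration only, combining or rolling up series might look something like the calls below; the method names and signatures here are hypothetical, not the library's actual API.

	// Hypothetical usage -- aggregate() and aggregateTo() are illustrative names only.
	DenseNumericTimeSeries totalBandwidth = DenseNumericTimeSeries.aggregate(
	        AggregatingFunction.SUM, siteOneBandwidthSeries, siteTwoBandwidthSeries );

	DenseNumericTimeSeries fiveMinuteBandwidth =
	        bandwidthSeries.aggregateTo( fiveMinInterval, AggregatingFunction.SUM );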

Wrapping It Up

Massive scaling can be a real pain, as you know, but it can also be a chance to rethink and reinvent. The solution can be more sophisticated than simply adding more hardware. In our time series use case we managed to scale by a factor of 240, with an enormous gain in memory and storage efficiency.

Since we introduced the new model and its utilities, collecting and working with time series data in Incapsula has become something developers can do on the fly. The best part: we have solved many pain points and built an infrastructure that serves our needs perfectly.