
Optimizing Our Data Flow: Backstage Insights

Not long ago we announced our new dashboard, which was designed from the ground up using a combination of HighCharts visualization and our own proprietary Graceland data processing and storage technology.

Development of these new dashboards presented several technological challenges, which forced us to re-think some of our data handling policies, and caused us to become much more mindful of query response times, data processing cycles and – most importantly – overall scalability. Today we would like to share some of the ‘backstage insights’ of that development process.

The Challenge

The conventional approach to Big Data management would call for methodical gathering of logs containing information about HTTP requests and responses, client IP addresses, client types (browser, bot, search engine, etc.), and so on. These logs would then be consolidated into large centralized data structures which, in turn, would feed the information to our application.

For us, the downside of such an approach was the load it put on the centralized data hub. Our main concern was scalability: the processing load would grow in direct proportion to the overall amount of traffic we served.

For example, we knew that when one of the sites using our service comes under a DDoS attack, the volume of the log data would drastically increase, as would the amount of data processing resources required. Combined with the already rapid growth of our customer base, we felt that such scenarios would mean constantly dealing with ‘data spikes’, which would only grow larger as more and more clients onboarded our services.

The ‘classic’ Big Data approach would suggest compensating for the extra load by constantly increasing the computing power of our central data hubs. However, we looked at our network and saw an opportunity to provide a more intelligent solution.

Scaling Up with Split Processing

The solution we devised focused on more efficient resource utilization. Specifically, we noticed that we had a lot of computing power available on the proxy servers at the edge of our network.

Instead of having our network indiscriminately upload all log data to our centralized Graceland storage, we decided to leverage our already available resources by adopting a two-step process:

  1. Traffic data is initially processed at the ‘edge’, locally on each of our proxies.
  2. The proxies transmit summarized logs to central Graceland nodes, where they are aggregated into dashboards and stored for future reference.

By splitting the processing tasks we also split the load. As a result, the centralized Graceland hub now deals with only a small fraction of all processing tasks, handling only short summarized logs that may look something like this:

siteID: 123456
startTime: 1386018000000
endTime: 1386018060000
numRequests: 1650
numBotRequests: 120
requestsByCountry: {
    US: 340
    CA: 180
    CN: 130
    ...
}

This log tells Graceland that from 21:00 to 21:01 GMT on Dec 2, 2013, this particular proxy processed 1,650 HTTP requests for site 123456; 120 of them were identified as bots, while 340 came from humans in the US, 180 from humans in Canada, 130 from humans in China, and so on.

Even if the site in question were being hit with 3,000,000 DDoS requests per minute, it would take roughly the same number of bytes to represent the extra traffic. Such an approach helps us even out the data flow and provides a much more efficient alternative to bombarding our central storage with 3,000,000 separate un-processed log files.
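To make the idea concrete, the edge-side aggregation can be pictured as a rolling per-minute counter. The following is a minimal Python sketch, not Incapsula's actual implementation; the field names simply mirror the sample log above:

from collections import Counter

class MinuteAggregator:
    """Accumulates per-site request counters for a one-minute window (illustrative only)."""

    WINDOW_MS = 60 * 1000

    def __init__(self, site_id, start_time_ms):
        self.site_id = site_id
        self.start_time = start_time_ms
        self.num_requests = 0
        self.num_bot_requests = 0
        self.requests_by_country = Counter()

    def record(self, country, is_bot):
        # Called once per proxied HTTP request; cost per request is O(1),
        # and the size of the accumulated state does not grow with traffic volume.
        self.num_requests += 1
        if is_bot:
            self.num_bot_requests += 1
        else:
            self.requests_by_country[country] += 1

    def summary(self):
        # Fixed-size summary shipped upstream to the central Graceland nodes.
        return {
            "siteID": self.site_id,
            "startTime": self.start_time,
            "endTime": self.start_time + self.WINDOW_MS,
            "numRequests": self.num_requests,
            "numBotRequests": self.num_bot_requests,
            "requestsByCountry": dict(self.requests_by_country),
        }

Whether the proxy sees 1,650 requests in that minute or 3,000,000, the summary it ships upstream remains the same handful of counters.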

The format of these log files is Google Protobuf, which is very space-efficient compared to other popular data transfer formats (such as XML or JSON).

In addition, Protobuf is highly compressible by popular compression algorithms such as gzip or bz2. As you can see in the following screenshot, depending on the compression algorithm, the file shrinks to between a quarter and a third of its original size.

[Screenshot: Google Protobuf compression sample]
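For readers who want to reproduce a similar comparison, here is a hypothetical Python sketch that measures how well an already-serialized payload compresses with the standard gzip and bz2 modules. The stand-in payload below is ours, not a real Graceland log:

import bz2
import gzip

def compression_report(payload: bytes) -> dict:
    """Compare the raw payload size against its gzip- and bz2-compressed sizes."""
    gz = gzip.compress(payload)
    bz = bz2.compress(payload)
    return {
        "raw_bytes": len(payload),
        "gzip_bytes": len(gz),
        "gzip_ratio": round(len(payload) / len(gz), 2),
        "bz2_bytes": len(bz),
        "bz2_ratio": round(len(payload) / len(bz), 2),
    }

if __name__ == "__main__":
    # Stand-in data; in practice this would be a serialized Protobuf summary log.
    sample = b"siteID=123456 numRequests=1650 numBotRequests=120 " * 1000
    print(compression_report(sample))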

Need-driven Data Management

As mentioned above, Incapsula builds most of its own software, and we always try to make it as simple as possible. As a result, our dashboard database is as simple as it gets: an Nginx web server serving binary Protobuf log files. We store a small number of un-compressed daily logs for each of the websites on our service. Next to them, we also store a larger compressed log archive file. This storage system is based on FIFO (First In, First Out) logic, where the oldest single-day log is constantly archived to make way for the most recent batch of daily traffic data.

[Diagram: Incapsula's dashboard data flow, from edge to browser]
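As an illustration of the rotation logic, here is a simplified Python sketch. The per-site directory layout, the date-based '.pb' file names and the seven-day hot window are assumptions made for the example, and unlike our real setup it compresses each retired day separately rather than appending to a single archive file:

import gzip
import os

MAX_UNCOMPRESSED_DAYS = 7  # assumption: keep roughly a week of "hot" daily logs

def rotate_daily_logs(site_dir: str) -> None:
    """Compress the oldest un-compressed daily log once the hot window is exceeded."""
    # Assumes date-named files (e.g. 2013-12-02.pb), so lexicographic order is chronological.
    daily = sorted(f for f in os.listdir(site_dir) if f.endswith(".pb"))
    while len(daily) > MAX_UNCOMPRESSED_DAYS:
        oldest = daily.pop(0)
        src = os.path.join(site_dir, oldest)
        with open(src, "rb") as raw, gzip.open(src + ".gz", "wb") as archived:
            archived.write(raw.read())
        os.remove(src)  # the un-compressed copy makes way for the newest day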

The idea here is to mirror the usage habits of our typical user, most of which revolve around short-term data extraction. Thus, by storing an un-compressed version of the most recent daily reports, we ensure swift responses to over 85% of all dashboard queries.

This concept of need-driven data management also extends to the way the data is organized within the log files. There, the data is split into three groups:

  • High resolution (5 minute data-points) – Contains detailed request-level information.
  • Medium resolution (10 minute data-points) – Contains visitor and security data, most of which is collected on a session level.
  • Low resolution (60 minute data-points) – Contains macro-data, like information about traffic geo-distribution.
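To visualize how such a split might look in memory, here is a purely hypothetical Python sketch of one site's daily log; it is not the actual Protobuf schema, only an illustration of the three resolution groups:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DailySiteLog:
    """Illustrative in-memory view of one site's daily log, split by resolution."""
    site_id: int
    # High resolution: 5-minute data points with request-level detail (288 per day).
    high_res: List[Dict] = field(default_factory=list)
    # Medium resolution: 10-minute data points with visitor/security (session-level) data (144 per day).
    medium_res: List[Dict] = field(default_factory=list)
    # Low resolution: 60-minute data points with macro data such as geo-distribution (24 per day).
    low_res: List[Dict] = field(default_factory=list)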

When a user accesses the dashboard, several HTTP requests are sent from our application server to one of Graceland’s Nginx web servers. Using HTTP headers we specify exactly what combination of ranges is needed, which allows us to speed up data extraction by selectively pulling the data from the log files. As an additional optimization measure, we locally cache the extracted data on the application server, so that subsequent similar requests do not require any additional remote log access.
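A hedged sketch of that extraction path is shown below, assuming standard HTTP Range requests against the Nginx log server and a simple in-process cache on the application server; the host name, URL layout and byte offsets are made up for the example:

from functools import lru_cache
from urllib.request import Request, urlopen

GRACELAND_BASE = "http://graceland.internal"  # hypothetical internal host

@lru_cache(maxsize=4096)
def fetch_log_slice(site_id: int, day: str, byte_range: str) -> bytes:
    """Pull only the needed slice of a daily log; repeated queries hit the local cache."""
    url = f"{GRACELAND_BASE}/logs/{site_id}/{day}.pb"  # illustrative path scheme
    req = Request(url, headers={"Range": f"bytes={byte_range}"})
    with urlopen(req) as resp:  # Nginx answers a Range request with 206 Partial Content
        return resp.read()

# e.g. pull just the low-resolution (hourly) section for one dashboard widget:
# geo_blob = fetch_log_slice(123456, "2013-12-02", "1048576-1310719")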

Investing in Data Flow

Data visualization is an important part of any website management service, especially one that deals with mission-critical aspects of its security and availability. Although often overlooked in favor of more ‘hands-on’ features, efficiently visualized data supports our data-driven decision-making while also making otherwise intangible service capabilities easier to grasp.

For us, the ability to scale up our data management is closely linked to our ability to evolve as a security service and as a company. Thus, by intelligently utilizing already available computing resources, and by addressing the practical needs of our clients, we were able to create a sustainable and scalable data flow which will support our growth for years to come.