New tool for Incapsula customers: XRAY for improved visibility

XRAY: Troubleshooting a Reverse Proxy in the Cloud

Reverse Proxy — a Visibility Liability

A recent study by w3techs shows reverse proxy technology is gaining further adoption among web applications. Since the beginning of 2016 the number of websites using a reverse proxy-based service has steadily grown from 5.5 percent to 8.3 percent, an impressive 51 percent increase in adoption.

 

This architectural design is the core of today’s CDN and web security services. The many benefits of reverse proxies include

  • Improved performance through caching
  • Enhanced security through a web application firewall (WAF)

On the flip side, reverse proxies reduce visibility, which calls for new troubleshooting methodologies. In this post, I’d like to focus on how a reverse proxy service limits the ability to troubleshoot production environments and to quickly identify unexpected issues. I’ll also share some of the lessons we learned and what we’ve done to help our customers overcome such challenges. Spoiler alert: It’s called XRAY.

The Complexity of the Reverse Proxy Architecture

A reverse proxy is an intermediate server that processes requests from end users as if it were an origin server and requests resources from the origin as if it were an end user. The platform creates a complete separation between the user and the origin, so the two never speak to each other directly. If we examine the challenges of such a design pattern we can identify seven key areas:

A. Losing visibility of what’s behind the curtain

From an end-user perspective, as seen by RUM (real user monitoring) tools, visibility stops at the application edge. These tools and methods have no way to understand behavior or performance between the edge and the origin. For example, when performance degrades on a website that uses a reverse proxy or CDN, the cause is usually one of two things: an application problem, such as increased server load, or a connectivity issue that may be related to internet routing. The two are very different and call for completely different approaches.

B. Non-Deterministic Caching

When adding a reverse proxy such as Varnish, NGINX or HAProxy it’s usually to

  1. Enhance end-user experience through reduced latency, and
  2. Improve infrastructure utilization by reducing compute load on the origin

However, by caching static content you are effectively adding an extension of your application that sits outside your web and application server stack.

This means that many requests never reach your origin, because of the caching logic that controls

  1. What is being cached
  2. For how long, and
  3. Under what circumstances a request should be passed to the origin or not
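The three decisions above can be illustrated with a minimal sketch. It is not how any particular proxy implements caching: the suffix list, the fixed TTL and the function names are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class CacheEntry:
    body: bytes
    stored_at: float  # time the entry was stored, in seconds
    ttl: float        # how long the entry stays fresh, in seconds


# Hypothetical policy values, chosen only for illustration.
CACHEABLE_SUFFIXES = (".css", ".js", ".png", ".jpg")
DEFAULT_TTL = 300.0


def is_cacheable(method: str, path: str) -> bool:
    """Decision 1: what is being cached (here: static assets fetched with GET)."""
    return method == "GET" and path.endswith(CACHEABLE_SUFFIXES)


def handle(cache: Dict[str, CacheEntry], method: str, path: str, now: float,
           fetch_origin: Callable[[str], bytes]) -> Tuple[bytes, str]:
    """Return (body, source), where source is 'cache' or 'origin'."""
    # Decision 3: non-cacheable requests always go to the origin.
    if not is_cacheable(method, path):
        return fetch_origin(path), "origin"

    entry = cache.get(path)
    # Decision 2: serve from cache only while the entry is still fresh.
    if entry is not None and now - entry.stored_at < entry.ttl:
        return entry.body, "cache"

    # Missing or expired: fetch from the origin and store the result.
    body = fetch_origin(path)
    cache[path] = CacheEntry(body, now, DEFAULT_TTL)
    return body, "origin"
```

Even in this toy version, whether the origin sees a given request depends on the method, the path and the time since the last fetch, which is why origin logs alone cannot explain what a visitor received.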

This caching behavior must be taken into consideration to understand why a website behaves the way it does.

C. Security Policies and Rules

If a reverse proxy such as a WAF is part of the inline inspection of traffic flowing to your website, some requests will be blocked at the security layer. Visibility into the behavior of your site’s security layer is important both for

  1. Forensic analysis and
  2. Testing custom security rules for minimizing false positives

D. Monitoring Logs

All processing layers, both on the origin side and on the reverse proxies fencing it, are usually capable of generating logs, but putting a system in place that can collect and analyze logs from multiple systems requires

  1. An appropriate SIEM aggregator such as Splunk, and
  2. The know-how to operate and build meaningful analytics on top of it

Collecting logs in a distributed manner from multiple systems with minimal delay from real event occurrence adds complexity and cost.

E. Correlating requests

Even once a SIEM system is in place and the data is streamed reliably, it is nearly impossible to extract meaningful cross-system information, because each log is a silo of data. A method to correlate the different logs on each reverse proxy and on the origin itself is required. Otherwise these remain distinct, parallel data sets that never meet.

F. CDNs and the geographic multiplier

To add even more complexity to the mix, imagine you have not a single reverse proxy but 100 or 1,000 or 10,000. Now imagine this network is spread globally across dozens of different locations. Each proxy independently handles the requests that reach it, and which proxy a request reaches depends on DNS GeoIP decisions, internet BGP advertisements and peering relationships between transit providers. What the proxy then does with it depends on caching behavior and security policies. This is exactly what happens if you’re using a content delivery network such as Incapsula.

G. The end user

There’s still one critical factor that we haven’t addressed, the client – your website visitor. The root cause for a large variety of issues is out of your hands when it comes to your website end users. For example, a specific local ISP may have misconfigured Internet routing, causing user traffic to cross oceans and continents for no real reason, which degrades your perceived website performance as a result. When a global CDN network is used, each client may hit different reverse proxies in different PoPs. Each of these proxies may react in a different manner based on preset cache state and policies. If there’s no ability to differentiate between different clients and get insight into a specific client perspective we are in uncharted territory where we have no way to effectively troubleshoot.

The Characteristics of a Complete Solution

Let’s envision how an ideal troubleshooting framework looks. These are the seven characteristics we would expect a complete solution to have:

A. Required granularity – The solution should provide visibility for each request. Aggregations are useful for identifying trends but when it comes to troubleshooting a specific request, an instance-based granularity is mandatory.

B. Required processing data – Each layer should provide just enough information to derive the processing outcome of each request. For example, if a specific request has been manipulated due to an application delivery rule (such as adding a new header) or blocked due to a security rule, the information should be available for analysis.

C. Raw logs – The solution should export or stream logs in standard SIEM formats so it can be easily imported into existing log analysis tools.

D. Common key ID for correlation – Each request should have a persistent request ID that can be used to uniquely identify it. This key must be set the first time the request touches a proxy.

E. ID sharing mechanism – The request ID should be passed along on every hop the request passes through, from the very edge to the origin. This way the logs generated on the origin include the same ID, which allows correlation of data and incident analysis.

F. On-demand end user request data – The solution should provide data on a specific end user at a specific time. For example: which resource does a user in a specific country, under a specific routing regime, get from the cache right now?

G. Geographic dimension – Data on all requests passing through all proxies at all PoPs, as well as data from any relevant end users, should be available for analysis and troubleshooting when needed.
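Requirements D and E can be sketched together in a few lines. This is only an illustration of the idea: the header name `X-Request-ID`, the helper names and the use of a random UUID are assumptions for the example, not a description of any product’s behavior.

```python
import uuid
from typing import Dict, List, Tuple

# Hypothetical header name used only for this sketch.
REQUEST_ID_HEADER = "X-Request-ID"


def ensure_request_id(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers that carries a request ID.

    The ID is created only when it is absent, so the first proxy a request
    touches sets it and every later hop forwards it unchanged.
    (Header names are matched case-sensitively here to keep the sketch short.)
    """
    forwarded = dict(headers)
    if REQUEST_ID_HEADER not in forwarded:
        forwarded[REQUEST_ID_HEADER] = uuid.uuid4().hex
    return forwarded


def hop(name: str, headers: Dict[str, str],
        log: List[Tuple[str, str]]) -> Dict[str, str]:
    """Simulate one hop: tag the request if needed, log it, pass it on."""
    forwarded = ensure_request_id(headers)
    log.append((name, forwarded[REQUEST_ID_HEADER]))
    return forwarded


# Usage: the same ID shows up in the log of every hop.
log: List[Tuple[str, str]] = []
request = {"Host": "example.com"}
for layer in ("edge", "mid-tier", "origin"):
    request = hop(layer, request, log)
```

Because every layer logs the same ID, a single lookup by that ID returns the full path of one request.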

The Incapsula Splunk application

Incapsula XRAY

We are happy to introduce Incapsula XRAY, a tool for customers to get better visibility into incidents and their root causes.

As we expand our PoP footprint, add advanced caching rules and extend IncapRules to more security and application delivery use cases, our customers need effective troubleshooting capabilities to support their rapid release cycles. XRAY is the result of our exploration into ways to improve self-service tools for Incapsula customers.

XRAY gives you an inside look into edge behavior in the Incapsula network using predefined response headers. These include

  • Data on the Incapsula PoP that handles the request
  • The reason for the cache hit or miss
  • The network time between Incapsula and the origin server
  • The processing time on the origin server
  • The IncapRule ID used to block a request

How to Activate XRAY

XRAY is activated by copying an access token from the Management Console to a browser. Once activated, the XRAY debug headers are available for 10 minutes or until the browser session ends.

The XRAY access token is available in the Management Console under Websites > Site Settings > Performance.

Once XRAY is activated, Incapsula adds the following response headers to every response. In a Chrome browser, for instance, you can find these headers in the Developer Tools section:

 

 

| Header | Description |
| --- | --- |
| Incap-PoP | The PoP that handled the request. PoPs are typically named according to the international airport codes for the nearest airport. |
| Incap-Origin-PoP | The origin PoP configured for your origin server. For details on the Origin PoP setting, see Dynamic Content Acceleration. |
| Incap-Proxy-ID | The ID of the proxy that handled the request. |
| Incap-Connection | Indicates if the connection is a new or existing connection between Incapsula and the origin server. |
| Incap-Req-ID | The ID number assigned to a request. It is written in the Incapsula logs, and can be used to connect between an event in the log and a specific request. |
| Incap-Cache-Status | Indicates if the response to the request was returned from the Incapsula cache or from the origin server. hit: content was returned from the cache. miss: content was returned from the origin server. synchronous/asynchronous validation: indicates the cache refresh option that is defined for the site. For more details, see Content Optimization. |
| Incap-Cache-Reason | The reason why there was a cache hit (content returned from the cache) or miss (content returned from the origin server). |
| Incap-Cache-Level | The Incapsula cache level that the content was returned from. |
| Incap-Cache-Duration | The length of time the resource has been in the cache. |
| Incap-Cache-TTL | The length of time the resource will remain in the cache. A negative TTL value indicates that the resource has expired but Incapsula can still serve it from the cache if async validation is used. |
| Incap-Cache-Key | The cache key identifies the specific cached resource. It can be useful to compare the cache keys for a resource where different content was received by different users, such as for users from different geographical locations. |
| Incap-RTT | The time it takes for Incapsula to retrieve a resource from the origin server. The round-trip travel time on the network between Incapsula and the origin server. |
| Incap-cache-tags | The list of tags added to the resource by cache rules. The list includes tags that were defined by the Create Tag and Enrich Cache Key cache rules. It does not contain tags defined by the origin. For details on creating cache rules to tag resources, see Cache Settings. |
| Incap-cache-rules | The list of IDs of cache rules triggered by the request. For details on cache rules, see Cache Settings. |
| Incap-Think-Time | The amount of time that the request is being processed on the origin server. |
| Incap-Blocking | The block type or IncapRules custom ID. The resource may be blocked based on site settings (security/WAF/DDoS), or based on IncapRules. |
| Incap-Redirect | The ID of the redirect rule. The rule ID number is displayed next to the rule name in the Management Console’s Delivery Rules page. |
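As a rough illustration of how these headers might be read programmatically, the sketch below condenses a captured set of response headers into a small summary. The header names come from the list above, but the value formats are assumptions: in particular, it assumes Incap-Cache-TTL is a plain number, which this post does not specify. The function name and sample values are invented for the example.

```python
from typing import Dict, Optional


def summarize_xray(headers: Dict[str, str]) -> Dict[str, Optional[object]]:
    """Condense XRAY debug headers into a small troubleshooting summary."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {name.lower(): value for name, value in headers.items()}

    summary: Dict[str, Optional[object]] = {
        "pop": h.get("incap-pop"),
        "request_id": h.get("incap-req-id"),
        "cache_status": h.get("incap-cache-status"),
        "blocked_by": h.get("incap-blocking"),
    }

    # Assumption: the TTL is a plain number. A negative value means the
    # resource has expired but may still be served under async validation.
    ttl = h.get("incap-cache-ttl")
    if ttl is not None:
        summary["expired_in_cache"] = float(ttl) < 0

    return summary


# Usage with made-up sample values:
sample = {
    "Incap-PoP": "LAX",
    "Incap-Req-ID": "abc123",
    "Incap-Cache-Status": "hit",
    "Incap-Cache-TTL": "-12",
}
print(summarize_xray(sample))
```

The real header values may be formatted differently, so treat the parsing as a placeholder to adapt once you have seen actual responses.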

In our docs page for XRAY you’ll find some debugging examples and sample header values. You’ll also find a description of the XRAY tokenization control mechanism that ensures a malicious party doesn’t have access to the data described above.

Starting today, XRAY is available for all Incapsula customers on all plans.

Incapsula End-to-end Troubleshooting Framework

With the release of XRAY debug headers, Incapsula now fulfills requirements F and G, which completes the list of seven requirements described above.

Requirements A, B, C and D are fulfilled by the Incapsula Logs service, which provides raw logs for every request in LEEF, CEF and W3C formats. It also has predefined dashboard apps for Splunk, HPE ArcSight, McAfee Enterprise Security and Graylog.

Requirement E is fulfilled by sending an Incap-Req-ID header from Incapsula to the origin, which makes it possible to correlate an entry in the origin log with an entry in the Incapsula WebLogs. This option can be enabled in the General site settings section.
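Once both logs carry the same request ID, correlating them is a simple join. The sketch below assumes the log entries have already been parsed into dictionaries; the `request_id` field name and the sample records are hypothetical and do not reflect the actual log formats.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

LogEntry = Dict[str, object]


def correlate(edge_log: List[LogEntry], origin_log: List[LogEntry],
              key: str = "request_id") -> List[Tuple[LogEntry, List[LogEntry]]]:
    """Pair each edge log entry with the origin entries sharing its request ID.

    An edge entry with no origin match is paired with an empty list, which is
    what you would expect when a request was served from cache or blocked.
    """
    origin_by_id: Dict[object, List[LogEntry]] = defaultdict(list)
    for entry in origin_log:
        origin_by_id[entry[key]].append(entry)
    return [(entry, origin_by_id.get(entry[key], [])) for entry in edge_log]


# Usage with made-up records:
edge = [
    {"request_id": "1", "cache": "miss"},
    {"request_id": "2", "cache": "hit"},
]
origin = [{"request_id": "1", "status": 200}]

for edge_entry, origin_entries in correlate(edge, origin):
    print(edge_entry["request_id"], edge_entry["cache"], len(origin_entries))
```

Keeping unmatched edge entries in the result, rather than dropping them, makes it easy to see which requests never reached the origin.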

Here’s how the Incapsula framework looks.

Let us know how XRAY works for you by leaving me a comment.