The Catch 22 of Base64: Attacker Dilemma from a Defender Point of View
Web application threats come in different shapes and sizes. These threats mostly stem from web application vulnerabilities, published daily by the vendors themselves or by third-party researchers, followed by vigilant attackers exploiting them.
To cover their tracks and increase their attack success rate, hackers often obfuscate attacks using different techniques. Obfuscation of web application attacks can be extremely complicated, involving custom encoding schemes built by the attacker to suit a specific need. Alternatively, as described in our recent research on a spam campaign, it can be as simple as taking common encoding schemes and re-encoding the attack payload multiple times.
In this blog post, we’ll dive deep into one of the simplest obfuscation techniques commonly used by web application attackers – Base64 – and uncover some of the traits making it so unique and interesting from the defender perspective.
What is Base64?
Base64 is an encoding mechanism used to represent and stream binary data over mediums limited to printable characters only. The name Base64 comes from the fact that each output character represents 6 bits, hence there are 2^6 = 64 characters that can be used: lowercase and uppercase letters, digits, and the “+” and “/” signs.
Originally, Base64 encoding was used to safely transfer email messages, including binary attachments, over the web. Today, Base64 encoding is widely used to transfer any type of binary data across the web as a means to ensure data integrity at the recipient.
In short, Base64 takes three 8-bit ASCII characters as input, 24 bits in total. It then splits these 24 bits into four groups of six bits each and translates each 6-bit group into a character using the Base64 encoding table. If the final group of the input has fewer than three characters, the encoder pads the output with the “=” sign.
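To make the 24-bit split concrete, here is a minimal Python sketch of the step described above. The helper name encode_triplet is ours and purely illustrative; real code should simply call the standard library's base64.b64encode.

```python
import base64
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_triplet(triplet: bytes) -> str:
    """Encode exactly three bytes into four Base64 characters (illustrative helper)."""
    if len(triplet) != 3:
        raise ValueError("expected exactly three bytes")
    # Join 3 x 8 bits into one 24-bit number.
    bits = int.from_bytes(triplet, "big")
    # Split the 24 bits into four 6-bit groups, most significant group first.
    groups = [(bits >> shift) & 0b111111 for shift in (18, 12, 6, 0)]
    # Look each 6-bit value up in the Base64 table.
    return "".join(ALPHABET[g] for g in groups)


print(encode_triplet(b"Man"))                 # TWFu
print(base64.b64encode(b"Man").decode())      # TWFu (standard library, for comparison)
```

The sketch handles only a full three-byte group; padding of shorter groups is left out on purpose.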
Since Base64 is commonly used to encode and transfer data over the web, security controls often decode the traffic as a preprocessing step just before analyzing it. Unfortunately, this encoding technique is often abused and used to carry obfuscated malicious payloads disguised as legitimate Base64-encoded content.
Attacks Encoded in Base64 – The Tells
While Base64 encoding is very useful for transferring binary data over the web, there is no practical need to encode the same text multiple times. Attackers, however, commonly obfuscate their attacks by doing exactly that, to the extent of encoding an attack a few dozen times to evade detection.
Thanks to some interesting characteristics of Base64, however, encoding the attack payload multiple times in Base64 actually makes things worse for the attacker and easier for the defender.
1. Inflated Output Size
Every three 8-bit characters encoded in Base64 are transformed into four 6-bit characters, which is why repeated Base64 encoding inflates the output. More precisely, the output grows exponentially, multiplying by roughly 4/3 (about 1.3333) with each encoding (see Figure 1).
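The growth rate is easy to reproduce. The following sketch (the function name encoded_sizes and the 300-byte sample are our own choices) records the payload size after each round of encoding:

```python
import base64


def encoded_sizes(payload: bytes, rounds: int) -> list[int]:
    """Return the payload size before encoding and after each Base64 round."""
    sizes = [len(payload)]
    for _ in range(rounds):
        payload = base64.b64encode(payload)
        sizes.append(len(payload))
    return sizes


sizes = encoded_sizes(b"A" * 300, 10)
for n, (prev, cur) in enumerate(zip(sizes, sizes[1:]), start=1):
    # The ratio hovers around 4/3; padding makes it slightly larger in some rounds.
    print(f"round {n}: {cur} bytes (x{cur / prev:.4f})")
```

Because of padding, the per-round factor is 4/3 only when the input length is a multiple of three; otherwise it is marginally higher.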
2. Fixed Prefix
A unique attribute of Base64 encoding is that any text encoded enough times ends up with the same prefix. That prefix always starts with “Vm0wd”. It appears whenever Base64 encoding is applied repeatedly, and it grows longer as more encodings are done (Figures 2 and 3).
For more details on the fixed prefix, including why it always appears no matter the input and the rate at which its size increases, see the detailed Technical Appendix.
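A quick way to see the fixed prefix is to encode a few unrelated inputs many times and look at how the results begin. The sample inputs and the choice of 15 rounds below are ours, picked only for illustration:

```python
import base64


def encode_repeatedly(payload: bytes, rounds: int) -> bytes:
    """Apply Base64 encoding to the payload the given number of times."""
    for _ in range(rounds):
        payload = base64.b64encode(payload)
    return payload


samples = [b"hello", b"<script>alert(1)</script>", b"12345"]
for sample in samples:
    encoded = encode_repeatedly(sample, 15)
    # Every sample ends up with the same first five characters.
    print(sample, "->", encoded[:5])
```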
Attacker Lose-Lose Situation
Attackers trying to obfuscate their attacks using multiple Base64 encodings face a problem.
They can encode their attack payload a small number of times, which makes it feasible for the defender to decode and identify. Alternatively, they can encode it many times, generating a very large payload that is infeasible to decode but that also carries a longer fixed Base64 prefix, which is a stronger fingerprint for the defender to detect.
The net net:
Multiple Base64 encoding = Longer fixed prefix = Stronger attack detection fingerprint
There are three primary strategies to consider for mitigation of attacks encoded in Base64:
Attacks encoded multiple times in Base64 may be mitigated by decoding the input several times until the real payload is revealed. This method might seem to work, but it opens a door for another vulnerability – Denial of Service (DoS).
Decoding a very long text multiple times may take a lot of time. While attackers need to create the long encoded attack only once, the defender must decode it on every incoming request in order to identify and mitigate the attack in full.
Thus, decoding the input several times opens the door for attackers to launch DoS attacks by sending several long encoded texts. Additionally, even if the defender decodes the input many times, say ten, the attacker can just encode the attacks once more and evade detection.
So, decoding the input multiple times is neither sufficient nor efficient when the attacks are encoded multiple times. Specifically, in the case of Base64, thanks to the special characteristics of the encoding scheme, there are other ways to mitigate multiple encodings.
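To illustrate the trade-off, here is a hedged sketch of bounded multiple decoding. The function name decode_layers and the limits max_depth and max_len are hypothetical values of ours, not settings of any product. The caps keep the work per request bounded, but they also show the weakness described above: an attacker who encodes one more time than max_depth is not fully decoded.

```python
import base64
import binascii


def decode_layers(value: bytes, max_depth: int = 5, max_len: int = 8192):
    """Peel off up to max_depth Base64 layers (illustrative limits).

    Returns the last successfully decoded value and the number of layers removed.
    """
    if len(value) > max_len:
        # Refuse oversized inputs instead of spending time decoding them.
        raise ValueError("input exceeds size limit; refusing to decode")
    layers = 0
    while layers < max_depth:
        try:
            decoded = base64.b64decode(value, validate=True)
        except binascii.Error:
            break  # no longer valid Base64: assume the payload is revealed
        if not decoded:
            break  # empty result: nothing more to peel
        value = decoded
        layers += 1
    return value, layers


payload = b"<script>alert(1)</script>"
encoded = payload
for _ in range(3):
    encoded = base64.b64encode(encoded)
print(decode_layers(encoded))  # three layers removed, payload revealed
```

Note that a plain payload that happens to be valid Base64 itself would be decoded further, which is one more reason this approach is not sufficient on its own.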
Suspicious Content Detection
As described above, more Base64 encoding = longer fixed prefix = stronger attack detection fingerprint. Accordingly, defenders can easily detect and mitigate attacks heavily obfuscated by multiple Base64 encodings.
A web application firewall (WAF) can offer protection based on this detection. Imperva’s cloud and on-prem WAF customers are protected out of the box from these attacks. The protection uses the fixed prefix fingerprint phenomenon and relies on the assumption that legitimate users have no practical need to encode the same text multiple times.
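As a rough illustration of the idea only (this is not how any particular WAF implements it), a prefix check can be as small as the sketch below. The function name looks_multiply_encoded and the choice of “Vm0wd” as the minimum fingerprint are our assumptions; a longer prefix would reduce false positives at the cost of only catching more heavily encoded payloads.

```python
import base64

# Assumption: the five-character prefix named in this post is used as the fingerprint.
FIXED_PREFIX = "Vm0wd"


def looks_multiply_encoded(value: str, prefix: str = FIXED_PREFIX) -> bool:
    """Flag values that begin with the fixed prefix of repeated Base64 encoding."""
    return value.startswith(prefix)


attack = b"<script>alert(1)</script>"
once = base64.b64encode(attack).decode()
many = attack
for _ in range(15):
    many = base64.b64encode(many)

print(looks_multiply_encoded(once))           # False
print(looks_multiply_encoded(many.decode()))  # True
```

The check runs in constant time regardless of payload size, so it avoids the decoding cost discussed earlier.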
Abnormal Requests Detection
As discussed earlier, more rounds of Base64 encoding mean a larger payload. Consequently, defenders can determine the expected size of a legitimate incoming payload, parameter, or header value, and block inflated payloads that exceed the predefined limits.
Imperva’s cloud and on-prem WAF customers are protected out of the box here as well, through a combination of web application profiling, which learns the application’s incoming traffic over time and identifies abnormalities when they occur, and HTTP hardening policies, which block illegal protocol behavior such as abnormally long requests.
Base64 is a popular encoding used to transfer data over the web. It is also often used by attackers to obfuscate their attacks by encoding the same payload several times. Due to some of the characteristics of Base64 encoding, it is possible to detect and mitigate attacks that are obfuscated with several Base64 encodings. To read more about these characteristics see the technical appendix. You can also read more about mitigation techniques using a web application firewall.
How Base64 Works
The basic idea behind the Base64 encoding technique is to take three characters, each represented in 8 bits, and turn them into four characters, each represented in 6 bits.
In more detail, assume we get three characters in ASCII. Each character is mapped to an 8-bit number between 0 and 255 based on the ASCII table (see Figure 4). We take the 8-bit representations of the three characters and join them together to get 24 bits. Next, we split these 24 bits into four groups of 6 bits each, and translate each group using the Base64 table (Figure 5). Each 6-bit group has 64 possible values (hence the name Base64), and each value maps to one character: a digit, a lowercase or uppercase letter, or one of the symbols ‘+’ and ‘/’.
Overall, Base64 encoding splits the input text into groups of three characters and encodes each group as described above. At the end of the process, we might be one or two characters short of completing the last trio. To solve this, the encoding appends one or two zero-valued bytes (not the digit ‘0’) to complete the last 3-byte group. The output characters that correspond entirely to this padding are then written as ‘=’. That is why Base64-encoded text sometimes ends with one or two ‘=’ characters.
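The whole process, including padding, fits in a few lines of Python. This is a teaching sketch under our own naming (encode_with_padding); it is checked against the standard library's base64.b64encode rather than meant to replace it.

```python
import base64
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_with_padding(data: bytes) -> str:
    """Base64-encode data group by group, padding the last group if needed."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)              # 0, 1 or 2 bytes short
        padded = chunk + b"\x00" * missing    # fill up with zero-valued bytes
        bits = int.from_bytes(padded, "big")
        chars = [ALPHABET[(bits >> shift) & 0b111111] for shift in (18, 12, 6, 0)]
        if missing:
            # Characters that come entirely from padding are written as '='.
            chars[-missing:] = "=" * missing
        out.append("".join(chars))
    return "".join(out)


for sample in (b"M", b"Ma", b"Man"):
    print(sample, "->", encode_with_padding(sample))
# b'M' -> TQ==, b'Ma' -> TWE=, b'Man' -> TWFu
```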
Here is an example of how Base64 works on a simple three-character word (Figure 6):
The fixed prefix
No matter what string is encoded, after encoding to Base64 multiple times, we always end up with the same fixed prefix, which starts with: “Vm0wd”. The reason for this phenomenon is the way the encoding works, and surprisingly, how the letter ‘V’ behaves under the encoding.
First, let’s try to encode the letter ‘V’ using Base64. In ASCII, the letter ‘V’ is 86, which in 8-bit representation is 01010110. Since we are interested only in the prefix, we ignore the padding and take only the first 6 bits of the representation, which are 010101. In the Base64 table, this is index 21, which surprisingly is also ‘V’. This means that every time we encode anything that starts with the letter ‘V’, we end up with an encoded string that also starts with ‘V’ (!). This is a never-ending loop.
|Letter|ASCII (8 bits)|Base64 (6 bits)|
|V|01010110 (86)|010101 (21)|
After checking the rest of the characters, we found that ‘V’ is the only one with this special attribute. So, ‘V’ is the only character that, when placed at the beginning of a string, still appears at the beginning of the encoded string.
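This claim can be checked exhaustively, because the first output character depends only on the first input byte. The sketch below tries every possible byte value; the helper name first_encoded_char is ours.

```python
import base64


def first_encoded_char(byte_value: int) -> str:
    """Return the first Base64 character produced for an input starting with this byte."""
    return base64.b64encode(bytes([byte_value])).decode("ascii")[0]


# A "fixed point" is a byte that encodes to a string starting with the same character.
fixed_points = [chr(b) for b in range(256) if first_encoded_char(b) == chr(b)]
print(fixed_points)  # ['V']
```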
The next question is whether encoding an arbitrary string with Base64 will always yield a string that starts with ‘V’ after a few encodings. The answer is yes.
Below is a graph showing, bottom up, the Base64 re-encoding outcome for each readable ASCII character and digit. Each color represents the encoding distance to ‘V’: blue – four encoding iterations; green – three encoding iterations; yellow – two encoding iterations; orange – one encoding iteration. For instance, it takes four encoding iterations to get to ‘V’ from ‘k’ (k->a->Y->W->V) and two iterations from ‘P’ (P->U->V). Overall, the minimal number of iterations to reach ‘V’ is, of course, 0 (‘V’->‘V’), while the maximum is 5 (for instance, starting with the character with code 128: ->w->d->Z->W->V).
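The distances can be recomputed with a short script. The function name steps_to_v and the safety limit are our own. One assumption is worth stating: the chain given for character 128 matches what we get when that character is encoded as UTF-8 (its first byte is then 0xC2), so the sketch uses that encoding for the last example.

```python
import base64


def steps_to_v(first_byte: int, limit: int = 10) -> int:
    """Count Base64 rounds until an input starting with this byte starts with 'V'."""
    data = bytes([first_byte])
    steps = 0
    while data != b"V":
        if steps >= limit:
            raise RuntimeError("did not reach 'V' within the limit")
        # Only the first byte matters for the first output character.
        data = base64.b64encode(data)[:1]
        steps += 1
    return steps


print(steps_to_v(ord("k")))                         # 4  (k->a->Y->W->V)
print(steps_to_v(ord("P")))                         # 2  (P->U->V)
print(steps_to_v("\x80".encode("utf-8")[0]))        # 5  (first UTF-8 byte of char 128)
print(max(steps_to_v(b) for b in range(256)))       # 5
```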
After the ‘V’ in the prefix is set, more encodings will result in longer fixed prefixes. We tested all the available characters and saw that it takes at most two more encodings to get the next prefix character “m”, and at most two more encodings to get the next character “0”.
Before going forward to longer prefixes let’s try to understand why this phenomenon happens. We take the string ‘Vm0’ and encode it using Base64:
What happened here is that the first 6 bits of ‘V’ in its 8-bit representation are exactly its 6-bit representation. Next, the remaining 2 bits of ‘V’ together with the first 4 bits of the 8-bit representation of ‘m’ give exactly the 6-bit representation of ‘m’. The same logic applies to ‘0’. Note that 6 bits are left over, and they are the 6-bit representation of ‘w’. In other words, what makes the ‘Vm0’ prefix special is that the bit string of its 8-bit representation begins with its own 6-bit representation.
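The bit-level view of this step can be printed directly. In the sketch below, six_bit_groups is an illustrative helper of ours that regroups the 8-bit representation into 6-bit chunks:

```python
import base64


def six_bit_groups(data: bytes) -> list[str]:
    """Write the bytes as one bit string and cut it into 6-bit groups."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return [bits[i:i + 6] for i in range(0, len(bits), 6)]


text = b"Vm0"
print([f"{byte:08b}" for byte in text])  # ['01010110', '01101101', '00110000']
print(six_bit_groups(text))              # ['010101', '100110', '110100', '110000']
print(base64.b64encode(text))            # b'Vm0w'
```

The four groups are the table indexes 21, 38, 52 and 48, that is ‘V’, ‘m’, ‘0’ and ‘w’.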
Inflation of the prefix
It is noteworthy that after encoding the first three letters of the fixed prefix, 6 bits are left over. These 6 bits determine the next letter of the prefix. In fact, for every three letters in the fixed prefix, encoding leaves an extra 6 bits, which determine one extra character of the prefix. This means that with each additional encoding, the fixed prefix grows by the number of letters in the prefix divided by three. For example, if there are nine characters in the fixed prefix, then after another encoding there will be twelve.
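The inflation can be observed by tracking how many leading characters a set of unrelated inputs share after each round. The sample inputs, the 20 rounds, and the helper names below are our own choices for illustration; the measured lengths depend on the inputs, so no specific values are claimed here.

```python
import base64


def common_prefix_len(items: list[bytes]) -> int:
    """Length of the longest prefix shared by all items."""
    n = 0
    for column in zip(*items):
        if len(set(column)) != 1:
            break
        n += 1
    return n


def common_prefix_lengths(samples: list[bytes], rounds: int) -> list[int]:
    """Shared prefix length of the samples after each Base64 round."""
    lengths = []
    current = list(samples)
    for _ in range(rounds):
        current = [base64.b64encode(s) for s in current]
        lengths.append(common_prefix_len(current))
    return lengths


samples = [b"hello", b"<script>alert(1)</script>", b"12345"]
for round_no, length in enumerate(common_prefix_lengths(samples, 20), start=1):
    print(f"round {round_no}: shared prefix of {length} characters")
```

Once the shared prefix is established, its length never shrinks, since n shared bytes always yield at least n shared output characters.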