Bits, Please!: May 2016

In this blog post we'll discover and exploit a vulnerability which will allow us to gain code execution within Qualcomm's Secure Execution Environment (QSEE). I've responsibly disclosed this vulnerability to Google and it has been fixed - for the exact timeline, see the "Timeline" section below.

The QSEE Attack Surface

As we've seen in the previous blog post, Qualcomm's TrustZone implementation enables the "Normal World" operating system to load trusted applications (called trustlets) into a user-space environment within the "Secure World", called QSEE.

This service is provided to the "Normal World" by sending specific SMC calls which are handled by the "Secure World" kernel. However, since SMC calls cannot be invoked from user-mode, all communication between the "Normal World" and a trustlet must pass through the "Normal World" operating system's kernel.

Having said that, regular user-space processes within the "Normal World" sometimes need to communicate with trustlets which provide specific services to them. For example, when playing a DRM protected media file, the process in charge of handling media within Android, "mediaserver", must communicate with the appropriate DRM trustlet in order to decrypt and render the viewed media file. Similarly, the process in charge of handling cryptographic keys, "keystore", needs to be able to communicate with a special trustlet ("keymaster") which provides secure storage and operation on cryptographic keys.

So if communicating with trustlets requires the ability to issue SMCs, and this cannot be done from user-mode, then how do these processes actually communicate with the trustlets?

The answer is by using a Linux kernel device, called "qseecom", which enables user-space processes to perform a wide range of TrustZone-related operations, such as loading trustlets into the secure environment and communicating with loaded trustlets.

However! Although necessary, this is very dangerous; communication with TrustZone exposes a large (!) attack surface - if any trustlet that can be loaded on a particular device contains a vulnerability, we can exploit it in order to gain code execution within the trusted execution environment. Moreover, since the trusted execution environment has the ability to map-in and write to all physical memory belonging to the "Normal World", it can also be used in order to infect the "Normal World" operating system's kernel without there even being a vulnerability in the kernel (simply by directly modifying the kernel's code from the "Secure World").

Because of the dangers outlined above, the access to this device is restricted to the minimal set of processes that require it. A previous dive into the permissions required in order to access the driver has shown that only four processes are able to access "qseecom":

surfaceflinger (running with "system" user-ID)
drmserver (running with "drm" user-ID)
mediaserver (running with "media" user-ID)
keystore (running with "keystore" user-ID)

This means that if we manage to get a hold of any of these four processes, we would then be able to directly attack any trustlet of our choice, directly bypassing the Linux kernel in the process! In fact, this is exactly what we'll do - but we'll get to that later in the series.

For this blog post, let's assume that we already have code-execution within the "mediaserver" process, thus allowing us to directly focus on the attack surface provided by trustlets. Here's an illustration to help visualise the path of the exploit chain we'll cover during the series and the focus of this post:

Vulnerability Scope

I haven't been able to confirm the exact scope of this issue. I've statically checked quite a few devices (such as the Droid Turbo, Nexus 6, Moto X 2nd Gen), and they were all vulnerable. In fact, I believe the issue was very wide-spread, and may have affected most Qualcomm-based devices at the time.

So why was this issue so prevalent? As we'll see shortly, the vulnerability is contained in a trustlet and so does not rely on the TrustZone kernel (which tends to change substantially between SoCs), but rather on code which is designed to be able to execute in the same manner on many different devices. As such, all devices containing the trustlet were made vulnerable, regardless of their SoC.

Also note that on some devices the vulnerable code was present but appeared slightly different (it may have been an older version of the same code). Those devices are also vulnerable, although the indicators and strings you might search for could be slightly different. This means that if you're searching for the exact strings mentioned in this post and don't find them, don't be dissuaded! Instead, reverse-engineer the trustlet using the tools from the previous blog post, and check for yourself.

Enter Widevine

Previously, we decided to focus our research efforts on the "widevine" trustlet, which enables playback of DRM encrypted media using Widevine's DRM platform. This trustlet seems like a good candidate since it is moderately complex (~125KB) and very wide-spread (according to their website, it is available on over 2 billion devices).

After assembling the raw trustlet, we are left with an ELF file, waiting to be analysed. Let's start by taking a look at the function registered by the trustlet in order to handle incoming commands:

As we can see, the first 32-bit value in the command is used to specify the command code, the high-word of which is used to sort the commands into four different categories.

Taking a peek at each of the category-handling functions reveals that the categories are quite rich - all in all, there are about 70 different supported commands - great! However, going over 70 different commands would be a pretty lengthy process - perhaps we can find a shortcut that'll point us in the right direction? For example, maybe there's a category of commands that were accidentally left in even though they're not used on production devices?

Since the libraries which are used to interact with the trustlets are also proprietary, we can't look through the source code to find the answers. Instead, I wrote a small IDAPython script to lookup all the references to "QSEECom_send_cmd", the function used to send commands to trustlets, and check what the "command-code" value is for each reference. Then I simply grouped the results into the categories above, producing the following results:

So... Nobody is using 5X commands. Suspicious!

Sifting through the functions in the 5X category, we reach the following command:

Pretty straight-forward: copies the data from our request buffer into a "malloc"-ed buffer (note that the length field here is not controlled by us, but is derived from the real buffer length passed to QSEOS). Then, the function's flow diverges according to a flag in our request buffer. Let's follow the flow leading to "PRDiagVerifyProvisioning":

Finally, we found a vulnerability!

After some simple validation (such as checking that the first DWORD in the command buffer is indeed zero), the function checks the value of the fourth DWORD in our crafted command buffer. As we can see above, setting that value to zero will lead us to a code-path in which a fully-controlled copy is performed from our command buffer into some global buffer, using the third DWORD as the length argument. Since this code-path only performs the vulnerable memcpy and nothing else, it is much more convenient to deal with (since it doesn't have unwanted side-effects), so we'll stick to this code-path (rather than the one above it, which seems more complex).

Moreover, you might be wondering what is the "global buffer" that's referred to in the function above. After all, it looks a little strange - it isn't passed in to the function at any point, by is simply referred to "magically", by using the register R9.

Remember how the trustlets that we analysed in the previous blog post had a large read-write data segment? This is the data segment in which all the modifiable data of the trustlet is stored - the stack, the heap and the global variables. In order to quickly access this segment from any location in the code, Qualcomm decided to use the platform-specific R9 register as a "global register" whose value is never modified, and which always points to the beginning of the aforementioned segment. According to the ARM AAPCS, this is actually valid behaviour:

What now?

Now that we have a primitive, it's time to try and understand which pieces of data are controllable by our overflow. Again, using a short IDAPython script, we can search for all references to the "global buffer" (R9) which reside after the overflown buffer's start address (that is, after offset 0x10FC). Here are the results:

Disappointingly, nearly all of these functions don't perform any "meaningful" operations of the controllable pieces of data. Specifically, the vast majority of these functions simply store file-system paths in those memory locations, which imply no obvious way to hijack control flow.

Primitive Technology

Since there aren't any function pointers or immediate ways to manipulate control flow directly after the overflown buffer, we'll need to upgrade our buffer overflow primitive into a stronger primitive before we can gain code execution.

Going through the list of functions above, we come across interesting block of data referred to by several functions:

As you can see above, the block of 0x32 DWORDs, starting at offset 0x169C, are used to store "sessions". Whenever a client sends commands to the Widevine trustlet, they must first create a new "session", and all subsequent operations are performed using the specific session identifier issued during the session's creation. This is needed, for example, in order to allow more than a single application to decrypt DRM content at the same time while having completely different internal states.

In any case, as luck would have it, the sessions are complex structures - hinting that they may be used in order to subtly introduce side-effects in our favour. They are also within our line-of-fire, as they are stored at an offset greater than that of the overflown buffer. But, unfortunately, the 0x32 DWORD block mentioned above only stores the pointers to these session objects, not the objects themselves. This means that if we want to overwrite these values, they must point to addresses which are accessible from QSEE (otherwise, trying to access them will simply result in the trustlet crashing).

Finding Ourselves

In order to craft legal session pointers, we'll need to find out where our trustlet is loaded. Exploring the relevant code reveals that QSEOS goes to great lengths in order to protect trustlets from the "Normal World". This is done by creating a special memory region, referred to as "secapp-region", from which the trustlet's memory segments are carved. This area is also protected by an MPU, which prevents the "Normal World" from accessing it in any way (attempting to access those physical addresses from the "Normal World" causes the device to reset).

On the other hand, trustlets reside within the secure region and can obviously access their own memory segments. Not only that, but in fact trustlets can access all allocated memory within the "secapp" region, even memory belonging to other trustlets! However, any attempt to access unallocated memory within the region results in the trustlet immediately crashing.

...Sounds like we're beginning to form a plan!

We can use the overflow primitive in order to overwrite a session pointer to a location within the "secapp" region. Now, we can find a command which causes a read attempt using our poisoned session pointer. If the trustlet crashes after issuing the command, we guessed wrong (in that case, we can simply reload the trustlet). Otherwise, we found an allocated page in the "secapp" region.

But... How do we know which trustlet that page belongs to?

We already have a way to differentiate between allocated and unallocated pages. Now, we need some way to distinguish between pages based on their contents.

Here's an idea - let's look for a function that behaves differently based on the read value in a the session pointer:

Okay! This function tries to access the data at session_pointer + 0xDA. If that value is equal to one, it will return the value 24, otherwise, it will return 35.

This is just like finding a good watermelon; by "tapping" on various memory locations and listening to the "sound" they make, we can deduce something about their contents. Now we just need to give our trustlet a unique "sound" that we can identify by tapping on it.

Since we can only listen to differences between one and non-one values, let's mark our trustlet by creating a unique pattern containing ones and zeros within it. For example, here's a pattern which doesn't occur in any other trustlet:

Now, we can simply write this pattern to the trustlet's data segment by using over overflow primitive, effectively giving it its own distinct "sound".

Finally, we can repeat the following strategy until we find the trustlet:

Randomly tap a memory location in the "secapp" region:

If it sounds "hollow" (i.e., the trustlet crashes) - there's nothing there, so reload our trustlet
Otherwise, tap the sequence of locations within the page which should contain our distinct marking pattern. If it sounds like the pattern above, we found our trustlet

Of course, inspecting the allocation scheme used by QSEOS could allow us to speed things further by only checking relevant memory locations. For example, QSEOS seems to allocate trustlets consecutively, meaning that simply scanning from the end of the "secapp" region to its beginning using increments of half the trustlet's size will guarantee a successful match.

A (messy) write primitive

Now that we have a way to find the trustlet in the secure region, we are able to craft "valid" session pointers, which point to locations within the trustlet. Next up, we need to find a way to create a write primitive. So... are there any functions which write controllable data into a session pointer?

Surprisingly, nearly all functions that do write data to the session pointer do not allow for arbitrary control over the data being written. Nonetheless, one function looks like it could be of some help:

This function generates a random DWORD to be used as a "nonce", then checks if enough time elapsed since the previous time it was called. If so, it adds the random value to the session pointer by calling "addNonceToCache".

First of all, since the "time" field is saved in the global buffer after our overflown buffer, we can easily clear it using our overflow primitive, thus removing the time limitation and allowing us to call the function as frequently as we'd like. Also, note that the generated nonce's random value is written into a buffer which is returned to the user - this means that after a nonce is generated, the caller also learns the value of the nonce.

Let's take a peek at how the nonces are stored in the session pointer:

So there's an array of 16 nonces in the session object - starting at offset 0x88. Whenever a nonce is added, all the previous nonce values are "rolled over" one position to the right (discarding the last nonce), and the new nonce is written into the first location in the nonce array.

See where we're going with this? This is actually a pretty powerful write primitive (albeit a little messy)!

Whenever we want to write a value to a specific location, we can simply set the session pointer to point to that location (minus the offset of the nonces array). Then, we can start generating nonces, until the least-significant byte (this is a little-endian machine) in the generated nonce matches the byte we would like to write. Then, once we get a match, we can increment the session pointer by one, generate the next byte, and so forth.

This allows us to generate any arbitrary value with an expectancy of only 256 nonce-generation calls per byte (since this is a geometric random variable). But at what cost?

Since the values in the nonce cache are "rotated" after every call, this means that we mess-up the 15 DWORDs after the last written memory location. We'll have to work our way around that when we design the exploit.

Writing an exploit

We finally have enough primitives to craft a full exploit! All we need to do is find a value that we can overwrite using the messy write primitive, which will allow us to hijack the control flow of the application.

Let's take a look at the function in charge of handling the "6X" category of commands:

As you can see, the function calls the requested commands by using the command ID specified as an index into an array stored in the global buffer. Each supported command is represented by a 12-byte entry in the array, containing four pieces of information:

The command code (32-bits)
A pointer to the handling function itself (32-bits)
The minimal input length (16-bits)
The minimal output length (16-bits)

If this information is valid, the function pointer is executed, passing in the user's input buffer as the first argument and the output buffer as the second argument.

If we choose an innocuous 6X command, we can overwrite the corresponding entry in the array above so that its function pointer will be directed at any piece of code we'd like to execute. Then, simply calling this command will cause the trustlet to execute the code at our controlled memory location. Great!

We should be wary, however, not to choose a function which lies directly before an "important" command that we might need later. This is because our messy write primitive will destroy the following 15 DWORDs (or rather, the next 5 array entries). Let's take a look at the function which populates the entries in the command array:

There are six consecutive entries corresponding to unused functions. Therefore, if we choose to overwrite the entry directly before them, we'll stay out of trouble.

Universal Shellcode Machine

Although we can now hijack the control flow, we still can't quite execute arbitrary code within QSEE yet. The regular course of action at point would probably be to find a stack-pivot gadget and write a short ROP chain which will enable us to allocate shellcode - however, since the trustlets' code segments aren't writeable, and the TrustZone kernel doesn't expose any system call to QSEE to allow the allocation of new executable pages, we are left with no way to create executable shellcode.

So does this mean we need to write all our logic as a ROP chain? That would be extremely inconvenient (even with the aid of automatic "ROP"-compilers), and might even not be possible if the ROP gadgets in the trustlet are not Turing-Complete.

Luckily, after some careful consideration, we can actually avoid the need to write longer ROP chain. If we think of our shellcode as a Turing Machine, we would like to create a "Universal Turing Machine" (or simulator), which will enable us to execute any given shellcode as if it were running completely within QSEE.

Given a piece of code, we can easily simulate all the control-flow and logic in the "Normal World", simply by executing the code fully in the "Normal World". But what about operations which behave differently in a QSEE-context? If we think about it, there are only a few such operations:

Reading and writing memory
Calling system calls exposed by the TrustZone kernel

These operations must execute within QSEE. However, we can actually execute both of these operations in QSEE by writing one small ROP chain!

All we need is a single ROP chain which will:

Hijack control flow to a separate stack
Prepare arguments for a function call
Call the wanted QSEE function
Return the result to the user and restore execution in QSEE

As you can see, all this chain do is to enable us to execute any given QSEE function using any supplied arguments. But how can we use it to simulate the special operations?

Well, since all system-calls in QSEE have matching calling-stubs in each trustlet, we can use our ROP chain to execute any system call with ease. As for memory accesses - there is an abundance of QSEE functions which can be used as read and write gadgets. Hence, both operations are simple to execute using our short ROP chain.

This leaves us with the following model:

This also means that executing arbitrary shellcode in QSEE doesn't require any engineering effort! All the shellcode developer needs to do is to delegate memory accesses and system calls to specific APIs exposed by the exploit. The rest of the shellcode's logic can remain unchanged and execute completely in the "Normal World". We'll see an example of some shellcode using this model shortly.

Finding a stack pivot

In order to execute a ROP chain, we need to find a convenient stack-pivot gadget. When dealing with large or medium-sized applications, this is not a daunting task - there is simply enough code for us to find at least one gadget that we can use.

However, since we're only dealing with ~125KB of code, we might not be that lucky. Not only that, but at the point at which we hijack the control flow, we only have control over the registers R0 and R1, which point to the input and output buffers, respectively.

After fully disassembling the trustlet's code we are faced with the harsh truth - it seems as though there is no usable stack pivot using our controlled registers. So what can we do?

Recall that ARM opcodes can be decoded in more than one way, depending on the value of the T bit in the CPSR. When the bit is set, the processor is executing in "Thumb" mode, in which the instruction length is 16-bits. Otherwise, the processor is in "ARM" mode, with an instruction length of 32-bits.

We can easily switch between these modes by using the least-significant bit of the PC register when performing a jump. If the least-significant bit is set, the T bit will be set, and the processor will switch to "Thumb" mode. Otherwise, it will switch to ARM mode.

Looking at the trustlet's code - it seems to contain mostly "Thumb" instructions. But perhaps if we were to forcibly decode the instructions as if they were "ARM" instructions, we'd be able to find a hidden stack pivot which was not visible beforehand.

Indeed, that is the case! Searching through the ARM opcodes reveals a very convenient stack-pivot:

By executing this opcode, we will be able to fully control the stack pointer, program counter and other registers by using the values stored in R0 - which, as we saw above, points to the fully user-controlled input buffer. Great!

As for the rest of the ROP chain - it is pretty standard. In order to execute a function and return all we need to do is build a short chain which:

Sets the low registers (R0-R3) and the stack arguments to the function's arguments
Set the link register to point to the rest of our chain
Jump to the function's start address
When control is returned via the crafted LR value, store the return value in user-accessible memory location, such as the supplied output buffer
Restore the stack pointer to the original location and return to the location from which control was originally hijacked

You can find the complete ROP chain and gadgets in the provided exploit code, but I imagine it's exactly what you'd expect.

Putting it all together

At long last, we have all the pieces needed to create a fully functional exploit. Here's a short run-down of the exploit's stages:

Find the Widevine application by repeatedly "tapping" the secapp region and "listening"
Create a "messy" write gadget using the nonce-generation command
Overwrite an unused 6X command entry using the write gadget to direct it to a stack-pivot
Execute any arbitrary code using a small ROP chain under the "Universal Shellcode Machine"

The Exploit

As always, the full exploit code is available here:

https://github.com/laginimaineb/cve-2015-6639

I've also included a sample shellcode using the model described earlier. The shellcode reads a file from TrustZone's secure file-system - SFS. This file-system is encrypted using a special hardware key which should be inaccessible to software running on the device - you can read more about it here. Regardless, running within the "Secure World" allows us to access SFS fully, and even extract critical encryption keys, such as those used to decrypt DRM content.

In fact, this is all it takes:

Also, please note that there are quite a few small details that I did not go into in this blog post (for brevity’s sake, and to keep it interesting). However, every single detail is documented in the exploit's code. So by all means, if you have any unanswered questions regarding the exploit, I encourage you to take a look at the code and documentation.

What's next?

Although we have full code-execution within QSEE, there are still some things beyond our reach. Specifically, we are limited only to the API provided by the system-calls exposed by the TrustZone kernel. For example, if we were looking to unlock a bootloader, we would probably need to be able to blow the device's QFuses. This is, understandably, not possible from QSEE.

With that in mind, in the next blog post, we'll attempt to further elevate our privileges from QSEE to the TrustZone kernel!

Timeline

27.09.2015 - Vulnerability disclosed
27.09.2015 - Initial response from Google
01.10.2015 - PoC sent to Google
14.12.2015 - Vulnerability fixed, patch distributed

I would also like to mention that on 19.10.2015 I was notified by Google that this issue has already been internally discovered and reported by Qualcomm. However, for some reason, the fix was not applied to Nexus devices.

Moreover, there are quite a few firmware images for other devices (such as the Droid Turbo) that I've downloaded from that same time period that appeared to still contain the vulnerability! This suggests that there may have been a hiccup when internally reporting the vulnerability or when applying the fix.

Regardless, as Google has included the issue in the bulletin on 14.12.2015, any OEMs that may have missed the opportunity to patch the issue beforehand, got another reminder.

Bits, Please!

05/05/2016

War of the Worlds - Hijacking the Linux Kernel from QSEE

02/05/2016

QSEE privilege escalation vulnerability and exploit (CVE-2015-6639)