How pickle works in Python

Last updated on January 20, 2018, in python

The pickle module implements a serialization protocol that lets you save Python objects in a special binary format and load them later. Unlike json, pickle is not limited to simple objects: it can also store references to functions and classes, as well as the state of class instances.
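
To make the contrast with json concrete, here is a minimal sketch. The objects used (a `Fraction` instance and `math.sqrt`) are arbitrary standard-library picks chosen for illustration; pickle does not treat them specially.

```python
import json
import math
import pickle
from fractions import Fraction

value = {"ratio": Fraction(3, 4), "func": math.sqrt}

# json only understands basic types, so it rejects the Fraction instance.
try:
    json.dumps(value)
    json_ok = True
except TypeError:
    json_ok = False

# pickle stores the instance state and a reference to the function.
restored = pickle.loads(pickle.dumps(value))

print(json_ok)
print(restored["ratio"])
print(restored["func"] is math.sqrt)
```

Note that the function is stored by reference (module and name), not by value, so the same module has to be importable when the data is loaded.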

Before we start, it is worth mentioning that Python 2 ships two versions of the module: pickle and cPickle. The latter implements the same algorithm in C and is faster. The downside is that you cannot subclass its classes. In Python 3, the accelerated version is imported automatically whenever it is available.

Pickle example

import pickle
import pickletools

class Node:
    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, obj):
        self.children.append(obj)


node = Node({"int": 1, "float": 2.0})
data = pickle.dumps(node)

The binary output looks as follows:

>>> data
b'\x80\x03c__main__\nNode\nq\x00)\x81q\x01}q\x02(X\x08\x00\x00\x00childrenq\x03]q\x04X\x04\x00\x00\x00dataq\x05}q\x06(X\x03\x00\x00\x00intq\x07K\x01X\x05\x00\x00\x00floatq\x08G@\x00\x00\x00\x00\x00\x00\x00uub.'

We can use pickletools to convert it to a human-readable format:

>>> pickletools.dis(data)
    0: \x80 PROTO      3
    2: c    GLOBAL     '__main__ Node'
   17: q    BINPUT     0
   19: )    EMPTY_TUPLE
   20: \x81 NEWOBJ
   21: q    BINPUT     1
   23: }    EMPTY_DICT
   24: q    BINPUT     2
   26: (    MARK
   27: X        BINUNICODE 'children'
   40: q        BINPUT     3
   42: ]        EMPTY_LIST
   43: q        BINPUT     4
   45: X        BINUNICODE 'data'
   54: q        BINPUT     5
   56: }        EMPTY_DICT
   57: q        BINPUT     6
   59: (        MARK
   60: X            BINUNICODE 'int'
   68: q            BINPUT     7
   70: K            BININT1    1
   72: X            BINUNICODE 'float'
   82: q            BINPUT     8
   84: G            BINFLOAT   2.0
   93: u            SETITEMS   (MARK at 59)
   94: u        SETITEMS   (MARK at 26)
   95: b    BUILD
   96: .    STOP
highest protocol among opcodes = 2
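
Most of the BINPUT opcodes in the listing above store objects in the memo that are never read back. As a small aside, the standard `pickletools.optimize` function can strip such unused entries. The sketch below uses a plain dictionary and protocol 3 (both chosen only to mirror the example above) and lists the remaining opcode names with `pickletools.genops`:

```python
import pickle
import pickletools

obj = {"int": 1, "float": 2.0}
data = pickle.dumps(obj, protocol=3)

# Drop BINPUT opcodes whose memo slots are never fetched again.
optimized = pickletools.optimize(data)

# genops yields (opcode, argument, position) for each opcode in the stream.
opcode_names = [op.name for op, arg, pos in pickletools.genops(optimized)]

print(len(data), len(optimized))
print(opcode_names)
print(pickle.loads(optimized) == obj)
```

The optimized pickle is shorter but loads to an equal object.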

Serialization algorithm

Internally, a pickle is a program for a stack-based virtual machine called the pickle machine (PM). The name and the format can be confusing, but pickle is actually based on a simple concept.

A pickle (the byte stream) is a sequence of opcodes, each followed by at most one argument. Opcodes are executed once each, from left to right. To store intermediate results, pickle uses two data structures: a stack (based on a list) and a memo (which can be based on a list or a dictionary).
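
To see the memo at work, here is a small sketch that pickles a list containing the same inner list twice. BINPUT stores the object on top of the stack in the memo, and BINGET pushes a memoized object back onto the stack, so the second occurrence is written as a reference instead of a copy. The variable names and the choice of protocol 3 are assumptions made for this illustration.

```python
import pickle
import pickletools

shared = [1]
data = pickle.dumps([shared, shared], protocol=3)

# genops yields (opcode, argument, position) for every opcode in the stream.
ops = [(op.name, arg) for op, arg, pos in pickletools.genops(data)]
for name, arg in ops:
    print(name, arg)

# Because of the memo, both elements are the same object after loading.
restored = pickle.loads(data)
print(restored[0] is restored[1])
```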

To get an idea, let's start with a simple example:

>>> pickle.dumps([1,2,3,4])
b'\x80\x03]q\x00(K\x01K\x02K\x03K\x04e.'
>>> pickletools.dis(_)
    0: \x80 PROTO      3
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: K        BININT1    1
    8: K        BININT1    2
   10: K        BININT1    3
   12: K        BININT1    4
   14: e        APPENDS    (MARK at 5)
   15: .    STOP
highest protocol among opcodes = 2

Here PROTO indicates the version of the protocol, which you can change for compatibility with older Python versions. The EMPTY_LIST opcode creates an empty Python list and pushes it on the stack. The MARK opcode is used as a special marker. In our particular case, it indicates the start of the list on the stack.
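
As a quick sketch of how the protocol version can be changed, `pickle.dumps` accepts a `protocol` argument. The loop below pickles the same list with every protocol supported by the running interpreter and checks whether the output starts with the PROTO opcode (byte 0x80), which only exists in protocol 2 and later:

```python
import pickle

values = [1, 2, 3, 4]

for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    data = pickle.dumps(values, protocol=protocol)
    # Protocols 2 and later begin with PROTO (0x80) and the version byte.
    has_proto = data[:1] == b"\x80"
    print(protocol, has_proto, len(data))
```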

The BININT1 opcode reads a one-byte unsigned integer from the stream and pushes it onto the stack. The pickle does not record the number of items in the list in advance, so values keep being pushed onto the stack until an opcode that consumes them (here APPENDS) is reached.

The APPENDS opcode takes all the objects from the top of the stack down to (but not including) the topmost marker object and appends them to the list that sits just below the marker.

Python implementation of APPENDS:

    def load_appends(self):
        items = self.pop_mark()
        list_obj = self.stack[-1]
        try:
            extend = list_obj.extend
        except AttributeError:
            pass
        else:
            extend(items)
            return
        # Even if the PEP 307 requires extend() and append() methods,
        # fall back on append() if the object has no extend() method
        # for backward compatibility.
        append = list_obj.append
        for item in items:
            append(item)

But what about other objects? How does it work for a dictionary, for example? Well, instead of pushing one value per item, pickle pushes a key followed by its value, and SETITEMS stores the pairs in the dictionary:

    def load_setitems(self):
        items = self.pop_mark()
        dict = self.stack[-1]
        for i in range(0, len(items), 2):
            dict[items[i]] = items[i + 1]
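
Putting the pieces together, the following is a toy interpreter for a small subset of opcodes. It is a sketch for illustration, not the real unpickler: the function name `run_toy_pm` and the selection of opcodes are assumptions of this example, and decoding of the byte stream is delegated to `pickletools.genops`.

```python
import pickle
import pickletools


def run_toy_pm(data):
    """Execute a tiny subset of pickle opcodes with a stack and a memo."""
    mark = object()  # sentinel pushed by MARK
    stack, memo = [], {}

    def pop_mark():
        # Collect everything above the topmost marker, then drop the marker.
        items = []
        while stack[-1] is not mark:
            items.append(stack.pop())
        stack.pop()
        items.reverse()
        return items

    for opcode, arg, _pos in pickletools.genops(data):
        name = opcode.name
        if name == "PROTO":
            continue
        elif name == "EMPTY_LIST":
            stack.append([])
        elif name == "EMPTY_DICT":
            stack.append({})
        elif name == "MARK":
            stack.append(mark)
        elif name in ("BININT1", "BINFLOAT", "BINUNICODE"):
            stack.append(arg)  # genops has already decoded the argument
        elif name == "BINPUT":
            memo[arg] = stack[-1]
        elif name == "BINGET":
            stack.append(memo[arg])
        elif name == "APPEND":
            value = stack.pop()
            stack[-1].append(value)
        elif name == "APPENDS":
            items = pop_mark()
            stack[-1].extend(items)
        elif name == "SETITEM":
            value = stack.pop()
            key = stack.pop()
            stack[-1][key] = value
        elif name == "SETITEMS":
            items = pop_mark()
            for i in range(0, len(items), 2):
                stack[-1][items[i]] = items[i + 1]
        elif name == "STOP":
            return stack.pop()
        else:
            raise NotImplementedError(name)


sample = {"children": [], "data": {"int": 1, "float": 2.0}, "items": [1, 2, 3, 4]}
print(run_toy_pm(pickle.dumps(sample, protocol=3)) == sample)
```

Protocol 3 is passed explicitly because newer protocols use additional opcodes (for example FRAME and MEMOIZE) that this sketch does not handle.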

How pickle stores class instances

To serialize a class instance, we need to know its class and its state (i.e., data attributes). In some languages, this requires a complicated class traversal algorithm. However, in Python, all instance attributes (except for those declared in __slots__) are stored in a dictionary.

Every object inherits generic __reduce__ and __reduce_ex__ methods, which return all the data needed to rebuild it (i.e., the object constructor with its arguments, including the class, and the state: the attribute dictionary and slots).

Let's restore our Node instance (from the first example) by hand:

# Get the reconstruction data using protocol 3
constructor, args, state, _, _ = node.__reduce_ex__(3)
# Create an empty instance; args is (Node,),
# so this is equivalent to Node.__new__(Node)
node = constructor(*args)
# Fill in the instance's dictionary
instance_dict = node.__dict__
for k, v in state.items():
    instance_dict[k] = v
print(node.data)
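
A slightly more general sketch of the same idea is shown below. It also covers objects that keep their attributes in __slots__, for which the state is a pair of (dictionary state, slot state). The `Point` class and the `rebuild` helper are hypothetical names introduced for this example; the logic loosely follows what the BUILD opcode does, but it is not the actual pickle implementation.

```python
class Point:
    """Hypothetical example class that stores its attributes in __slots__."""
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


def rebuild(reduced):
    """Recreate an object from the tuple returned by __reduce_ex__()."""
    # Use the first five items: callable, args, state, list items, dict items.
    # Shorter tuples are padded with None.
    func, args, state, listitems, dictitems = (tuple(reduced) + (None,) * 5)[:5]
    obj = func(*args)
    if state is not None:
        setstate = getattr(obj, "__setstate__", None)
        if setstate is not None:
            setstate(state)
        else:
            slotstate = None
            if isinstance(state, tuple) and len(state) == 2:
                state, slotstate = state
            if state:
                obj.__dict__.update(state)
            if slotstate:
                for key, value in slotstate.items():
                    setattr(obj, key, value)
    if listitems is not None:
        obj.extend(listitems)
    if dictitems is not None:
        for key, value in dictitems:
            obj[key] = value
    return obj


point = rebuild(Point(1, 2).__reduce_ex__(3))
print(point.x, point.y)
```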

Why pickle is not secure

The documentation of the pickle module states:

The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

Pickle has a REDUCE opcode, which was intended for custom object reconstruction but can be used for evil. It takes a callable and a tuple of arguments from the stack and immediately calls the callable with those arguments. Unfortunately, there are no safety checks.

This is how you can call eval with arbitrary code in it:

>>> payload = b"c__builtin__\neval\n(S'print(123)'\ntR."
>>> pickletools.dis(payload)
    0: c    GLOBAL     '__builtin__ eval'
   18: (    MARK
   19: S        STRING     'print(123)'
   33: t        TUPLE      (MARK at 18)
   34: R    REDUCE
   35: .    STOP
highest protocol among opcodes = 0
>>> pickle.loads(payload)
123
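
One way to reduce the risk is to restrict which globals a pickle may look up, by subclassing `pickle.Unpickler` and overriding its `find_class` method, as the pickle documentation suggests. The sketch below is a minimal example of that approach. The allow-list and the `restricted_loads` helper are assumptions made for this illustration, and this is a mitigation, not a guarantee that untrusted data becomes safe.

```python
import io
import pickle

# Hypothetical allow-list for this sketch; adjust it to the globals you trust.
ALLOWED_GLOBALS = {("builtins", "range")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle asks for a global (class or function).
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads(), but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


payload = b"c__builtin__\neval\n(S'print(123)'\ntR."
try:
    restricted_loads(payload)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(blocked)
print(restricted_loads(pickle.dumps([1, 2, 3])))
```

The eval payload is rejected before REDUCE can run, while plain data made of built-in types still loads.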

More about pickle

If you want to understand more, see pickletools.py for extensive comments about the protocol, and pickle.py for implementation details.

Comments

  • Gael Varoquaux 2018-01-10 #

    Good article, but it's a pity that it's written for Python 2. In addition, Pickle is getting great improvements in Python 3.

    • Artem 2018-01-10 #

      It was written using source code from Python 3.7, but with examples using an old Python 3.5.3 REPL.

  • Muhammad Ramdhan 2018-02-18 #

    what does opcode binput do?

    • Artem 2018-02-18 #

      It puts previous item (empty list) in the memo at position zero so it can be easily accessed by index.

  • VP 2020-10-25 #

    Thanks for the article.
