How pickle works in Python
The pickle module implements a serialization protocol that lets you save Python objects in a special binary format and load them later. Unlike json, pickle is not limited to simple objects. It can also store references to functions and classes, as well as the state of class instances.
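As a quick illustration of that difference, the sketch below feeds the same object to both modules. Fraction is used only because it is a regular class from the standard library; any other class instance would behave similarly.

```python
import json
import pickle
from fractions import Fraction

value = Fraction(1, 3)  # an instance of a regular Python class

# json only understands basic types (dict, list, str, numbers, bool, None),
# so the instance is rejected with a TypeError.
try:
    json.dumps(value)
    json_ok = True
except TypeError:
    json_ok = False

# pickle records which class to use and the instance state.
restored = pickle.loads(pickle.dumps(value))

# Functions and classes are pickled by reference (module and name),
# so unpickling gives back the very same object.
same_function = pickle.loads(pickle.dumps(len)) is len
same_class = pickle.loads(pickle.dumps(Fraction)) is Fraction

print(json_ok, restored == value, same_function, same_class)
```

Note that "by reference" means the pickle contains only the module and name, so the function or class must be importable wherever the data is loaded.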
Before we start, it is worth mentioning that Python 2 ships two versions of the module: pickle and cPickle. The latter is faster and implements the same algorithm, but in C. The downside is that you cannot subclass its Pickler and Unpickler. In Python 3, the accelerated version (the _pickle module) is imported automatically when it is available.
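To see which implementation is in use, something like the following should work on CPython 3. It relies on implementation details (the _pickle module and the underscore-prefixed pickle._Pickler class), so treat it as a sketch rather than a supported API.

```python
import io
import pickle

# pickle.py keeps its pure-Python classes under underscore-prefixed names.
pure_python_pickler = pickle._Pickler

try:
    import _pickle  # C accelerator module in CPython
except ImportError:
    _pickle = None  # other interpreters may not ship it

# When the accelerator is available, the public name points at the C class.
accelerated = _pickle is not None and pickle.Pickler is _pickle.Pickler
print("C accelerator in use:", accelerated)

# The pure-Python pickler produces data that the default loader understands.
buffer = io.BytesIO()
pure_python_pickler(buffer, protocol=3).dump([1, 2, 3])
print(pickle.loads(buffer.getvalue()))
```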
Pickle example
import pickle
import pickletools

class Node:
    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, obj):
        self.children.append(obj)

node = Node({"int": 1, "float": 2.0})
data = pickle.dumps(node)
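The binary output looks as follows (this is protocol 3, the default in Python 3.0 through 3.7; newer versions default to a higher protocol and produce different bytes):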
>>> data
b'\x80\x03c__main__\nNode\nq\x00)\x81q\x01}q\x02(X\x08\x00\x00\x00childrenq\x03]q\x04X\x04\x00\x00\x00dataq\x05}q\x06(X\x03\x00\x00\x00intq\x07K\x01X\x05\x00\x00\x00floatq\x08G@\x00\x00\x00\x00\x00\x00\x00uub.'
We can use pickletools to convert it to a human-readable format:
>>> pickletools.dis(data)
0: \x80 PROTO 3
2: c GLOBAL '__main__ Node'
17: q BINPUT 0
19: ) EMPTY_TUPLE
20: \x81 NEWOBJ
21: q BINPUT 1
23: } EMPTY_DICT
24: q BINPUT 2
26: ( MARK
27: X BINUNICODE 'children'
40: q BINPUT 3
42: ] EMPTY_LIST
43: q BINPUT 4
45: X BINUNICODE 'data'
54: q BINPUT 5
56: } EMPTY_DICT
57: q BINPUT 6
59: ( MARK
60: X BINUNICODE 'int'
68: q BINPUT 7
70: K BININT1 1
72: X BINUNICODE 'float'
82: q BINPUT 8
84: G BINFLOAT 2.0
93: u SETITEMS (MARK at 59)
94: u SETITEMS (MARK at 26)
95: b BUILD
96: . STOP
highest protocol among opcodes = 2
Serialization algorithm
Internally, a pickle is a program for a stack-based virtual machine called the pickle machine (PM). The name and format can be confusing, but pickle is actually based on a simple concept.
A pickle (the byte stream) is a sequence of opcodes, each optionally followed by an argument. Opcodes are executed once each, from left to right, until a STOP opcode is reached. To store intermediate results, the pickle machine uses two data structures: a stack (based on a list) and a memo (which can be based on a list or a dictionary).
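The opcode stream can also be walked programmatically. The sketch below uses pickletools.genops, which yields one (opcode, argument, position) triple per opcode, and shows that some opcodes carry an argument while others do not. Protocol 3 is requested explicitly so the stream matches the examples in this article.

```python
import pickle
import pickletools

data = pickle.dumps([1, 2], protocol=3)

# Collect the opcode name and its argument (None when there is no argument).
ops = [(op.name, arg) for op, arg, _pos in pickletools.genops(data)]

for name, arg in ops:
    print(name, arg)
```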
To get an idea, let's start with a simple example:
>>> pickle.dumps([1,2,3,4])
b'\x80\x03]q\x00(K\x01K\x02K\x03K\x04e.'
>>> pickletools.dis(_)
0: \x80 PROTO 3
2: ] EMPTY_LIST
3: q BINPUT 0
5: ( MARK
6: K BININT1 1
8: K BININT1 2
10: K BININT1 3
12: K BININT1 4
14: e APPENDS (MARK at 5)
15: . STOP
highest protocol among opcodes = 2
Here PROTO indicates the version of the protocol, which you can change for compatibility with older Python versions. The EMPTY_LIST opcode creates an empty Python list and pushes it on the stack. BINPUT stores the object on top of the stack in the memo under the given index, so that later references to the same object can reuse it. The MARK opcode pushes a special marker object. In our particular case, it indicates where the list items start on the stack.
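The protocol version is selected with the protocol argument of pickle.dumps. A small sketch comparing the oldest text-based protocol with a binary one:

```python
import pickle

values = [1, 2, 3, 4]

# Protocol 0 is the original text-based format; protocols 2 and later
# start with the PROTO opcode (byte 0x80) followed by the version number.
text_pickle = pickle.dumps(values, protocol=0)
binary_pickle = pickle.dumps(values, protocol=2)

print(text_pickle)
print(binary_pickle)
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

# Every supported protocol round-trips to the same value.
round_trips = [
    pickle.loads(pickle.dumps(values, protocol=proto))
    for proto in range(pickle.HIGHEST_PROTOCOL + 1)
]
```

The loader detects the protocol on its own, so pickle.loads needs no matching argument.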
The BININT1 opcode reads a one-byte unsigned integer from the stream and pushes it onto the stack. The pickle machine does not know the number of items in the list in advance, so it keeps pushing values onto the stack until a different opcode is reached.
The APPENDS opcode takes all the objects from the top of the stack down to (but not including) the topmost marker object and appends them to the list that sits just below the marker.
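To make the walkthrough concrete, here is a toy interpreter that supports only the opcodes from the list example. It is a sketch, not how pickle.py is structured: run_tiny_pm and the MARK sentinel are made-up names, and decoding is delegated to pickletools.genops.

```python
import pickle
import pickletools

MARK = object()  # sentinel standing in for pickle's internal mark object


def run_tiny_pm(data):
    """Execute only the handful of opcodes used by the list example."""
    stack, memo = [], {}
    for op, arg, _pos in pickletools.genops(data):
        if op.name == "PROTO":
            pass  # protocol version, nothing to push
        elif op.name == "EMPTY_LIST":
            stack.append([])
        elif op.name == "BINPUT":
            memo[arg] = stack[-1]  # remember the top of the stack
        elif op.name == "MARK":
            stack.append(MARK)
        elif op.name == "BININT1":
            stack.append(arg)
        elif op.name == "APPENDS":
            # Pop everything above the topmost mark, then the mark itself.
            items = []
            while stack[-1] is not MARK:
                items.append(stack.pop())
            stack.pop()
            stack[-1].extend(reversed(items))
        elif op.name == "STOP":
            return stack.pop()
        else:
            raise ValueError(f"opcode {op.name} is not supported by this sketch")


print(run_tiny_pm(pickle.dumps([1, 2, 3, 4], protocol=3)))
```

The memo is filled but never read here, because the example contains no repeated objects; a real unpickler reads it back with opcodes such as BINGET.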
Python implementation of APPENDS (from pickle.py):
def load_appends(self):
    items = self.pop_mark()
    list_obj = self.stack[-1]
    try:
        extend = list_obj.extend
    except AttributeError:
        pass
    else:
        extend(items)
        return
    # Even if the PEP 307 requires extend() and append() methods,
    # fall back on append() if the object has no extend() method
    # for backward compatibility.
    append = list_obj.append
    for item in items:
        append(item)
But what about other objects? How does it work for a dictionary, for example? Instead of pushing a single value per item, pickle pushes each key followed by its value, and the SETITEMS opcode consumes them in pairs:
def load_setitems(self):
    items = self.pop_mark()
    dict = self.stack[-1]
    for i in range(0, len(items), 2):
        dict[items[i]] = items[i + 1]
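The pairing logic is easier to follow on a plain list. The sketch below mimics the stack right before SETITEMS runs for the inner dictionary of the first example; pop_mark, setitems, and MARK are hypothetical stand-ins for the unpickler's internals.

```python
MARK = object()  # hypothetical sentinel, standing in for pickle's mark


def pop_mark(stack):
    """Remove the topmost MARK and return everything that was above it."""
    index = len(stack) - 1
    while stack[index] is not MARK:
        index -= 1
    items = stack[index + 1:]
    del stack[index:]
    return items


def setitems(stack):
    """Mimic SETITEMS: store key/value pairs in the dict below the mark."""
    items = pop_mark(stack)
    target = stack[-1]
    for i in range(0, len(items), 2):
        target[items[i]] = items[i + 1]


# Stack contents right before SETITEMS for {"int": 1, "float": 2.0}
stack = [{}, MARK, "int", 1, "float", 2.0]
setitems(stack)
print(stack)
```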
How pickle stores class instances
To serialize a class instance, we need to know its class name and its state (i.e., data attributes). In some languages, this requires a complicated class traversing algorithm. However, in Python, all instance attributes (except for those declared in __slots__) are stored in a dictionary.
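A short sketch of that distinction, using a Node class like the one above and a hypothetical SlottedNode variant:

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.children = []


class SlottedNode:
    __slots__ = ("data",)

    def __init__(self, data):
        self.data = data


node = Node({"int": 1})
slotted = SlottedNode({"int": 1})

# Regular instances keep their attributes in a per-instance dictionary.
print(node.__dict__)

# Slotted instances have no __dict__; the attribute lives in a slot instead.
print(hasattr(slotted, "__dict__"))
```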
Every object inherits default __reduce__ and __reduce_ex__ methods from object. They return everything needed to rebuild the instance: a callable that creates the object, its arguments (including the class), and the state (the attribute dictionary and any slots).
Let's restore our Node instance (from the first example) by hand:
# Get the reduce tuple using protocol 3
constructor, args, state, _, _ = node.__reduce_ex__(3)
# Create an empty instance; for Node, args is (Node,),
# so this is equivalent to Node.__new__(Node)
restored = constructor(*args)
# Fill in the instance's dictionary
restored.__dict__.update(state)
print(restored.data)
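The same steps can be wrapped in a small self-contained helper. This is only a sketch: rebuild is a hypothetical name, and it handles just plain instances whose state is a dictionary (no __slots__, no custom __setstate__, no list or dict items).

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.children = []


def rebuild(obj, protocol=3):
    """Hypothetical helper: copy obj using only its reduce tuple."""
    constructor, args, state, _listitems, _dictitems = obj.__reduce_ex__(protocol)
    clone = constructor(*args)    # creates the instance without calling __init__
    clone.__dict__.update(state)  # roughly what BUILD does without __setstate__
    return clone


node = Node({"int": 1, "float": 2.0})
clone = rebuild(node)

print(type(clone).__name__, clone.data, clone.children)
print(clone is node)
```

Note that __init__ is never called on the clone, which is also true when pickle restores an instance this way.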
Why pickle is not secure
The documentation of the pickle module states:
The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
Pickle has a REDUCE opcode, which was intended for custom object reconstruction but can be used for evil. It pops a callable and a tuple of arguments from the stack and immediately calls the callable with them. Unfortunately, there are no safety checks.
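A harmless way to watch REDUCE at work is a class whose __reduce__ returns an innocent function call. The Sum class below is a made-up example; the point is that unpickling returns whatever the call returns.

```python
import operator
import pickle
import pickletools


class Sum:
    """Hypothetical class whose pickle is just a function call."""

    def __reduce__(self):
        # (callable, argument tuple): pickle stores both and emits REDUCE.
        return (operator.add, (1, 2))


data = pickle.dumps(Sum(), protocol=3)
opcode_names = [op.name for op, _arg, _pos in pickletools.genops(data)]
print("REDUCE" in opcode_names)

# Unpickling calls operator.add(1, 2) and returns its result,
# not a Sum instance.
result = pickle.loads(data)
print(result)
```

Replace operator.add with any other importable callable and the loader will call that instead, which is exactly what an attacker exploits.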
This is how you can call eval with arbitrary code in it:
>>> payload = b"c__builtin__\neval\n(S'print(123)'\ntR."
>>> pickletools.dis(payload)
0: c GLOBAL '__builtin__ eval'
18: ( MARK
19: S STRING 'print(123)'
33: t TUPLE (MARK at 18)
34: R REDUCE
35: . STOP
highest protocol among opcodes = 0
>>> pickle.loads(payload)
123
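One mitigation described in the pickle documentation is to subclass pickle.Unpickler and override find_class so that only an explicit allowlist of globals can be loaded. The sketch below follows that idea; ALLOWED_GLOBALS and restricted_loads are names chosen for this example. It narrows the attack surface but should not be mistaken for a full sandbox.

```python
import io
import pickle

# Assumption for this sketch: only these (module, name) pairs may be loaded.
ALLOWED_GLOBALS = {("builtins", "set"), ("builtins", "frozenset")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle refers to a global (e.g. via GLOBAL).
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but refuses globals outside ALLOWED_GLOBALS."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


payload = b"c__builtin__\neval\n(S'print(123)'\ntR."

try:
    restricted_loads(payload)
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print("blocked:", exc)

# Plain containers need no globals at all, so they still load.
print(restricted_loads(pickle.dumps([1, 2, 3])))
```

Because the lookup fails before REDUCE is reached, eval is never called.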
More about pickle
If you want to understand more, see pickletools.py for extensive comments about the protocol, and pickle.py for implementation details.