How to patch Python bytecode
In CPython, the reference implementation of Python, the source code of a script is first compiled into platform-independent bytecode, which then runs on Python's stack-based virtual machine.
Code objects
Code objects represent blocks of bytecode. According to the Python documentation, there are three types of blocks (namespaces): a module, a function body, and a class definition. Such objects are produced whenever a block of Python code is compiled, e.g., at startup or during execution.
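As a rough illustration of the three block types, the sketch below compiles a small source string with the built-in compile() and walks the code objects nested in co_consts. The source string and the names greet, Box, and walk are made up for this example.

```python
import types

# A tiny module containing a function body and a class definition
source = """
def greet(name):
    return "hello " + name

class Box:
    def get(self):
        return 1
"""

module_code = compile(source, "<example>", "exec")

def walk(code, depth=0):
    """Yield (depth, name) for a code object and every code object nested in its constants."""
    yield depth, code.co_name
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk(const, depth + 1)

blocks = list(walk(module_code))
for depth, name in blocks:
    print("  " * depth + name)
```

The module is the outermost block; the function body and the class body are stored as constants of the module's code object, and the method's code object is stored inside the class body in turn.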
You can access code objects from Python code. Let's create a function to play with:
def wadd(x, y=1):
    pow_n = 3
    result = (x + y) ** pow_n
    return abs(result)
The code object lives in the function's __code__ attribute. Let's explore its fields:
>>> for attr in dir(wadd.__code__):
...     if attr.startswith('co_'):
...         print("\t%s = %s" % (attr, getattr(wadd.__code__, attr)))
...
co_argcount = 2
co_cellvars = ()
co_code = b'd\x01\x00}\x02\x00|\x00\x00|\x01\x00\x17|\x02\x00\x13}\x03\x00t\x00\x00|\x03\x00\x83\x01\x00S'
co_consts = (None, 3)
co_filename = <stdin>
co_firstlineno = 1
co_flags = 67
co_freevars = ()
co_kwonlyargcount = 0
co_lnotab = b'\x00\x01\x06\x01\x0e\x01'
co_name = wadd
co_names = ('abs',)
co_nlocals = 4
co_stacksize = 2
co_varnames = ('x', 'y', 'pow_n', 'result')
To get an idea about these fields, you can read the documentation of the inspect module. Most of the fields are pretty self-explanatory, except for co_code and co_lnotab.
The co_code field contains the sequence of bytecode instructions. Since Python 3.6, each instruction occupies two bytes, one for the instruction code and one for its argument. The dump above comes from an older interpreter, where an instruction takes one byte, or three bytes if it has an argument. The co_lnotab field maps bytecode offsets to the corresponding line numbers in the source code.
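To see the line mapping without decoding co_lnotab by hand, one option is dis.findlinestarts, which yields (offset, line) pairs for a code object. This is only a sketch: the exact offsets depend on the interpreter version, so none are shown here, and wadd is redefined to keep the snippet self-contained.

```python
import dis

def wadd(x, y=1):
    pow_n = 3
    result = (x + y) ** pow_n
    return abs(result)

code = wadd.__code__
raw = code.co_code  # the raw instruction stream, a bytes object

# Pair each starting offset with its line number relative to the "def" line
# (1 is the "def" line itself); entries without a line number are skipped.
line_starts = [
    (offset, line - code.co_firstlineno + 1)
    for offset, line in dis.findlinestarts(code)
    if line is not None
]

print("co_code is %d bytes long" % len(raw))
for offset, rel_line in line_starts:
    print("offset %3d starts line %d of wadd" % (offset, rel_line))
```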
Bytecode disassembling
The built-in dis module comes in handy when you want to read bytecode in a human-readable format:
>>> import dis
>>> dis.dis(wadd)
  2           0 LOAD_CONST               1 (3)
              2 STORE_FAST               2 (pow_n)

  3           4 LOAD_FAST                0 (x)
              6 LOAD_FAST                1 (y)
              8 BINARY_ADD
             10 LOAD_FAST                2 (pow_n)
             12 BINARY_POWER
             14 STORE_FAST               3 (result)

  4          16 LOAD_GLOBAL              0 (abs)
             18 LOAD_FAST                3 (result)
             20 CALL_FUNCTION            1
             22 RETURN_VALUE
The first number is the corresponding line number in the source code (thanks to co_lnotab). The remaining three columns are the offset of the instruction in the bytecode, the instruction name, and the argument, followed by its human-readable representation in parentheses (if any). The offsets advance in steps of two because this listing was produced by Python 3.6 or later.
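The same columns can also be read programmatically. A minimal sketch, assuming Python 3.4 or later, where dis.get_instructions is available; wadd is redefined so the snippet runs on its own, and the instruction names you get depend on the interpreter version.

```python
import dis

def wadd(x, y=1):
    pow_n = 3
    result = (x + y) ** pow_n
    return abs(result)

instructions = list(dis.get_instructions(wadd))
for ins in instructions:
    # Columns: offset, instruction name, raw argument, human-readable argument
    arg = "" if ins.arg is None else ins.arg
    print("%4d %-22s %-6s %s" % (ins.offset, ins.opname, arg, ins.argrepr))
```

Each item is a named tuple, so fields such as ins.offset and ins.opname can be used to locate an instruction without counting bytes by hand.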
A complete list of CPython's instructions can be found in the documentation of the dis module. The actual implementation of each instruction is located in the ceval.c file.
Bytecode patching
Imagine that you have found a bug in someone else's module and you can't edit the module's files. One solution is to patch the bytecode at runtime!
All code objects are immutable, so we need to create a new one. For example, let's replace the add operator in our function:
import dis
from types import CodeType

def fix_function(func, payload):
    fn_code = func.__code__
    # This positional signature is the one used up to Python 3.7; on
    # Python 3.8 and later, fn_code.replace(co_code=payload) does the same job
    func.__code__ = CodeType(fn_code.co_argcount,
                             fn_code.co_kwonlyargcount,
                             fn_code.co_nlocals,
                             fn_code.co_stacksize,
                             fn_code.co_flags,
                             payload,
                             fn_code.co_consts,
                             fn_code.co_names,
                             fn_code.co_varnames,
                             fn_code.co_filename,
                             fn_code.co_name,
                             fn_code.co_firstlineno,
                             fn_code.co_lnotab,
                             fn_code.co_freevars,
                             fn_code.co_cellvars,
                             )
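payload = wadd.__code__.co_code
# Replace BINARY_ADD (0x17) with BINARY_SUBTRACT (0x18). Look the offset up
# instead of hard-coding it: it is 12 in the co_code dump above, but 8 in
# the two-byte format shown in the dis listing.
add_offset = next(ins.offset for ins in dis.get_instructions(wadd)
                  if ins.opname == 'BINARY_ADD')
subtract_opcode = bytes([dis.opmap['BINARY_SUBTRACT']])
payload = payload[:add_offset] + subtract_opcode + payload[add_offset + 1:]
wadd(3, 1)  # The result is: 64
# Now it's (x - y) instead of (x + y)
fix_function(wadd, payload)
wadd(3, 1)  # The result is: 8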
Moreover, you can change other fields too. For example, you can edit constants and arguments, or replace globals with locals. You can even create new statements.
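As one illustration of editing constants, the sketch below swaps a value in co_consts. It assumes Python 3.8 or later, where code objects have a replace() method that returns a modified copy; price_with_tax and replace_constant are hypothetical names invented for this example.

```python
def price_with_tax(net):
    rate = 0.5  # the "bug": the rate should have been 0.25
    return net * (1 + rate)

def replace_constant(func, old, new):
    """Rebuild func's code object with every constant equal to old swapped for new."""
    code = func.__code__
    consts = tuple(
        new if type(const) is type(old) and const == old else const
        for const in code.co_consts
    )
    # replace() returns a new code object; the original one is left untouched
    func.__code__ = code.replace(co_consts=consts)

before = price_with_tax(100)
replace_constant(price_with_tax, 0.5, 0.25)
after = price_with_tax(100)
print(before, after)  # 150.0 125.0
```

Because only co_consts changes, the instruction stream stays byte-for-byte the same, which makes this kind of patch less fragile than rewriting co_code.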
To simplify the process of editing bytecode, you can use special modules such as bytecode and codetransformer.