Python internals: Arbitrary-precision integer implementation
Have you ever noticed that Python supports integers of any size? Here is a quick glance at it.
Python represents all objects by C structures. The following data structure is responsible for all integer objects:
struct _longobject {
PyObject_VAR_HEAD
digit ob_digit[1];
} PyLongObject;
After macros expansions, a simplified version of the struct looks as follows:
struct {
ssize_t ob_refcnt;
struct _typeobject *ob_type;
ssize_t ob_size;
uint32_t ob_digit[1];
};
The ob_refcnt
field is responsible for reference counting technique which is used in garbage collection mechanism, whereas ob_type
is a pointer to a structure which describes an integer type.
Generally, In languages like C/C++, the precision of integers is limited to 64-bit, but Python has built-in support for Arbitrary-precision integers. Since Python 3 there is no longer simple integer type, and all integers are represented as a bignum.
How to store arbitrarily large integer?
One of the solutions is to represent the integer as an array of digits. To do it in an efficient way we need to convert our number from base 10 to base $2^{30}$ numeral system, where each element represents a single 'digit' in a range from $0$ to $2^{30}-1$. Depending on the platform, Python uses either 32-bit unsigned integer arrays with 30-bit digits or 16-bit unsigned integer arrays with 15-bit digits. Using such approach introduces additional requirements, that's why we can't use all bits of the integer. The ob_digit
field in a structure above is responsible for such arrays.
To eliminate unnecessary computation CPython has a "fast path" implementation for integers in a range of $-2^{30}$ to $2^{30}$. Such integers are stored as an array of one element and if it's possible treated as fixed 32-bit integers.
Note that, unlike classical approach, the sign of an integer is stored separately in ob_size
field. This field stores the size of the ob_digit
array. So if you want to change the sign of an array of size 2 you need to set ob_size
to -2.
The comment from the source code describes it as follows:
/* Long integer representation.
The absolute value of a number is equal to
SUM(for i=0 through abs(ob_size)-1) ob_digit[i] * 2**(SHIFT*i)
Negative numbers are represented with ob_size < 0;
zero is represented by ob_size == 0.
In a normalized number, ob_digit[abs(ob_size)-1] (the most significant
digit) is never zero. Also, in all cases, for all valid i,
0 <= ob_digit[i] <= MASK.
The allocation function takes care of allocating extra memory
so that ob_digit[0] ... ob_digit[abs(ob_size)-1] are actually available.
CAUTION: Generic code manipulating subtypes of PyVarObject has to aware that integers abuse ob_size's sign bit.
*/
The representation of 123456789101112131415 will look as follows:
ob_size | 3 | ||
ob_digit | 437976919 | 87719511 | 107 |
Algorithm of converting our representation back:
Optimization of commonly-used integers
Small integer objects in a range of -5 to 256 are always pre-allocated during initialization. Because Python integers are immutable, we can use them as singletons. Every time you need to create small integer instead of creating new object Python just points to already allocated one. Thereby it saves a lot of space and computation for commonly-used integers.
Interestingly enough, the PyLongObject
structure takes at least 28 bytes for every allocated integer and therefore takes three times as much memory as a simple 64-bit C integer.
Number system conversion
Let's see how Python converts integers to an array.
We are going to look at the simple example where we need to convert a long unsigned 64-bit integer to Python's integer representation. Note that it's the biggest possible standard C type that can be converted to our PyLongObject
directly. Larger numbers can be converted from strings or byte arrays.
If you are not familiar with number system conversion, you can read this tutorial.
Here is a simplified algorithm translated from C to Python:
SHIFT = 30 # number of bits for each 'digit'
MASK = (2 ** SHIFT)
bignum = 18446744073709551615
def split_number(bignum):
t = abs(bignum)
num_list = []
while t != 0:
# Get remainder from division
small_int = t % MASK # more efficient bitwise analogue: (t & (MASK-1))
num_list.append(small_int)
# Get integral part of the division (floor division)
t = t // MASK # more efficient bitwise analogue: t >>= SHIFT
return num_list
def restore_number(num_list):
bignum = 0
for i, n in enumerate(num_list):
bignum += n * (2 ** (SHIFT * i))
return bignum
num_list = split_number(bignum)
assert bignum == restore_number(num_list)
To verify our solution we can also take a look at the internal representation:
import ctypes
class PyLongObject(ctypes.Structure):
_fields_ = [("ob_refcnt", ctypes.c_long),
("ob_type", ctypes.c_void_p),
("ob_size", ctypes.c_ulong),
("ob_digit", ctypes.c_uint * 3)]
bignum = 18446744073709551615
for d in PyLongObject.from_address(id(bignum)).ob_digit:
print(d)
Arithmetic
Basic arithmetic operations are implemented using simple school math with only one difference — every element of an array is treated like a single "digit".
Every operation results in a minimum one object creation. For example, when you want to execute c += 10
roughly these steps are performed:
- Get an address of an already allocated object for
10
(because it's a small number we don't have to create a new object, see above). - Create a new integer object so we can store the result of addition in it.
- Add
c
and10
together. Store the result in the newly created object. - Replace variable
c
with reference to our new object. - Decrease the reference count of an old
c
so garbage collector can destroy this object in the future.
For example, let's take a look at addition, where carrying is used:
def add_bignum(a, b):
z = []
if len(a) < len(b):
# Ensure a is the larger of the two
a, b = b, a
carry = 0
for i in range(0, len(b)):
carry += a[i] + b[i]
z.append(carry % MASK)
carry = carry // MASK
for i in range(i + 1, len(a)):
carry += a[i]
z.append(carry % MASK)
carry = carry // MASK
z.append(carry)
# remove trailing zeros
i = len(z)
while i > 0 and z[i-1] == 0:
i -= 1
z = z[0:i]
return z
a = 8223372036854775807
b = 100037203685477
assert restore_number(add_bignum(split_number(a), split_number(b))) == a + b
Diving deeper
There are a lot of small details that can't be described in one article. You can read more about integers in CPython (1, 2, 3) source code.
I have plans to continue this series, in the next posts I will describe internals of class and function objects.
Comments
- Artem 2017-09-23 #
Thanks, I've fixed the bug. Original implementation uses subtraction function for negative addition.
- mingrammer 2017-09-23 #
Thanks for great article.
So, I wanna translate it into korean on my personal blog
Can I do it? :) (Surely, I will link this original post)
Thanks
- motleytech 2017-09-24 #
Good article.
It might be interesting to note the Karatsuba multiplication method which is implemented for these integers. Its much faster than naive multiplication (when dealing with large numbers).
- Régis 2017-10-04 #
There's a small typo: "So if you want to change the sign of an array of size 2 you need to set ob_digit to -2" should read "So if you want to change the sign of an array of size 2 you need to set ob_size to -2"
- Влад Токарев 2018-07-20 #
Great! It's very useful kind of articles that about internals of some software... Thank you! Write else! :-)
- debashish palit 2018-07-30 #
Nice article. One thing I didn't get is the "remove trailing zeros" bit in
add_bignum
function. Going by your Python code,carry
can be either 0 or 1. So, only one trailing zero may be present. So, is thewhile
loop needed?- Artem 2018-08-02 #
Good catch! In this case yes, it's not needed. Original C code uses a helper function with a while loop, but it's also used in other functions where more than 1 zeros can happen.
- roachsinai 2019-04-04 #
Thanks for this very helpful post! But could you tell me more detail about "To eliminate unnecessary computation CPython has a "fast path" implementation for integers in a range of 2^(-30) - 2^30? In my mind python only story integer -5~257.
I found a bug in the last line of function
add_bignum
. The final assert fails for example for numbersa=2**30 - 1; b=1
. It should also be noted that this function has been simplified to work only for non-negative numbers.