Python internals: Arbitrary-precision integer implementation

Last updated on September 19, 2017, in python

Have you ever noticed that Python supports integers of any size? Here is a quick glance at it.

Python represents all objects by C structures. The following data structure is responsible for all integer objects:

struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
} PyLongObject;

After macros expansions, a simplified version of the struct looks as follows:

struct {
    ssize_t ob_refcnt;
    struct _typeobject *ob_type;
    ssize_t ob_size; 
    uint32_t ob_digit[1];
};

The ob_refcnt field is responsible for reference counting technique which is used in garbage collection mechanism, whereas ob_type is a pointer to a structure which describes an integer type.

Generally, In languages like C/C++, the precision of integers is limited to 64-bit, but Python has built-in support for Arbitrary-precision integers. Since Python 3 there is no longer simple integer type, and all integers are represented as a bignum.

How to store arbitrarily large integer?

One of the solutions is to represent the integer as an array of digits. To do it in an efficient way we need to convert our number from base 10 to base $2^{30}$ numeral system, where each element represents a single 'digit' in a range from $0$ to $2^{30}-1$. Depending on the platform, Python uses either 32-bit unsigned integer arrays with 30-bit digits or 16-bit unsigned integer arrays with 15-bit digits. Using such approach introduces additional requirements, that's why we can't use all bits of the integer. The ob_digit field in a structure above is responsible for such arrays.

To eliminate unnecessary computation CPython has a "fast path" implementation for integers in a range of $-2^{30}$ to $2^{30}$. Such integers are stored as an array of one element and if it's possible treated as fixed 32-bit integers.

Note that, unlike classical approach, the sign of an integer is stored separately in ob_size field. This field stores the size of the ob_digit array. So if you want to change the sign of an array of size 2 you need to set ob_size to -2.

The comment from the source code describes it as follows:

/* Long integer representation.
   The absolute value of a number is equal to
    SUM(for i=0 through abs(ob_size)-1) ob_digit[i] * 2**(SHIFT*i)
   Negative numbers are represented with ob_size < 0;
   zero is represented by ob_size == 0.
   In a normalized number, ob_digit[abs(ob_size)-1] (the most significant
   digit) is never zero.  Also, in all cases, for all valid i,
    0 <= ob_digit[i] <= MASK.
   The allocation function takes care of allocating extra memory
   so that ob_digit[0] ... ob_digit[abs(ob_size)-1] are actually available.

   CAUTION:  Generic code manipulating subtypes of PyVarObject has to aware that integers abuse  ob_size's sign bit.
*/

The representation of 123456789101112131415 will look as follows:

ob_size	3
ob_digit	437976919	87719511	107

Algorithm of converting our representation back:

$ (437976919 * 2^{30 * 0}) + (87719511 * 2^{30 * 1}) + (107 * 2^{30 * 2})$

Optimization of commonly-used integers

Small integer objects in a range of -5 to 256 are always pre-allocated during initialization. Because Python integers are immutable, we can use them as singletons. Every time you need to create small integer instead of creating new object Python just points to already allocated one. Thereby it saves a lot of space and computation for commonly-used integers.

Interestingly enough, the PyLongObject structure takes at least 28 bytes for every allocated integer and therefore takes three times as much memory as a simple 64-bit C integer.

Number system conversion

Let's see how Python converts integers to an array.

We are going to look at the simple example where we need to convert a long unsigned 64-bit integer to Python's integer representation. Note that it's the biggest possible standard C type that can be converted to our PyLongObject directly. Larger numbers can be converted from strings or byte arrays.

If you are not familiar with number system conversion, you can read this tutorial.

Here is a simplified algorithm translated from C to Python:

SHIFT = 30  # number of bits for each 'digit'
MASK = (2 ** SHIFT)
bignum = 18446744073709551615

def split_number(bignum):
    t = abs(bignum)

    num_list = []
    while t != 0:
        # Get remainder from division
        small_int = t % MASK  # more efficient bitwise analogue: (t & (MASK-1))
        num_list.append(small_int)

        # Get integral part of the division (floor division)
        t = t // MASK  # more efficient bitwise analogue: t >>= SHIFT

    return num_list

def restore_number(num_list):
    bignum = 0
    for i, n in enumerate(num_list):
        bignum += n * (2 ** (SHIFT * i))
    return bignum

num_list = split_number(bignum)
assert bignum == restore_number(num_list)

To verify our solution we can also take a look at the internal representation:

import ctypes

class PyLongObject(ctypes.Structure):
    _fields_ = [("ob_refcnt", ctypes.c_long),
                ("ob_type", ctypes.c_void_p),
                ("ob_size", ctypes.c_ulong),
                ("ob_digit", ctypes.c_uint * 3)]


bignum = 18446744073709551615

for d in PyLongObject.from_address(id(bignum)).ob_digit:
    print(d)

Arithmetic

Basic arithmetic operations are implemented using simple school math with only one difference — every element of an array is treated like a single "digit".

Every operation results in a minimum one object creation. For example, when you want to execute c += 10 roughly these steps are performed:

Get an address of an already allocated object for 10 (because it's a small number we don't have to create a new object, see above).
Create a new integer object so we can store the result of addition in it.
Add c and 10 together. Store the result in the newly created object.
Replace variable c with reference to our new object.
Decrease the reference count of an old c so garbage collector can destroy this object in the future.

For example, let's take a look at addition, where carrying is used:

def add_bignum(a, b):
    z = []

    if len(a) < len(b):
        # Ensure a is the larger of the two
        a, b = b, a

    carry = 0

    for i in range(0, len(b)):
        carry += a[i] + b[i]
        z.append(carry % MASK)
        carry = carry // MASK

    for i in range(i + 1, len(a)):
        carry += a[i]
        z.append(carry % MASK)
        carry = carry // MASK

    z.append(carry)

    # remove trailing zeros
    i = len(z)
    while i > 0 and z[i-1] == 0:
        i -= 1
    z = z[0:i]

    return z


a = 8223372036854775807
b = 100037203685477
assert restore_number(add_bignum(split_number(a), split_number(b))) == a + b

Diving deeper

There are a lot of small details that can't be described in one article. You can read more about integers in CPython (1, 2, 3) source code.

I have plans to continue this series, in the next posts I will describe internals of class and function objects.

If you have any questions, feel free to ask them via e-mail displayed in the footer.

python , cpython internals, advanced python

Hynek Černoch 2017-09-23 #
I found a bug in the last line of function add_bignum. The final assert fails for example for numbers a=2**30 - 1; b=1. It should also be noted that this function has been simplified to work only for non-negative numbers.
reply
- Artem 2017-09-23 #
  Thanks, I've fixed the bug. Original implementation uses subtraction function for negative addition.
  reply

mingrammer 2017-09-23 #
Thanks for great article.
So, I wanna translate it into korean on my personal blog
Can I do it? :) (Surely, I will link this original post)
Thanks
reply
- Artem 2017-09-23 #
  Yes, you can. It would be cool. Just mention original post in your article.
  reply

motleytech 2017-09-24 #
Good article.
It might be interesting to note the Karatsuba multiplication method which is implemented for these integers. Its much faster than naive multiplication (when dealing with large numbers).
reply

Régis 2017-10-04 #
There's a small typo: "So if you want to change the sign of an array of size 2 you need to set ob_digit to -2" should read "So if you want to change the sign of an array of size 2 you need to set ob_size to -2"
reply

Влад Токарев 2018-07-20 #
Great! It's very useful kind of articles that about internals of some software... Thank you! Write else! :-)
reply
- Artem 2018-07-20 #
  Thanks!
  reply

debashish palit 2018-07-30 #
Nice article. One thing I didn't get is the "remove trailing zeros" bit in add_bignum function. Going by your Python code, carry can be either 0 or 1. So, only one trailing zero may be present. So, is the while loop needed?
reply
- Artem 2018-08-02 #
  Good catch! In this case yes, it's not needed. Original C code uses a helper function with a while loop, but it's also used in other functions where more than 1 zeros can happen.
  https://github.com/python/cpython/blob/72ec3193b5118a2ccc8be8bf03d7b74691c6a264/Objects/longobject.c#L126
  reply

roachsinai 2019-04-04 #
Thanks for this very helpful post! But could you tell me more detail about "To eliminate unnecessary computation CPython has a "fast path" implementation for integers in a range of 2^(-30) - 2^30? In my mind python only story integer -5~257.
reply
- Artem 2019-04-04 #
  By fast path implementation I mean that math operations on the such numbers are calculated directly on the CPU. There is not need to use the bignum algorithms when working with them.
  reply
  roachsinai 2019-04-05 #
  Thanks for your reply.
  reply

Python internals: Arbitrary-precision integer implementation

How to store arbitrarily large integer?

Optimization of commonly-used integers

Number system conversion

Arithmetic

Diving deeper

Recent posts in Python category

October 21, 2020

On code isolation in Python

August 24, 2020

How to turn an ordinary gzip archive into a database

April 28, 2019

Detecting SQL injections in Python code using AST

August 09, 2018

How Python saves memory when storing strings

June 29, 2018

How virtual environment libraries work in Python

Comments