How Python saves memory when storing strings

Last updated on August 10, 2018, in python

Since Python 3, the str type holds Unicode text. Depending on the encoding, a Unicode string can take up to 4 bytes per character, which can be expensive in terms of memory.

To reduce memory consumption and improve performance, Python (since version 3.3) uses three kinds of internal representation for Unicode strings:

  • 1 byte per char (Latin-1 encoding)
  • 2 bytes per char (UCS-2 encoding)
  • 4 bytes per char (UCS-4 encoding)

When programming in Python, all strings behave the same, and most of the time we don't notice any difference. With large amounts of text, however, the difference in memory usage can be substantial and sometimes unexpected.

To see the difference in internal representations, we can use the sys.getsizeof function, which returns the size of an object in bytes:

>>> import sys
>>> string = 'hello'
>>> sys.getsizeof(string)
54
>>> # 1-byte encoding
>>> sys.getsizeof(string+'!')-sys.getsizeof(string)
1
>>> # 2-byte encoding
>>> string2 = '你'
>>> sys.getsizeof(string2+'好')-sys.getsizeof(string2)
2
>>> sys.getsizeof(string2)
76
>>> # 4-byte encoding
>>> string3 = '🐍'
>>> sys.getsizeof(string3+'💻')-sys.getsizeof(string3)
4
>>> sys.getsizeof(string3)
80

As you can see, Python picks the representation based on the content of the string. Note that every string also carries a fixed overhead of 49 to 76 bytes (in the CPython build used here), where it stores supplementary information such as the hash, the length, the length in bytes, the encoding type and string flags. That's why an empty string takes 49 bytes of memory.
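
The per-character cost can also be measured programmatically. The following is a minimal sketch, assuming CPython 3.3 or later; bytes_per_char is a hypothetical helper name, not part of the standard library. It compares two freshly built strings of the same kind so that the fixed overhead cancels out.

```python
import sys


def bytes_per_char(s):
    """Estimate how many bytes CPython uses per character of a non-empty string."""
    # s * 3 and s * 2 share the same internal representation and overhead,
    # so their size difference is the payload of exactly len(s) characters.
    extra = sys.getsizeof(s * 3) - sys.getsizeof(s * 2)
    return extra // len(s)


widths = {text: bytes_per_char(text) for text in ("hello", "你好", "🐍")}
print(widths)
```

On CPython this should report 1, 2 and 4 bytes for the three sample strings. Other Python implementations may store strings differently.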

We can read the representation kind directly from a string object using ctypes:

import ctypes

class PyUnicodeObject(ctypes.Structure):
    # Leading fields of CPython's string object. This mirrors the layout
    # of the CPython version used in this article and may differ in others.
    _fields_ = [("ob_refcnt", ctypes.c_ssize_t),
                ("ob_type", ctypes.c_void_p),
                ("length", ctypes.c_ssize_t),
                ("hash", ctypes.c_ssize_t),
                ("interned", ctypes.c_uint, 2),
                ("kind", ctypes.c_uint, 3),
                ("compact", ctypes.c_uint, 1),
                ("ascii", ctypes.c_uint, 1),
                ("ready", ctypes.c_uint, 1),
                # ...
                # ...
                ]


def get_string_kind(string):
    # In CPython, id() returns the memory address of the object
    return PyUnicodeObject.from_address(id(string)).kind

>>> get_string_kind('Hello')
1
>>> get_string_kind('你好')
2
>>> get_string_kind('🐍')
4

If all characters in a string fit in the Latin-1 range (code points 0-255, of which ASCII is the first 128), they are stored using the 1-byte Latin-1 encoding. Latin-1 covers the first 256 Unicode characters. It supports many languages written in the Latin script, such as English, Swedish, Italian and Norwegian. However, it cannot store text in non-Latin scripts, such as Chinese, Japanese, Hebrew or Cyrillic, because their code points (numerical indexes) are defined outside the 1-byte (0-255) range.

>>> ord('a')
97
>>> ord('你')
20320
>>> ord('!')
33

Most popular natural languages fit in the 2-byte (UCS-2) encoding. The 4-byte (UCS-4) encoding is used when a string contains special symbols, emojis or rare scripts. There are almost 300 blocks (ranges) in the Unicode standard; the blocks that need 4 bytes are those whose code points lie above 0xFFFF.
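
The selection rule described here depends only on the highest code point in the string. A minimal sketch of that rule follows; expected_kind is a hypothetical name and this is a model of the behaviour, not CPython's actual code.

```python
def expected_kind(s):
    """Predict the bytes-per-character kind (1, 2 or 4) from the widest character."""
    highest = max(map(ord, s), default=0)
    if highest <= 0xFF:      # fits in Latin-1
        return 1
    if highest <= 0xFFFF:    # fits in UCS-2
        return 2
    return 4                 # needs UCS-4


for text in ("Hello", "café", "你好", "🐍"):
    print(repr(text), expected_kind(text))
```

Because the rule looks at the maximum, one wide character is enough to change the kind for the whole string.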

Suppose we have 10GB of ASCII text and we want to load it into memory as a single string. If we insert a single emoji into the text, the size of the string increases by a factor of 4, because the whole string must then use the 4-byte representation. This is a huge difference that you may encounter in practice when working on NLP problems.
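
The same effect can be reproduced at a smaller scale. This is a rough sketch, assuming CPython; one million characters stand in for the 10GB text, and the variable names are only illustrative.

```python
import sys

ascii_text = "a" * 1_000_000      # stored with 1 byte per character
with_emoji = ascii_text + "🐍"    # one emoji forces 4 bytes per character

ascii_size = sys.getsizeof(ascii_text)
emoji_size = sys.getsizeof(with_emoji)
ratio = emoji_size / ascii_size
print(ascii_size, emoji_size, round(ratio, 2))
```

The ratio should come out very close to 4, since the fixed per-string overhead is negligible at this length.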

Why Python doesn't use UTF-8 encoding internally

The most well-known and popular Unicode encoding is UTF-8, but Python doesn't use it internally.

When a string is stored in UTF-8, each character is encoded using 1 to 4 bytes depending on the character. It's a storage-efficient encoding, but it has one significant disadvantage. Because characters vary in byte length, there is no way to access an individual character by index without scanning the string. So, to perform a simple operation such as string[5] on UTF-8 data, Python would need to scan the string until it reaches the required character. Fixed-length encodings don't have this problem: to locate a character by index, Python just multiplies the index by the size of one character (1, 2 or 4 bytes).
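
To make the contrast concrete, here is a small illustration written in Python over raw bytes. It is a sketch of the two lookup strategies, not of CPython internals; utf8_width, char_at_utf8 and char_at_fixed are hypothetical helpers, and UTF-32 (little-endian) stands in for the fixed 4-byte representation.

```python
def utf8_width(lead_byte):
    """Number of bytes in a UTF-8 sequence, judged from its first byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte >> 5 == 0b110:
        return 2
    if lead_byte >> 4 == 0b1110:
        return 3
    if lead_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


def char_at_utf8(data, index):
    """Variable width: step over every preceding character, O(index)."""
    pos = 0
    for _ in range(index):
        pos += utf8_width(data[pos])
    return data[pos:pos + utf8_width(data[pos])].decode("utf-8")


def char_at_fixed(data, index):
    """Fixed width of 4 bytes: jump straight to index * 4, O(1)."""
    start = index * 4
    return data[start:start + 4].decode("utf-32-le")


text = "a你🐍b"
print(char_at_utf8(text.encode("utf-8"), 2))
print(char_at_fixed(text.encode("utf-32-le"), 2))
```

Both calls return the same character, but the UTF-8 version has to inspect each earlier character first, while the fixed-width version computes the offset directly.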

String interning

When working with empty strings or one-character ASCII strings, Python uses string interning. Interned strings act as singletons: if you have two identical strings that are interned, there is only one copy of them in memory.

>>> a = 'hello'
>>> b = 'world'
>>> a[4],b[1]
('o', 'o')
>>> id(a[4]), id(b[1]), a[4] is b[1]
(4567926352, 4567926352, True)
>>> id('')
4545673904
>>> id('')
4545673904

As you can see, both one-character results point to the same address in memory. This sharing is safe because Python strings are immutable.

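In Python, string interning is not limited to single characters or empty strings. String constants created during code compilation can also be interned, provided they look like identifiers, that is, they contain only ASCII letters, digits and underscores. A string computed at compile time (such as 'test'*5) additionally has to be short enough to be folded into a constant: the limit was 20 characters in the CPython version used for this article and is larger in later versions.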

This includes:

  • function and class names
  • variable names
  • argument names
  • constants (all strings that are defined in the code)
  • keys of dictionaries
  • names of attributes

When you hit enter in the Python REPL, your statement gets compiled down to bytecode. That's why such short strings typed in the REPL are also interned.

>>> a = 'teststring'
>>> b = 'teststring'
>>> id(a), id(b), a is b
(4569487216, 4569487216, True)
>>> a = 'test'*5
>>> b = 'test'*5
>>> len(a), id(a), id(b), a is b
(20, 4569499232, 4569499232, True)
>>> a = 'test'*6
>>> b = 'test'*6
>>> len(a), id(a), id(b), a is b
(24, 4569479328, 4569479168, False)
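
The same behaviour can be observed outside the REPL by compiling two snippets separately. This is a sketch assuming CPython; compile_and_get is a hypothetical helper, and each call plays the role of one REPL statement.

```python
def compile_and_get(source, name):
    """Compile `source` on its own and return the value bound to `name`."""
    namespace = {}
    exec(compile(source, "<example>", "exec"), namespace)
    return namespace[name]


# Two separate compilations of an identifier-like constant.
a = compile_and_get("value = 'teststring'", "value")
b = compile_and_get("value = 'teststring'", "value")
print(a is b)

# A constant containing other characters; whether the two objects are
# shared depends on the CPython version, so no result is claimed here.
c = compile_and_get("value = 'test string!'", "value")
d = compile_and_get("value = 'test string!'", "value")
print(c is d)
```

The first comparison is expected to print True on CPython, because both compilations end up with the same interned object.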

The following strings are not interned, because they are created at run time and are not constants:

>>> open('test.txt','w').write('hello')
5
>>> open('test.txt','r').read()
'hello'
>>> a = open('test.txt','r').read()
>>> b = open('test.txt','r').read()
>>> id(a), id(b), a is b
(4384934576, 4384934688, False)
>>> len(a), id(a), id(b), a is b
(5, 4384934576, 4384934688, False)
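
Strings created at run time can still be interned explicitly with the standard library function sys.intern. A brief sketch, where the join calls simply stand in for any data produced at run time:

```python
import sys

# Build two equal strings at run time.
a = "".join(["hel", "lo"])
b = "".join(["he", "llo"])
print(a == b, a is b)

# sys.intern returns the canonical copy from the intern table.
a = sys.intern(a)
b = sys.intern(b)
print(a == b, a is b)
```

After the sys.intern calls both names refer to the same object, so the second line prints True True.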

String interning saves tens of thousands of duplicate string allocations. Internally, interned strings are kept in a global dictionary where the strings are used as keys. To check whether an identical string already exists in memory, Python performs a dictionary membership test.

The unicode object is almost 16 000 lines of C code, so there are a lot of small optimizations that are not mentioned in this article. If you want to learn more about Unicode in Python, I recommend reading the PEPs about strings and checking the code of the unicode object.


If you have any questions, feel free to ask them via the e-mail address displayed in the footer.

Comments

  • Kevin Bai 2018-08-20 #

    Nice article! Can I transfer it to Chinese with a source link?
    Looking forward to your reply.

    • Artem 2018-08-20 #

      Sure, no problem.

      • Kevin Bai 2018-08-20 #

        Yeah, thanks!

  • Anonymous 2019-12-15 #

    I feel there are two contradicting statements here. You said that inserting a single emoji into a text of size 10GB of ASCII will increase the size by a factor of 4. But in Python each character is encoded using 1-4 bytes depending on the character it is representing. So ideally that emoji character alone should be encoded using 4 bytes but not whole 10GB text. So how inserting a single emoji increase text size by a factor of 4?

    • Artem 2019-12-15 #

      That happens because Python will use a single character encoding for the whole string when loading it in one variable. You can't mix them since you want an ability to index or scan a large string quickly.

      One emoji forces Python to use four bytes for each character. Because of that, Python takes constant time to access a random index, e.g. string[10000].

      • Anonymous 2019-12-24 #

        That explains well. Thank you

        • Jorge 2020-11-03 #

          It looks like you would be a fan of UTF-8. It does exactly what you're wishing for. ("emoji character alone should be encoded using 4 bytes but not whole 10GB text.")

  • drizzlex 2020-01-03 #

    Excellent post. I am glad to learn something this morning.

  • mrsmith 2020-05-26 #

    Fantastic post! Thanks for sharing

  • Sia 2020-09-13 #

    Really Cool . Thanks for well detailed explanation.

  • Ameer 2021-06-11 #

    There is an error in the second last code block. Should have been true as well

    This is what I got

    >>> a = 'test'*6
    >>> b = 'test'*6
    >>> len(a), id(a), id(b), a is b
    (24, 139946711810096, 139946711810096, True)

  • Tharunika 2021-11-21 #

    S= "Hello world" Print (S.count("") It was printing 12 as output how?.. can anyone explain? If I specify space between quotes it is giving 1 as output.. If we don't specify space it was giving 12 as output

    • Rajat 2024-02-03 #

      because there is empty string between each character and before the first and after the last character. therefore total 12

  • Sergi 2022-03-05 #

    Thanks for the post. I got a few things clear and useful that I did not get after reading a few other posts.

  • Guy 2022-06-20 #

    thanks great article! can you please explain the following:

    a = "shai"
    b = "shai"
    print(a is b)  # True

    a = "shai!"
    b = "shai!"
    print(a is b)  # False

    • Artem 2022-06-20 #

      Python interns short strings.

      • Guy 2022-06-20 #

        Yes but why does the '!' cause this different behavior as it's one of the ascii characters and should be stored the same?

        • Artem 2022-06-21 #

          There is a check for extra characters. https://github.com/python/cpython/blob/1603a1029f44f0fdc87c65b02063229962194f84/Objects/codeobject.c#L21

          • Guy 2022-06-21 #

            Thanks! much appreciated!

  • Guy 2022-10-20 #

    But the below code kinda contradicts this model of string being stored as a contiguous block of characters, but more like an array of references/pointers to individual characters since o in both the strings point to the same object

    >>> s1 = "hello"
    >>> s2 = "world"
    >>> id(s1[4])
    140195535215024
    >>> id(s2[1])
    140195535215024

    So, should I see string as an array of characters or array of references to character objects?

    • Artem 2022-10-21 #

      No, that happens because any string slice produces a new substring.

  • Pseudo 2024-01-30 #

    I have a one doubt,

    import sys

    print(sys.getsizeof("hello world hello world"))
    72
    print(sys.getsizeof(["hello world hello world"]))
    64
    print(sys.getsizeof(("hello world hello world",)))
    48

    Why this happens? list or tuple contains low space as compare to string?

  • Anshul 2024-04-23 #

    Fantastic Post! I have 2 doubts here, i will really appreciate if someone could help.

    1. How indexing works in python? How is it different from other languages like C++? e.g. we store a = 'hello world' and b = 'good morning' now as id(a[4]) and id(b[1]) is same. that means h, e, l,... are not in continuity here. then how does python strings decide which character to pick?

    2. if i have two string a = 'hello world' b = 'hello world'. why the id's of both a and b are not same.

    thanks in advance

    • Dan 2024-04-28 #

      Python doesn't have a character type, just string. When you do a[4], you are creating a new temporary string "o". Same for b[1]. There are 2 things that may happen here a/ Since there is no reference to a[4], it gets garbage collected, and b[1] is allocated to the freed memory. b/ More likely, the temporary string "o" is interned, so b[1] just points to the same string as a[4]. Either way, that results in id(a[4]) and id(b[1]) being the same.

      In the case where you have a = 'hello world' and b = 'hello world', since string interning only works with small strings, a and b may be two different strings stored in different places in the memory. Hence the possibly different id's.
