How Python saves memory when storing strings
Since Python 3, the str type uses Unicode representation. Unicode strings can take up to 4 bytes per character depending on the encoding, which can be expensive from a memory perspective.
To reduce memory consumption and improve performance, Python uses three kinds of internal representations for Unicode strings:
- 1 byte per char (Latin-1 encoding)
- 2 bytes per char (UCS-2 encoding)
- 4 bytes per char (UCS-4 encoding)
When programming in Python all strings behave the same, and most of the time we don't notice any difference. However, the difference can be very remarkable and sometimes unexpected when working with large amounts of text.
To see the difference in internal representations, we can use the sys.getsizeof function, which returns the size of an object in bytes:
>>> import sys
>>> string = 'hello'
>>> sys.getsizeof(string)
54
>>> # 1-byte encoding
>>> sys.getsizeof(string+'!')-sys.getsizeof(string)
1
>>> # 2-byte encoding
>>> string2 = '你'
>>> sys.getsizeof(string2+'好')-sys.getsizeof(string2)
2
>>> sys.getsizeof(string2)
76
>>> # 4-byte encoding
>>> string3 = '🐍'
>>> sys.getsizeof(string3+'💻')-sys.getsizeof(string3)
4
>>> sys.getsizeof(string3)
80
As you can see, Python uses different encodings depending on the content of a string. Note that every string in Python takes an additional 49-80 bytes of memory, where it stores supplementary information such as the hash, length, length in bytes, encoding type and string flags. That's why an empty string takes 49 bytes of memory.
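The per-character width can also be estimated without looking at exact header sizes. The sketch below compares two freshly built repetitions of the same string; the helper name bytes_per_char is made up for this illustration, and the result relies on CPython's compact string layout, so other interpreters may report different numbers.

```python
import sys

def bytes_per_char(s: str) -> int:
    """Estimate how many bytes CPython spends on each character of `s`.

    Hypothetical helper: s * 2 and s * 3 share the same internal
    representation and header size, so their size difference is the
    payload of exactly len(s) extra characters.
    """
    if not s:
        raise ValueError("need a non-empty string")
    return (sys.getsizeof(s * 3) - sys.getsizeof(s * 2)) // len(s)

print(bytes_per_char('hello'))  # characters in the Latin-1 range
print(bytes_per_char('你好'))    # characters that need 2 bytes
print(bytes_per_char('🐍'))      # character that needs 4 bytes
```

Because only the difference between two sizes is used, the fixed per-string overhead cancels out.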
We can retrieve the encoding directly from an object using ctypes:
import ctypes

class PyUnicodeObject(ctypes.Structure):
    # internal fields of the string object
    # (CPython implementation detail; the layout can differ between versions)
    _fields_ = [("ob_refcnt", ctypes.c_ssize_t),  # Py_ssize_t, not long
                ("ob_type", ctypes.c_void_p),
                ("length", ctypes.c_ssize_t),
                ("hash", ctypes.c_ssize_t),
                ("interned", ctypes.c_uint, 2),
                ("kind", ctypes.c_uint, 3),
                ("compact", ctypes.c_uint, 1),
                ("ascii", ctypes.c_uint, 1),
                ("ready", ctypes.c_uint, 1),
                # ...
                # ...
                ]

def get_string_kind(string):
    # read the "kind" bit field straight from the object's memory
    return PyUnicodeObject.from_address(id(string)).kind
>>> get_string_kind('Hello')
1
>>> get_string_kind('你好')
2
>>> get_string_kind('🐍')
4
If all characters in a string fit in the Latin-1 range, they are stored using the 1-byte Latin-1 encoding. Latin-1 represents the first 256 Unicode code points, which include ASCII. It supports many languages written in the Latin script, such as English, Swedish, Italian and Norwegian. However, it cannot store text in non-Latin scripts, such as Chinese, Japanese, Hebrew or Cyrillic, because their code points (numerical indexes) are defined outside the 1-byte (0-255) range.
>>> ord('a')
97
>>> ord('你')
20320
>>> ord('!')
33
Most popular natural languages fit in the 2-byte (UCS-2) encoding. The 4-byte (UCS-4) encoding is used when a string contains special symbols, emojis or rare languages. There are almost 300 blocks (ranges) in the Unicode standard. The characters that need 4 bytes live in the blocks above code point 0xFFFF.
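The rule above boils down to looking at the highest code point in the string. Here is a minimal sketch of that rule in pure Python; expected_width is a hypothetical name, and the function mirrors the description in this article rather than CPython's actual C code.

```python
def expected_width(s: str) -> int:
    """Bytes per character implied by the highest code point in `s`."""
    highest = max(map(ord, s), default=0)  # an empty string counts as 1-byte
    if highest <= 0xFF:
        return 1   # everything fits in the Latin-1 range
    if highest <= 0xFFFF:
        return 2   # everything fits in UCS-2
    return 4       # at least one code point above 0xFFFF

for sample in ('hello', 'Привет', '你好', 'hello 🐍'):
    print(repr(sample), expected_width(sample))
```

Note that a single wide character decides the width for the whole string, not just for itself.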
Suppose we have 10GB of ASCII text and want to load it into memory as a single string. If we insert a single emoji into that text, the size of the string will increase by a factor of 4! This is a huge difference that you may encounter in practice when working on NLP problems.
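The same effect can be reproduced at a smaller scale. The snippet below is a sketch that uses one million characters as a stand-in for the 10GB corpus; the variable names are illustrative, and the measured sizes assume CPython.

```python
import sys

ascii_text = 'a' * 1_000_000      # stand-in for a large ASCII corpus
with_emoji = ascii_text + '🐍'    # one character above 0xFFFF is appended

size_before = sys.getsizeof(ascii_text)
size_after = sys.getsizeof(with_emoji)
ratio = size_after / size_before

print(f"{size_before:,} bytes -> {size_after:,} bytes")
print(f"growth factor: {ratio:.2f}")
```

The growth comes from re-encoding every existing character at the wider width, not from the emoji itself.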
Why Python doesn't use UTF-8 encoding internally
The most well-known and popular Unicode encoding is UTF-8, but Python doesn't use it internally.
When a string is stored in the UTF-8 encoding, each character is encoded using 1-4 bytes depending on the character it represents. It's a storage-efficient encoding, but it has one significant disadvantage. Since characters vary in byte length, there is no way to randomly access an individual character by index without scanning the string. So, to perform a simple operation such as indexing into a string, with UTF-8 Python would need to scan the string until it finds the required character. Fixed-length encodings don't have this problem: to locate a character by index, Python just multiplies the index by the length of one character (1, 2 or 4 bytes).
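To make the difference concrete, here is a sketch of both lookups over raw bytes. The function names are hypothetical, UTF-32 stands in for a fixed-width representation, and both functions assume a valid index and well-formed input; this illustrates the idea and is not how CPython implements indexing.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with `lead_byte`."""
    if lead_byte < 0x80:
        return 1
    if lead_byte >> 5 == 0b110:
        return 2
    if lead_byte >> 4 == 0b1110:
        return 3
    if lead_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


def char_at_utf8(data: bytes, index: int) -> str:
    """Linear time: step over every preceding character to find the offset."""
    offset = 0
    for _ in range(index):
        offset += utf8_sequence_length(data[offset])
    end = offset + utf8_sequence_length(data[offset])
    return data[offset:end].decode('utf-8')


FIXED_WIDTH = 4  # bytes per character in UTF-32


def char_at_fixed(data: bytes, index: int) -> str:
    """Constant time: the offset is simply index * width."""
    offset = index * FIXED_WIDTH
    return data[offset:offset + FIXED_WIDTH].decode('utf-32-le')


text = 'héllo 你好 🐍'
utf8_bytes = text.encode('utf-8')
utf32_bytes = text.encode('utf-32-le')

print(char_at_utf8(utf8_bytes, 9), char_at_fixed(utf32_bytes, 9))
```

The UTF-8 version touches every character before the requested one, while the fixed-width version does a single multiplication regardless of the index.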
String interning
When working with empty strings or one-character ASCII strings, Python uses string interning. Interned strings act as singletons: if you have two identical strings that are interned, there is only one copy of them in memory.
>>> a = 'hello'
>>> b = 'world'
>>> a[4],b[1]
('o', 'o')
>>> id(a[4]), id(b[1]), a[4] is b[1]
(4567926352, 4567926352, True)
>>> id('')
4545673904
>>> id('')
4545673904
As you can see, both string slices point to the same address in memory. This is possible because Python strings are immutable.
In Python, string interning is not limited to characters or empty strings. Strings that are created during code compilation can also be interned if their length does not exceed 20 characters (a limit that depends on the CPython version). This includes:
- function and class names
- variable names
- argument names
- constants (all strings that are defined in the code)
- keys of dictionaries
- names of attributes
When you hit enter in the Python REPL, your statement gets compiled down to bytecode. That's why all short strings in the REPL are also interned.
>>> a = 'teststring'
>>> b = 'teststring'
>>> id(a), id(b), a is b
(4569487216, 4569487216, True)
>>> a = 'test'*5
>>> b = 'test'*5
>>> len(a), id(a), id(b), a is b
(20, 4569499232, 4569499232, True)
>>> a = 'test'*6
>>> b = 'test'*6
>>> len(a), id(a), id(b), a is b
(24, 4569479328, 4569479168, False)
In the following example the strings are not interned, because they are created at run time and are not constants:
>>> open('test.txt','w').write('hello')
5
>>> open('test.txt','r').read()
'hello'
>>> a = open('test.txt','r').read()
>>> b = open('test.txt','r').read()
>>> id(a), id(b), a is b
(4384934576, 4384934688, False)
>>> len(a), id(a), id(b), a is b
(5, 4384934576, 4384934688, False)
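Strings built at run time can still be interned on request with the standard sys.intern function. The sketch below builds the strings with str.join instead of reading a file, purely to keep the example self-contained.

```python
import sys

# Build the strings at run time so they are not compile-time constants.
parts = ['hello', ' ', 'world']
a = ''.join(parts)
b = ''.join(parts)
print(a == b)   # the contents are equal
print(a is b)   # typically False in CPython: two separate objects

# sys.intern returns the canonical copy of an equal string.
a = sys.intern(a)
b = sys.intern(b)
print(a is b)   # both names now refer to the same interned object
```

This can be worthwhile when the same run-time strings, such as tokens read from a file, repeat many times.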
String interning saves tens of thousands of duplicate string allocations. Internally, string interning is maintained by a global dictionary where strings are used as keys. To check whether an identical string is already in memory, Python performs a dictionary membership operation.
The unicode object is almost 16 000 lines of C code, so there are a lot of small optimizations that are not mentioned in this article. If you want to learn more about Unicode in Python, I recommend reading the PEPs about strings and checking the code of the unicode object.
- Anonymous 3 years, 3 months ago #
I feel there are two contradicting statements here. You said that inserting a single emoji into a text of size 10GB of ASCII will increase the size by a factor of 4. But in Python each character is encoded using 1-4 bytes depending on the character it is representing. So ideally that emoji character alone should be encoded using 4 bytes but not whole 10GB text. So how inserting a single emoji increase text size by a factor of 4?
- Artem 3 years, 3 months ago #
That happens because Python will use a single character encoding for the whole string when loading it in one variable. You can't mix them since you want an ability to index or scan a large string quickly.
One emoji forces Python to use four bytes for each character. Because of that, Python takes constant time to access a random index.
- Ameer 1 year, 9 months ago #
There is an error in the second last code block. Should have been true as well
This is what I got
a = 'test'*6
b = 'test'*6
len(a), id(a), id(b), a is b
(24, 139946711810096, 139946711810096, True)
- Tharunika 1 year, 4 months ago #
S = "Hello world"
print(S.count(""))
It prints 12 as output. How? Can anyone explain? If I put a space between the quotes it gives 1 as output. If I don't put a space it gives 12 as output.
- Sergi 1 year ago #
Thanks for the post. I got a few things clear and useful that I did not get after reading a few other posts.
- Guy 9 months, 1 week ago #
thanks great article! can you please explain the following:
a = "shai"
b = "shai"
print(a is b) # True

a = "shai!"
b = "shai!"
print(a is b) # False
- Guy 5 months, 1 week ago #
But the below code kinda contradicts this model of string being stored as a contiguous block of characters, but more like an array of references/pointers to individual characters since o in both the strings point to the same object
s1 = "hello"
s2 = "world"
id(s1[4])
140195535215024
id(s2[1])
140195535215024
So, should I see string as an array of characters or array of references to character objects?
Nice article! Can I transfer it to Chinese with a source link?
Looking forward to your reply.