How Python saves memory when storing strings
Since Python 3, the str type uses a Unicode representation. Unicode strings can take up to 4 bytes per character depending on the encoding, which can be expensive from a memory perspective.
To reduce memory consumption and improve performance, Python uses three kinds of internal representations for Unicode strings:
- 1 byte per char (Latin-1 encoding)
- 2 bytes per char (UCS-2 encoding)
- 4 bytes per char (UCS-4 encoding)
When programming in Python all strings behave the same, and most of the time we don't notice any difference. However, the difference can be very remarkable and sometimes unexpected when working with large amounts of text.
To see the difference in internal representations, we can use the sys.getsizeof function, which returns the size of an object in bytes:
>>> import sys
>>> string = 'hello'
>>> sys.getsizeof(string)
54
>>> # 1-byte encoding
>>> sys.getsizeof(string+'!')-sys.getsizeof(string)
1
>>> # 2-byte encoding
>>> string2 = '你'
>>> sys.getsizeof(string2+'好')-sys.getsizeof(string2)
2
>>> sys.getsizeof(string2)
76
>>> # 4-byte encoding
>>> string3 = '🐍'
>>> sys.getsizeof(string3+'💻')-sys.getsizeof(string3)
4
>>> sys.getsizeof(string3)
80
As you can see, depending on the content of a string, Python uses different encodings. Note that every string in Python takes an additional 49-80 bytes of memory (on the 64-bit CPython build used for these examples), which holds supplementary information such as the hash, length, length in bytes, encoding type and string flags. That's why an empty string takes 49 bytes of memory.
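The per-character cost can also be measured without looking at absolute sizes. The snippet below is a minimal sketch that assumes CPython 3.3 or newer; the helper name measured_char_width is made up for this illustration. It compares two strings of the same kind that differ only in length, so the fixed header cancels out.

```python
import sys


def measured_char_width(sample: str) -> int:
    """Bytes CPython adds per extra character of the same kind as `sample`.

    Hypothetical helper for illustration; relies on CPython's sys.getsizeof.
    """
    if not sample:
        raise ValueError("sample must contain at least one character")
    # Both strings are built at run time and share the same header size,
    # so the difference is only the storage for len(sample) characters.
    longer = sys.getsizeof(sample * 3)
    shorter = sys.getsizeof(sample * 2)
    return (longer - shorter) // len(sample)


for sample in ("a", "é", "你", "🐍"):
    print(repr(sample), measured_char_width(sample))
```

On CPython this should report 1 for 'a' and 'é' (both fit in Latin-1), 2 for '你' and 4 for '🐍'.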
We can retrieve the encoding directly from an object using ctypes:
import ctypes

class PyUnicodeObject(ctypes.Structure):
    # internal fields of the string object (a CPython implementation
    # detail; the exact layout differs between CPython versions)
    _fields_ = [("ob_refcnt", ctypes.c_ssize_t),
                ("ob_type", ctypes.c_void_p),
                ("length", ctypes.c_ssize_t),
                ("hash", ctypes.c_ssize_t),
                ("interned", ctypes.c_uint, 2),
                ("kind", ctypes.c_uint, 3),
                ("compact", ctypes.c_uint, 1),
                ("ascii", ctypes.c_uint, 1),
                ("ready", ctypes.c_uint, 1),
                # ...
                # ...
               ]

def get_string_kind(string):
    # id() returns the object's memory address in CPython
    return PyUnicodeObject.from_address(id(string)).kind
>>> get_string_kind('Hello')
1
>>> get_string_kind('你好')
2
>>> get_string_kind('🐍')
4
If all characters in a string fit in the Latin-1 range (code points 0-255), they are stored using the 1-byte Latin-1 encoding. Latin-1 covers the first 256 Unicode characters, which is enough for many Latin-script languages such as English, Swedish, Italian and Norwegian. However, it cannot store text in other scripts, such as Chinese, Japanese, Hebrew or Cyrillic, because their code points (numerical indexes) are defined outside of the 1-byte (0-255) range.
>>> ord('a')
97
>>> ord('你')
20320
>>> ord('!')
33
Most popular natural languages fit in the 2-byte (UCS-2) encoding. The 4-byte (UCS-4) encoding is used when a string contains special symbols, emojis or rare scripts. The Unicode standard has around 300 blocks (ranges); the characters that need 4 bytes live in the blocks above code point 0xFFFF.
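The rule for choosing a representation can be written down in a few lines. This is only a sketch of the rule as described in this article, not CPython's actual code, and predicted_kind is a hypothetical name: the widest character in the string decides the width for all of them.

```python
def predicted_kind(text: str) -> int:
    """Predict the bytes per character from the highest code point.

    Hypothetical helper that mirrors the 1/2/4-byte rule described above.
    """
    highest = max(map(ord, text), default=0)  # empty string: nothing to widen
    if highest <= 0xFF:      # fits in Latin-1
        return 1
    if highest <= 0xFFFF:    # fits in UCS-2
        return 2
    return 4                 # needs UCS-4


for text in ("hello", "naïve", "你好", "hello 🐍"):
    print(repr(text), predicted_kind(text))
```

Note that 'hello 🐍' is predicted as 4 even though only one of its characters is outside the Latin-1 range.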
Suppose we have 10 GB of ASCII text and we want to load it into memory. If we insert a single emoji into that text, the size of the string will increase by a factor of 4! This is a huge difference that you may encounter in practice when working on NLP problems.
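We can reproduce the effect on a smaller scale. The sketch below assumes CPython and uses a 1 MB string instead of 10 GB. The second half shows one possible mitigation that is not part of the original measurements: keeping the text as separate lines, so that only the line containing the emoji is widened. The list's own overhead is not counted there.

```python
import sys

# One million ASCII characters: stored with 1 byte per character.
ascii_text = "a" * 1_000_000
# Appending one emoji forces 4 bytes per character for the whole string.
with_emoji = ascii_text + "🐍"

ratio = sys.getsizeof(with_emoji) / sys.getsizeof(ascii_text)
print(f"size ratio: {ratio:.2f}")

# Possible mitigation (an assumption of this sketch): store lines separately,
# so each line picks its own representation.
lines = ["a" * 1000 for _ in range(1000)]
lines.append("🐍")
split_total = sum(sys.getsizeof(line) for line in lines)
print(f"one string: {sys.getsizeof(with_emoji)} bytes, split: {split_total} bytes")
```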
Why Python doesn't use UTF-8 encoding internally
The most well-known and popular Unicode encoding is UTF-8, but Python doesn't use it internally.
When a string is stored in the UTF-8 encoding, each character is encoded using 1-4 bytes depending on the character it represents. It's a storage-efficient encoding, but it has one significant disadvantage: since characters vary in byte length, there is no way to access an individual character by index without scanning the string. So, to perform a simple operation such as string[5] with UTF-8, Python would need to scan the string until it finds the required character. Fixed-length encodings don't have this problem: to locate a character by index, Python just multiplies the index by the length of one character (1, 2 or 4 bytes).
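To make the difference concrete, here is a minimal sketch that indexes into raw encoded bytes by hand. The function names are made up for this illustration and this is not how CPython is implemented; UTF-32 stands in for the fixed 4-byte representation.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Number of bytes in a UTF-8 sequence, judged by its first byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte >> 5 == 0b110:
        return 2
    if lead_byte >> 4 == 0b1110:
        return 3
    if lead_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


def utf8_char_at(data: bytes, index: int) -> str:
    """Variable width: walk over `index` characters before reading one."""
    position = 0
    for _ in range(index):
        position += utf8_sequence_length(data[position])
    length = utf8_sequence_length(data[position])
    return data[position:position + length].decode("utf-8")


def utf32_char_at(data: bytes, index: int) -> str:
    """Fixed width: the offset is a single multiplication."""
    start = index * 4
    return data[start:start + 4].decode("utf-32-le")


text = "aé你🐍z"
utf8_bytes = text.encode("utf-8")
utf32_bytes = text.encode("utf-32-le")
print(utf8_char_at(utf8_bytes, 3), utf32_char_at(utf32_bytes, 3))
```

The UTF-8 version does work proportional to the index, while the fixed-width version does the same amount of work for any index.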
String interning
When working with empty strings or one-character ASCII strings, Python uses string interning. Interned strings act as singletons: if you have two identical strings that are interned, there is only one copy of them in memory.
>>> a = 'hello'
>>> b = 'world'
>>> a[4],b[1]
('o', 'o')
>>> id(a[4]), id(b[1]), a[4] is b[1]
(4567926352, 4567926352, True)
>>> id('')
4545673904
>>> id('')
4545673904
As you can see, both one-character strings point to the same address in memory. This is possible because Python strings are immutable.
In Python, string interning is not limited to single characters or empty strings. Strings that are created during code compilation can also be interned if their length does not exceed 20 characters (in the CPython version used for this article; the exact rules are an implementation detail and differ between versions).
This includes:
- function and class names
- variable names
- argument names
- constants (all strings that are defined in the code)
- keys of dictionaries
- names of attributes
When you hit enter in the Python REPL, your statement gets compiled down to bytecode. That's why all short strings in the REPL are also interned.
>>> a = 'teststring'
>>> b = 'teststring'
>>> id(a), id(b), a is b
(4569487216, 4569487216, True)
>>> a = 'test'*5
>>> b = 'test'*5
>>> len(a), id(a), id(b), a is b
(20, 4569499232, 4569499232, True)
>>> a = 'test'*6
>>> b = 'test'*6
>>> len(a), id(a), id(b), a is b
(24, 4569479328, 4569479168, False)
In the following example the strings are not interned, because they are created at run time and are not constants:
>>> open('test.txt','w').write('hello')
5
>>> open('test.txt','r').read()
'hello'
>>> a = open('test.txt','r').read()
>>> b = open('test.txt','r').read()
>>> id(a), id(b), a is b
(4384934576, 4384934688, False)
>>> len(a), id(a), id(b), a is b
(5, 4384934576, 4384934688, False)
String interning saves tens of thousands of duplicate string allocations. Internally, interned strings are kept in a global dictionary where the strings are used as keys. To check whether an identical string already exists in memory, Python performs a dictionary membership test.
The unicode object is almost 16,000 lines of C code, so there are a lot of small optimizations that are not mentioned in this article. If you want to learn more about Unicode in Python, I recommend reading the PEPs about strings and checking the code of the unicode object.
Comments
- Anonymous 2019-12-15 #
I feel there are two contradicting statements here. You said that inserting a single emoji into 10 GB of ASCII text will increase the size by a factor of 4. But in Python each character is encoded using 1-4 bytes depending on the character it represents. So ideally only the emoji character should be encoded using 4 bytes, not the whole 10 GB of text. So how does inserting a single emoji increase the text size by a factor of 4?
- Artem 2019-12-15 #
That happens because Python will use a single character encoding for the whole string when loading it in one variable. You can't mix them since you want an ability to index or scan a large string quickly.
One emoji forces Python to use four bytes for each character. Because of that, Python takes constant time to access a random index, e.g. string[10000].
- Ameer 2021-06-11 #
There is an error in the second-to-last code block. It should have been True as well.
This is what I got:
>>> a = 'test'*6
>>> b = 'test'*6
>>> len(a), id(a), id(b), a is b
(24, 139946711810096, 139946711810096, True)
- Tharunika 2021-11-21 #
S = "Hello world"
print(S.count(""))
It prints 12 as output. How? Can anyone explain? If I put a space between the quotes it gives 1 as output; if I don't, it gives 12.
- Rajat 2024-02-03 #
Because there is an empty string between each pair of characters, plus one before the first character and one after the last: 12 in total.
- Sergi 2022-03-05 #
Thanks for the post. I got a few things clear and useful that I did not get after reading a few other posts.
- Guy 2022-06-20 #
thanks great article! can you please explain the following:
a = "shai"
b = "shai"
print(a is b) # True
a = "shai!"
b = "shai!"
print(a is b) # False
- Guy 2022-06-20 #
Yes but why does the '!' cause this different behavior as it's one of the ascii characters and should be stored the same?
- Guy 2022-10-20 #
But the code below seems to contradict the model of a string being stored as a contiguous block of characters. It looks more like an array of references/pointers to individual characters, since the o in both strings points to the same object:
s1 = "hello"
s2 = "world"
id(s1[4])
140195535215024
id(s2[1])
140195535215024
So, should I see string as an array of characters or array of references to character objects?
- Pseudo 2024-01-30 #
I have one doubt:
import sys
print(sys.getsizeof("hello world hello world"))
72
print(sys.getsizeof(["hello world hello world"]))
64
print(sys.getsizeof(("hello world hello world",)))
48
Why does this happen? Do a list or tuple take less space than the string they contain?
- Anshul 2024-04-23 #
Fantastic post! I have 2 doubts here; I would really appreciate it if someone could help.
1. How does indexing work in Python? How is it different from other languages like C++? E.g. we store a = 'hello world' and b = 'good morning'; now id(a[4]) and id(b[1]) are the same. That means h, e, l, ... are not stored contiguously here. Then how do Python strings decide which character to pick?
2. If I have two strings a = 'hello world' and b = 'hello world', why are the ids of a and b not the same?
Thanks in advance.
- Dan 2024-04-28 #
Python doesn't have a character type, just strings. When you do a[4], you are creating a new temporary string "o". Same for b[1]. There are two things that may happen here: a) since there is no reference to a[4], it gets garbage collected, and b[1] is allocated in the freed memory; b) more likely, the temporary string "o" is interned, so b[1] just points to the same string as a[4]. Either way, that results in id(a[4]) and id(b[1]) being the same.
In the case where you have a = 'hello world' and b = 'hello world', since string interning only works with small strings, a and b may be two different strings stored in different places in memory. Hence the possibly different ids.
Nice article! Can I translate it into Chinese with a source link?
Looking forward to your reply.