How ClickHouse handles strings
At my work, we use ClickHouse to process billions of records and hundreds of terabytes of data. ClickHouse is fast, and its speed got me curious to learn some of its internals.
Let's look at a few queries:
SELECT count(*)
FROM cluster.feed
WHERE state = 'unknown'
┌──────count()─┐
1. │ 129375618342 │ -- 129.38 billion
└──────────────┘
1 row in set. Elapsed: 17.030 sec.
Processed 222.08 billion rows, 392.14 GB (13.04 billion rows/s., 23.03 GB/s.)
Peak memory usage: 5.10 MiB.
SELECT count(*)
FROM cluster.feed
WHERE data = 'string_of_46_characters'
┌─count()─┐
1. │   17047 │
└─────────┘
1 row in set. Elapsed: 202.158 sec.
Processed 222.11 billion rows, 38.47 TB (1.10 billion rows/s., 190.31 GB/s.)
Peak memory usage: 1.24 GiB.
It takes 17 seconds to process 222 billion records and filter one column by a string value on a 6-machine cluster.
The state field is not indexed and has a few thousand unique values.
ClickHouse is a columnar database, and it has none of the classic indexes you would expect in almost
any relational database. In the schema, the state field is defined as LowCardinality(String), which hints
to ClickHouse that the field has a limited number of unique values.
The second query filters the data field by a 46-character string.
It takes much more time, but it scans 38.47 TB of data in about 3 minutes. Each value in that column can store up to 512 KB of data.
That's why it takes longer, but it's still pretty fast considering the amount of data scanned without any indexes.
In this article, I will explain some of the implementation details behind string handling in ClickHouse.
String comparison
String comparison is a fundamental operation in any database.
Internally, ClickHouse splits strings into three size categories (the overall dispatch is sketched right after this list):
- Short strings (16 bytes or fewer)
- Medium strings (17 to 63 bytes)
- Long strings (64 bytes and larger)
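Roughly, the top-level dispatch over these categories looks like the sketch below. The helper names are mine, not ClickHouse identifiers; each one corresponds to the code shown in the following sections.
#include <cstddef>

/// Illustrative per-size routines (hypothetical names); each corresponds to
/// one of the snippets shown in the sections below.
bool equalShort(const char * p1, const char * p2, size_t size);   /// <= 16 bytes: overlapping scalar loads
bool equalMedium(const char * p1, const char * p2, size_t size);  /// 17..63 bytes: unrolled 16-byte SIMD compares
bool equalLong(const char * p1, const char * p2, size_t size);    /// >= 64 bytes: 64-byte SIMD loop

bool memoryEqual(const char * p1, const char * p2, size_t size)
{
    if (size <= 16)
        return equalShort(p1, p2, size);
    if (size < 64)
        return equalMedium(p1, p2, size);
    return equalLong(p1, p2, size);
}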
Short strings
Short strings are compared in a clever way:
/** The order of branches and the trick with overlapping comparisons
  * are the same as in memcpy implementation.
  * See the comments in base/glibc-compatibility/memcpy/memcpy.h
  */
if (size <= 16)
{
    if (size >= 8)
    {
        /// Chunks of 8..16 bytes.
        return unalignedLoad<uint64_t>(p1) == unalignedLoad<uint64_t>(p2)
            && unalignedLoad<uint64_t>(p1 + size - 8) == unalignedLoad<uint64_t>(p2 + size - 8);
    }

    if (size >= 4)
    {
        /// Chunks of 4..7 bytes.
        return unalignedLoad<uint32_t>(p1) == unalignedLoad<uint32_t>(p2)
            && unalignedLoad<uint32_t>(p1 + size - 4) == unalignedLoad<uint32_t>(p2 + size - 4);
    }

    if (size >= 2)
    {
        /// Chunks of 2..3 bytes.
        return unalignedLoad<uint16_t>(p1) == unalignedLoad<uint16_t>(p2)
            && unalignedLoad<uint16_t>(p1 + size - 2) == unalignedLoad<uint16_t>(p2 + size - 2);
    }

    if (size >= 1)
    {
        /// A single byte.
        return *p1 == *p2;
    }

    return true;
}
The way the code is structured may look weird at first, but it's actually optimized for branch prediction, so the CPU can execute it faster.
Depending on the string size, the comparison can result in overlapping memory reads.
For example, let's use a 6-byte string foobar.
The first read will load foob and the second read will load obar.
| Slice | Result | Indices |
|---|---|---|
| Full string | foobar | 0 - 5 |
| Load 1 | foob | 0 - 3 |
| Load 2 | obar | 2 - 5 |
As you can see, the code checks some of the same characters twice. That seems inefficient, right?
As it turns out, all 4-7 byte strings follow the same code path. The CPU's branch predictor sees the same pattern every time, which leads to far fewer mispredictions. On modern CPUs, the two load-and-compare operations are usually executed in parallel.
Optimizing for fewer mispredictions results in faster comparisons for this particular task.
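To make the overlap concrete, here is a minimal standalone sketch of the 4..7 byte branch, with unalignedLoad replaced by a memcpy-based load (a common portable way to express an unaligned read); the names are mine.
#include <cstddef>
#include <cstdint>
#include <cstring>

/// Portable unaligned 32-bit load; compilers turn the memcpy into a single load.
static inline uint32_t load32(const char * p)
{
    uint32_t value;
    std::memcpy(&value, p, sizeof(value));
    return value;
}

/// Equality check for strings of 4..7 bytes using two possibly overlapping reads.
/// For "foobar" (size 6) the reads cover bytes 0..3 ("foob") and 2..5 ("obar").
static inline bool equal4to7(const char * p1, const char * p2, size_t size)
{
    return load32(p1) == load32(p2)
        && load32(p1 + size - 4) == load32(p2 + size - 4);
}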
Medium strings
The code for medium strings looks as follows:
switch (size / 16)
{
    case 3:
        if (!compare16(p1 + 32, p2 + 32))
            return false;
        [[fallthrough]];
    case 2:
        if (!compare16(p1 + 16, p2 + 16))
            return false;
        [[fallthrough]];
    case 1:
        if (!compare16(p1, p2))
            return false;
        [[fallthrough]];
    default: ;
}

return compare16(p1 + size - 16, p2 + size - 16);
It looks like some voodoo magic at first.
The [[fallthrough]] attribute tells the compiler that continuing into the next case is intentional (a case without a break falls through anyway; the attribute only suppresses the warning).
So, if a string is 40 bytes long, switch (size / 16) will evaluate to 2.
There will be three comparisons:
- Compare bytes 16-31 (switch case 2)
- Compare bytes 0-15 (case 1, reached by falling through from case 2)
- Compare bytes 24-39 (final return)
The medium-sized string comparison uses the same overlapping technique as the short strings.
If the target CPU supports SSE2 or Arm Neon, ClickHouse uses SIMD instructions to compare 16 bytes at a time.
The switch statement is used on purpose instead of a loop: it acts as a manually unrolled loop.
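For reference, a compare16 along these lines is a plausible shape for the 16-byte helper on an SSE2 target (a simplified sketch, not the exact ClickHouse source):
#include <emmintrin.h>  /// SSE2 intrinsics

/// Compare 16 bytes at a time: byte-wise equality yields 0xFF per matching
/// byte, and movemask packs the high bits into a 16-bit mask.
static inline bool compare16(const char * p1, const char * p2)
{
    const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p1));
    const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p2));
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}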
Long strings
Comparison of long strings is relatively straightforward:
while (size >= 64)
{
    if (compare64(p1, p2))
    {
        p1 += 64;
        p2 += 64;
        size -= 64;
    }
    else
        return false;
}
The string is processed in chunks of 64 bytes; whatever is left at the end (fewer than 64 bytes) is handled by the shorter-string code paths.
The compare64 function uses SIMD instructions if available. It loads 64 bytes into eight 16-byte SIMD registers (four for each string)
and compares them in parallel.
The main speed-up here comes from SIMD instructions that allow comparing 64 bytes, i.e. up to 64 ASCII characters, at once.
When ClickHouse runs on an architecture that doesn't support SIMD instructions, a regular memcmp is used instead.
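Under the same SSE2 assumption, compare64 can be sketched as four 16-byte comparisons whose byte-equality masks are combined, so all 64 bytes are validated with a single movemask at the end (again a simplified sketch, not the exact source):
#include <emmintrin.h>  /// SSE2 intrinsics

static inline bool compare64(const char * p1, const char * p2)
{
    /// Byte-wise equality mask for one 16-byte slice of each string.
    auto eq16 = [](const char * a, const char * b)
    {
        return _mm_cmpeq_epi8(
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(a)),
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(b)));
    };

    /// AND the four masks: a lane stays 0xFF only if that byte matched in
    /// every 16-byte slice, so a single movemask checks all 64 bytes.
    const __m128i combined = _mm_and_si128(
        _mm_and_si128(eq16(p1, p2), eq16(p1 + 16, p2 + 16)),
        _mm_and_si128(eq16(p1 + 32, p2 + 32), eq16(p1 + 48, p2 + 48)));

    return _mm_movemask_epi8(combined) == 0xFFFF;
}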
You can find the complete implementation of string comparison here.
String hashing
Hash maps are used a lot in ClickHouse. Even my example query uses multiple aggregations under the hood.
When keys have a string type, the performance of the hash map depends on two key operations: string comparison and hash generation.
For performance-critical parts, ClickHouse uses a CRC32-based hash when possible. This seems like a weird choice, because CRC32 checksum values are not distributed well. It's used because both ARM and x86 (SSE 4.2) have hardware instructions that compute CRC32 very fast. When they are not available, ClickHouse falls back to one of the versions of CityHash64.
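To illustrate why the hardware instruction makes this cheap, here is a simplified sketch of a CRC32-based string hash using the SSE 4.2 intrinsics (not ClickHouse's exact implementation; compile with -msse4.2):
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <nmmintrin.h>  /// SSE 4.2: _mm_crc32_u64 / _mm_crc32_u8

/// Hash a string by feeding it to the hardware CRC32 instruction,
/// 8 bytes per step, then byte by byte for the tail.
inline uint64_t crc32Hash(const char * data, size_t size)
{
    uint64_t crc = UINT64_MAX;
    while (size >= 8)
    {
        uint64_t word;
        std::memcpy(&word, data, 8);  /// unaligned 8-byte load
        crc = _mm_crc32_u64(crc, word);
        data += 8;
        size -= 8;
    }
    while (size > 0)
    {
        crc = _mm_crc32_u8(static_cast<uint32_t>(crc), static_cast<uint8_t>(*data));
        ++data;
        --size;
    }
    return crc;
}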
LowCardinality strings
As I mentioned before, ClickHouse has a special data type LowCardinality(String).
It's used when you know that the number of unique string values is limited.
It works best when the number of unique values is below 10,000.
Instead of storing string values directly, ClickHouse creates a dictionary of unique values and maps them to unique integer ids.
Because of that, a lot of operations, such as filtering and grouping, are done on integers instead of strings. That's why the first query in this article is so fast.
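Conceptually, the dictionary encoding can be pictured like this (an illustration of the idea, not ClickHouse's actual ColumnLowCardinality):
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

/// Illustration of dictionary encoding: each distinct string gets a small
/// integer id, and the column itself stores only the ids.
struct LowCardinalityColumn
{
    std::vector<std::string> dictionary;            /// id -> unique value
    std::unordered_map<std::string, uint32_t> ids;  /// value -> id
    std::vector<uint32_t> rows;                     /// one id per row

    void insert(const std::string & value)
    {
        auto [it, inserted] = ids.try_emplace(value, static_cast<uint32_t>(dictionary.size()));
        if (inserted)
            dictionary.push_back(value);
        rows.push_back(it->second);
    }

    /// A filter like WHERE state = 'unknown' becomes one dictionary lookup
    /// followed by integer comparisons over the rows.
    size_t countEqual(const std::string & value) const
    {
        const auto it = ids.find(value);
        if (it == ids.end())
            return 0;
        size_t count = 0;
        for (const uint32_t id : rows)
            count += (id == it->second);
        return count;
    }
};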
Compression
Strings are usually highly compressible. As it turns out, disk IO, even for SSDs, can become a bottleneck.
When the data is compressed with fast algorithms such as LZ4 or zstd, ClickHouse can read less data from disk
at the cost of some extra CPU usage, and the extra CPU time is much cheaper than the disk IO it saves.
Modern compression algorithms are very fast. For example, LZ4 decompression is only about 2.7 times slower than
a plain memory copy (4970 MB/s vs 13700 MB/s for memcpy).
My sample queries report the uncompressed size; the compressed data on disk is 3-10 times smaller. That's why the reported scan numbers are so high (38.47 TB read in about 3 minutes).
Since ClickHouse is a columnar database, each column is stored as blocks of contiguous data. In practice, that compresses very well for all data types, even integers. It doesn't work as well for row-oriented databases, because they store rows of data instead: a row mixes different kinds of data, and that makes compression less effective.
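To get a feel for how well a contiguous column of similar values compresses, here is a minimal sketch against the LZ4 C API (it assumes liblz4 is installed; this is an illustration, not ClickHouse code):
#include <lz4.h>

#include <cstring>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    /// A fake "column": the same short value repeated, stored contiguously.
    std::string column;
    for (int i = 0; i < 100000; ++i)
        column += "unknown";

    std::vector<char> compressed(LZ4_compressBound(static_cast<int>(column.size())));
    const int compressed_size = LZ4_compress_default(
        column.data(), compressed.data(),
        static_cast<int>(column.size()), static_cast<int>(compressed.size()));
    std::cout << "raw: " << column.size() << " bytes, compressed: " << compressed_size << " bytes\n";

    /// Decompression is what the read path pays for, and it is very fast.
    std::vector<char> restored(column.size());
    const int restored_size = LZ4_decompress_safe(
        compressed.data(), restored.data(),
        compressed_size, static_cast<int>(restored.size()));
    std::cout << "restored " << restored_size << " bytes, intact: "
              << (std::memcmp(restored.data(), column.data(), column.size()) == 0) << "\n";
}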
You can read more about it here.
If you have any questions, feel free to ask them via e-mail displayed in the footer.