Using local ClickHouse for data processing

Last updated on June 06, 2026

I did a lot of data engineering work in my career.

When you work a lot with data, you often get quick requests to extract some cold data and process it. Since the data is cold, it usually resides on S3.

For example, one of the typical requests in the past was to count unique values from an old MongoDB backup or CSV dump with 100GB of compressed data.

I quickly learned that using throwaway Python scripts would not work well. Often, the data is too big to fit into memory to sort and deduplicate. Maintaining a Spark cluster was not worth it for this kind of work.

So, what I would do is something like this:

LC_ALL=C aws s3 cp s3://bucket/data.json.gz - \
  | gzip -dc \
  | grep '"instance":"prod"' \
  | jq -r '[.field1, .field2] | join(",")' \
  | sort --buffer-size=8G -T /tmp -j 8 \
  | uniq -c \
  | sort --buffer-size=8G -T /tmp -j 8 -nr

It's a bunch of GNU tools glued together with pipes. Often, it would also contain awk or sed to convert the output to the needed format.

It works fast, especially when you don't need to parse JSONs. When parsing large CSV files, cut can processes them very fast if you don't have quotes and escaped characters.

This pipeline does not download the whole file since we stream files from S3 on the fly. Unlike a Python script, it uses temporary files on disk to sort and deduplicate data that does not fit into memory. Even deduplication requires storing all unique elements in memory.

The LC_ALL=C is a cool trick that ensures sorting is done byte-wise and is not by unicode characters, which is faster.

In one of my previous jobs, almost everyone was familiar with such pipelines. Thanks to a very popular article: Command-line Tools can be 235x Faster than your Hadoop Cluster. Such an approached hooked a lot of people.

clickhouse-local

About one year ago, I discovered that ClickHouse has clickhouse local tool. It's a ClickHouse server that runs locally and can process data from different sources.

It's a single binary, and it does not have any dependencies. You can just copy it and you are good to go. It's meant to be used for temporary data processing and restricts network access.

The good thing about it, is that it works almost the same as a regular server, so it supports a lot of external datasources.

We can easily replicate my pipeline with it:

SELECT
    field1,
    field2,
    count() AS cnt
FROM s3(
    'https://s3.west.amazonaws.com/bucket/data.json.gz',
    'JSONEachRow'
)
WHERE isntance = 'prod'
GROUP BY
    field1,
    field2
ORDER BY
    cnt DESC
INTO OUTFILE 'result.csv'
FORMAT CSV;

By using a local server, I don't have to use sed, awk, jq just to process data and output it in the correct format. Writing SQL queries is much more convenient, and I make fewer mistakes when processing the data. All I need to do is pass the SQL query to clickhouse-local and it will do the rest.

It works fast, uses multiple cores where possible, and can spill to disk if the data does not fit into memory (although not for every query). You get almost the same benefits as a single-node ClickHouse server.

Since I'm pretty familiar with ClickHouse, I can also use its features to speed up the processing. In a more complex scenarios, I create temporary tables to hold the data in a more efficient way before doing the final processing. Things like LowCardinality strings help a lot and they are easy to plug in. You can specify smaller types if you know that the data fits into them. Unlike bash pipilines, if your data contains garbage - you will get an error.

This is such an underappreciated use case of ClickHouse and not a lot of people know about it.

ClickHouse supports different datasources, including:

  • S3
  • HDFS
  • local files
  • HTTP
  • MySQL
  • PostgreSQL
  • MongoDB
  • Kafka
  • Hive

As you can see, it can even query other databases. In theory, you can fetch data from MySQL and PostgreSQL in one query, join them together, and output the result to a file.

I haven't tried it, but it will probably work:

SELECT
    u.id AS user_id,
    u.username,
    max(l.access_time) AS last_seen
FROM postgresql(
    'pg:5432',
    'main',
    'users',
    'pg_user',
    'pg_password'
) AS u
JOIN mysql(
    'mysql:3306',
    'main',
    'access_logs',
    'mysql_user',
    'mysql_password'
) AS l
ON u.id = l.user_id
WHERE u.id = 11111
GROUP BY
    u.id,
    u.username
ORDER BY last_seen DESC;
INTO OUTFILE 'result.csv'
FORMAT CSV;

It also supports 70 output formats, including:

  • CSV
  • JSON
  • XML
  • Parquet
  • Avro
  • Protobuf

This setup is so simple, but it beats a lot of complex data processing pipelines. You don't need to be a data engineer to use it, knowing SQL is enough. If performance becomes a bottleneck, you can easily switch to a production cluster and distribute the workload across multiple nodes.

How I'm using it now

If I need to process data quickly for a one-time job, I now use a local CH server. The usual workflow involves either creating a temporary EC2 instance or using a dev server to run queries. I can also use one node from one of the production servers, but I prefer to avoid it so it does not affect production workloads.

If you need to process a really big dataset, you can easily switch to a production server and distribute the workload across the cluster. By simply replacing s3(...) with s3Cluster(..), the processing will be distributed across the cluster, where, usually, you have terabytes of memory and hundreds of CPU cores.

Both s3 and S3Cluster support partitions, so you can feed many files at once using the glob pattern:

SELECT *
FROM s3Cluster(
  'my_cluster',
  'https://bucket/data/*.json',
  'JSONEachRow'
);

As a funny note, while I was writing this article, I upgraded ClickHouse from version 25 to 26 to play with queries. As it turned out, they broke the disk spilling for local setups. It's now hardcoded to 1Gb. So, when your data can't fit in memory, use ClickHouse 25.

commit 80b735a2f79a1699102bd32ee708d68b3767532b
Author: Alexey Milovidov <milovidov@clickhouse.com>
Date:   Wed Feb 4 00:44:03 2026 +0100

    I think 1 Gi is enough

No, Alexey, 1GB is not enough, please bring it back :).


If you have any questions, feel free to ask them via e-mail displayed in the footer.
All articles on this website are written by a human.

Comments

There are no comments for this post. Be the first to share your thoughts.

Leave a comment