<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Artem Golubin</title><link>https://rushter.com/blog/feed/</link><description>Python, Machine learning, NLP, websec, etc.</description><language>en</language><lastBuildDate>Sat, 06 Jun 2026 14:14:08 -0000</lastBuildDate><item><title>Using local ClickHouse for data processing</title><link>https://rushter.com/blog/clickhouse-data-processing/</link><description>&lt;p&gt;I did a lot of data engineering work in my career.&lt;/p&gt;
&lt;p&gt;When you work a lot with data, you often get quick requests to extract some cold data and process it.
Since the data is cold, it usually resides on S3.&lt;/p&gt;
&lt;p&gt;For example, one of the typical requests in the past was to count unique values from an old MongoDB backup or CSV dump with 100GB of compressed data.&lt;/p&gt;
&lt;p&gt;I quickly learned that using throwaway Python scripts would not work well.
Often, the data is too big to fit into memory to sort and deduplicate.
Maintaining a Spark cluster was not worth it for this kind of work.&lt;/p&gt;
&lt;p&gt;So, what I would do is something like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;LC_ALL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;C&lt;span class="w"&gt; &lt;/span&gt;aws&lt;span class="w"&gt; &lt;/span&gt;s3&lt;span class="w"&gt; &lt;/span&gt;cp&lt;span class="w"&gt; &lt;/span&gt;s3://bucket/data.json.gz&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;gzip&lt;span class="w"&gt; &lt;/span&gt;-dc&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;[......]</description><pubDate>Sat, 06 Jun 2026 14:14:08 -0000</pubDate><guid>https://rushter.com/blog/clickhouse-data-processing/</guid></item><item><title>NULLs in ClickHouse can hurt performance</title><link>https://rushter.com/blog/clickhouse-nulls/</link><description>&lt;p&gt;When coming from relational databases, NULLs are the go-to for optional fields.
Using them in ClickHouse can lead to unexpected and often unnoticeable performance degradation.
This article explains why.&lt;/p&gt;
&lt;h4&gt;PostgreSQL&lt;/h4&gt;
&lt;p&gt;When using null values in PostgreSQL, you rarely notice any difference.
In PG, columns are nullable by default and you can index them.&lt;/p&gt;
&lt;p&gt;Internally, each row in PostgreSQL can carry a bitmap that indicates which columns are NULL. The bitmap is only present when
the row contains null values, which is indicated by a bit flag in the row header.&lt;/p&gt;
&lt;p&gt;PostgreSQL is a row-oriented database, so when you read a row, you read all the columns together.&lt;/p&gt;
&lt;h4&gt;ClickHouse&lt;/h4&gt;
&lt;p&gt;Unlike PostgreSQL, ClickHouse is a columnar database.
Instead of storing data by rows, it organizes them by columns.
Each column is stored separately as a contiguous block of data.&lt;/p&gt;
&lt;p&gt;Let's suppose we have a table that stores HTTP logs.
We want to store the visitor's user ID, which can be empty for anonymous users.&lt;/p&gt;[......]</description><pubDate>Wed, 03 Jun 2026 14:14:08 -0000</pubDate><guid>https://rushter.com/blog/clickhouse-nulls/</guid></item><item><title>PyPI packages are increasing rapidly</title><link>https://rushter.com/blog/pypi-packages/</link><description>&lt;p&gt;PyPI is the main repository for Python packages.
One thing that I've noticed recently is the number of published packages per week.&lt;/p&gt;
&lt;p&gt;Let's look at the number of new package versions published per week:&lt;/p&gt;
&lt;p&gt;&lt;img src="/static/uploads/img/2026/pypi_stats_weekly.png" class="ui centered image" &gt;&lt;/p&gt;

&lt;p&gt;There are some dips in the data, but that's because of how the &lt;a href="https://clickpy.clickhouse.com/" target="_blank"&gt;data&lt;/a&gt; was collected.
We can see a clear increase in the number of published packages, especially in the last few months.&lt;/p&gt;
&lt;p&gt;Because of AI, the number of packages published per week has increased by 30% since 2025.&lt;/p&gt;
&lt;p&gt;I'm working on &lt;a href="https://github.com/rushter/hexora" target="_blank"&gt;hexora&lt;/a&gt;, a library that detects malicious Python code in packages.
It monitors newly published PyPI packages in real time and analyzes them.&lt;/p&gt;
&lt;p&gt;A lot of recently published packages are purely vibecoded, and they trigger false-positive detections when my tool analyzes them.[......]</description><pubDate>Sun, 17 May 2026 14:14:08 -0000</pubDate><guid>https://rushter.com/blog/pypi-packages/</guid></item><item><title>The rise of malicious repositories on GitHub</title><link>https://rushter.com/blog/github-malware/</link><description>&lt;p&gt;There is an ongoing surge of malicious repositories on GitHub, and the sad thing about it is that
GitHub seems not to care much.&lt;/p&gt;
&lt;p&gt;About 10 days ago, I searched for a repo on DuckDuckGo and stumbled upon a fake GitHub repo.
It mimics a legitimate repository, but instead of providing usual releases, it only provides malicious Windows binaries.
Linux/macOS binaries are not available, and the instructions on how to build the project were removed from the README file.&lt;/p&gt;
&lt;p&gt;The description was also altered using LLMs, removing a lot of technical details.&lt;/p&gt;
&lt;p&gt;I reported this repository to GitHub, explaining the problem and showing the report from VirusTotal.
To this day, the repository is still there, and the binaries are still available for download.&lt;/p&gt;
&lt;p&gt;The repo has been active for two months. The README gets updated every hour so that the repo ranks higher in
GitHub search.&lt;/p&gt;
&lt;p&gt;&lt;img src="/static/uploads/img/2026/vt.png"&gt;&lt;/p&gt;
&lt;p&gt;Today, I saw another case of this on &lt;a href="https://x.com/rebane2001/status/2033208600072425780" target="_blank"&gt;X&lt;/a&gt;,[......]</description><pubDate>Sun, 15 Mar 2026 17:14:08 -0000</pubDate><guid>https://rushter.com/blog/github-malware/</guid></item><item><title>Do not fall for complex technology</title><link>https://rushter.com/blog/complex-tech/</link><description>&lt;p&gt;Fifteen years ago, I wanted to set up a note-taking system.
At the time, Evernote was the tool everyone was talking about, so choosing it seemed like the right and easy decision.&lt;/p&gt;
&lt;p&gt;After storing around 500 notes for eight years, Evernote became a mess to use.
It was bloated, heavily monetized, and slow to work with. So I wanted to switch.&lt;/p&gt;
&lt;p&gt;Around that time, Notion came along. Everyone was talking about it. I jumped on the bandwagon and migrated a few hundred of
my notes that were still relevant. It did not even occur to me that switching from a bloated and slow app to
a web app would lead to a similar outcome later. I followed the popular choice again.&lt;/p&gt;
&lt;p&gt;After struggling for a year, I switched to Markdown notes and a plugin for an editor that renders inline images.
I'm still using this to this day. It is simple, and I will be able to open my notes 10-20 years later.
I can edit them in any editor. It works offline and does not depend on commercial products.
I encrypt my notes locally so they can be stored safely on any cloud service.&lt;/p&gt;[......]</description><pubDate>Thu, 22 Jan 2026 17:14:08 -0000</pubDate><guid>https://rushter.com/blog/complex-tech/</guid></item><item><title>How ClickHouse handles strings</title><link>https://rushter.com/blog/clickhouse-strings/</link><description>&lt;p&gt;At my work, we use ClickHouse to process billions of records and hundreds of terabytes of data.
ClickHouse is fast, and its speed got me curious to learn some of its internals.&lt;/p&gt;
&lt;p&gt;Let's look at a few queries:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feed&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;unknown&amp;#39;&lt;/span&gt;

&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;┌──────&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="err"&gt;─┐&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;│&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;129375618342&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;│&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;-- 129.38 billion&lt;/span&gt;[......]</description><pubDate>Fri, 16 Jan 2026 17:14:08 -0000</pubDate><guid>https://rushter.com/blog/clickhouse-strings/</guid></item><item><title>You probably don't need Oh My Zsh</title><link>https://rushter.com/blog/zsh-shell/</link><description>&lt;p&gt;Oh My Zsh is still getting recommended a lot.
The main problem with Oh My Zsh is that it adds a lot of unnecessary bloat that affects shell startup time.&lt;/p&gt;
&lt;p&gt;Since OMZ is written in shell scripts, every time you open a new terminal tab, it has to interpret all those scripts.
Most likely, you don't need OMZ at all.&lt;/p&gt;
&lt;p&gt;Here are the timings from the default setup with a few plugins (git, zsh-autosuggestions, zsh-autocomplete) that are usually recommended:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;➜&lt;span class="w"&gt;  &lt;/span&gt;~&lt;span class="w"&gt; &lt;/span&gt;/usr/bin/time&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;%e seconds&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;zsh&lt;span class="w"&gt; &lt;/span&gt;-i&lt;span class="w"&gt; &lt;/span&gt;-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exit&lt;/span&gt;
&lt;span class="m"&gt;0&lt;/span&gt;.38&lt;span class="w"&gt; &lt;/span&gt;seconds
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And that's only for prompt and a new shell instance, without actually measuring the git plugin and virtual env plugins (which are often used for Python).[......]</description><pubDate>Fri, 09 Jan 2026 17:14:08 -0000</pubDate><guid>https://rushter.com/blog/zsh-shell/</guid></item><item><title>Recent optimizations in Python's Reference Counting</title><link>https://rushter.com/blog/python-refcount/</link><description>&lt;p&gt;It's been a while since I've written about CPython internals and its optimizations.
My last article on garbage collection was written 8 years ago.&lt;/p&gt;
&lt;p&gt;A lot of small optimizations have been added since then. In this article, I will highlight a new optimization for
reference counting that uses static lifetime analysis.&lt;/p&gt;
&lt;h3&gt;Background on reference counting in CPython&lt;/h3&gt;
&lt;p&gt;Reference counting is the primary memory management technique used in CPython.&lt;/p&gt;
&lt;p&gt;In short, every Python object (the actual value behind a variable) has a reference counter field that tracks how many references point to it.
When an object's reference count drops to zero, the memory occupied by that object is immediately deallocated.&lt;/p&gt;
&lt;p&gt;For hot loops, this can lead to significant overhead due to the frequent incrementing and decrementing of reference counts.
The counter must be updated whenever an object is referenced or dereferenced, which hurts performance and thrashes CPU caches.&lt;/p&gt;
&lt;p&gt;So, when you read a variable in Python, you actually &lt;strong&gt;write&lt;/strong&gt; to memory as well.&lt;/p&gt;[......]</description><pubDate>Sun, 04 Jan 2026 17:14:08 -0000</pubDate><guid>https://rushter.com/blog/python-refcount/</guid></item><item><title>Hash tables in Go and advantage of self-hosted compilers</title><link>https://rushter.com/blog/go-and-hashmaps/</link><description>&lt;p&gt;Recently, I was looking at a Go codebase that was using &lt;code&gt;map[int]bool&lt;/code&gt; to track unique values.
As some of you may know, Go has no set data structure. Instead, developers usually use hash maps (a key-value data structure).&lt;/p&gt;
&lt;p&gt;The first idea that comes to mind is to use &lt;code&gt;map[int]bool&lt;/code&gt;, where keys are integers and values are booleans.
But experienced Go developers know that Go has an empty-struct type that can be used for maps: &lt;code&gt;map[int]struct{}&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The benefit of such a type is that it occupies zero bytes in memory; it's a so-called zero-sized type.
The compiler knows this and uses it to your advantage when it can. In this case, it should omit storing values and
keep only keys.&lt;/p&gt;
&lt;p&gt;So when I saw the &lt;code&gt;map[int]bool&lt;/code&gt;, my first thought was to switch to &lt;code&gt;map[int]struct{}&lt;/code&gt; to save some memory.
In theory, that map can hold more than 100 000 integers.&lt;/p&gt;
&lt;p&gt;To my surprise, this change had no effect on memory consumption when running in production.&lt;/p&gt;[......]</description><pubDate>Sun, 14 Dec 2025 17:14:08 -0000</pubDate><guid>https://rushter.com/blog/go-and-hashmaps/</guid></item><item><title>How I am using Helix editor</title><link>https://rushter.com/blog/helix-editor/</link><description>&lt;p&gt;I've been using Helix as my editor to develop on remote servers for quite some time now.&lt;/p&gt;
&lt;p&gt;There are a lot of emerging supply-chain attacks, and I simply don't like the idea of installing tens of plugins to Vim/Neovim to make the editor usable.&lt;/p&gt;
&lt;p&gt;To make the switch from Neovim easier, I had to make some changes to the configuration.
I want to share them to save you some time, because discovering them is not straightforward.&lt;/p&gt;
&lt;h2&gt;Tmux setup&lt;/h2&gt;
&lt;p&gt;I use tmux as a terminal multiplexer.&lt;/p&gt;
&lt;p&gt;One thing that I miss from my Neovim setup is a good file manager and a TUI for git.
I rarely use a file manager, but when I need to, I usually want to move a bunch of selected files quickly. Unfortunately, Helix does not support file editing in the explorer. You can only view files.&lt;/p&gt;
&lt;p&gt;To work around this, I added new keybindings to my tmux config:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Yazi related&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-g&lt;span class="w"&gt; &lt;/span&gt;allow-passthrough&lt;span class="w"&gt; &lt;/span&gt;on[......]</description><pubDate>Sat, 11 Oct 2025 20:28:08 -0000</pubDate><guid>https://rushter.com/blog/helix-editor/</guid></item></channel></rss>