Artem Golubin

Hexora v0.3: New features and improvements

Thu, 25 Jun 2026 15:14:08 -0000

Recently, I've improved my Python library, hexora. I wrote it to detect malicious Python code using static analysis.

In the new v0.3.0 release, I've added new detections, and hexora now also uses a simple machine learning model to assess the entire Python file. The model uses code structure features, semantic features, and static code analysis.

Although the model can detect malicious code without any detections coming from static analysis, its main use case is to filter false positives.

I've been testing it against newly published PyPI packages and it detects 2-10 new malicious packages each day.

Due to the number of published packages, before the machine learning model, I was getting around 5-10 false positives for 1[......]

Using local ClickHouse for data processing

Sat, 06 Jun 2026 14:14:08 -0000

I've done a lot of data engineering work in my career.

When you work a lot with data, you often get quick requests to extract some cold data and process it. Since the data is cold, it usually resides on S3.

For example, one of the typical requests in the past was to count unique values from an old MongoDB backup or CSV dump with 100GB of compressed data.

I quickly learned that using throwaway Python scripts would not work well. Often, the data is too big to fit into memory to sort and deduplicate. Maintaining a Spark cluster was not worth it for this kind of work.

So, what I would do is something like this:

LC_ALL=C aws s3 cp s3://bucket/data.json.gz - \
  | gzip -dc \[......]



NULLs in ClickHouse can hurt performance
Wed, 03 Jun 2026 14:14:08 -0000
When coming from relational databases, NULLs are the go-to for optional fields.
Using them in ClickHouse can lead to unexpected and often unnoticeable performance degradation.
This article explains why.
PostgreSQL
When using null values in PostgreSQL, you rarely notice any difference.
In PG, columns are nullable by default and you can index them.
Internally, a row in PostgreSQL can carry a bitmap that indicates which columns are NULL. The bitmap is only present
when that particular row contains NULL values, and a bit flag in the row marks its presence.
PostgreSQL is a row-oriented database, so when you read a row, you read all the columns together.
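To make the idea of an optional per-row bitmap concrete, here is a toy sketch in Python. It is not PostgreSQL's actual on-disk format: the function name `encode_row`, the single flag byte, and the fixed 4-byte integer columns are all assumptions made for illustration.

```python
def encode_row(values):
    """Toy row encoding: flag byte, optional null bitmap, then non-NULL values.

    Hypothetical layout for illustration only; real PostgreSQL tuples differ.
    """
    has_nulls = any(v is None for v in values)
    out = bytearray([1 if has_nulls else 0])  # flag: is a null bitmap present?
    if has_nulls:
        bitmap = bytearray((len(values) + 7) // 8)  # one bit per column
        for i, v in enumerate(values):
            if v is not None:
                bitmap[i // 8] |= 1 << (i % 8)  # bit set means "value present"
        out += bitmap
    for v in values:
        if v is not None:
            # Assumption: every column is a signed 4-byte integer.
            out += v.to_bytes(4, "little", signed=True)
    return bytes(out)


full_row = encode_row([1, 2, 3])        # no NULLs, so no bitmap is stored
sparse_row = encode_row([1, None, 3])   # one NULL, so a 1-byte bitmap is added
print(len(full_row), len(sparse_row))
```

In this sketch, a row without NULLs pays only for the flag, while a row with a NULL pays for the bitmap but skips storing the missing value.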
ClickHouse
Unlike PostgreSQL, ClickHouse is a columnar database.
Instead of storing data by rows, it organizes them by columns.
Each column is stored separately as a contiguous block of data.
Let's suppose we have a table that stores HTTP logs.
We want to store the visitor's user ID, which can be empty for anonymous users.[......]


PyPI packages are increasing rapidly
Sun, 17 May 2026 14:14:08 -0000
PyPI is the main repository for Python packages.
One thing that I've noticed recently is how quickly the number of packages published per week is growing.
Let's look at the number of new package versions published per week:


There are some dips in the data, but that's because of how the data was collected.
We can see a clear increase in the number of published packages, especially in the last few months.
Because of AI, the number of packages published per week has increased by 30% since 2025.
I'm working on hexora, a library that detects malicious Python code in packages.
It monitors newly published PyPI packages in real time and analyzes them.
A lot of recently published packages are purely vibecoded, and they trigger false positive detections when my tool analyzes them.[......]


The rise of malicious repositories on GitHub
Sun, 15 Mar 2026 17:14:08 -0000
There is an ongoing surge of malicious repositories on GitHub, and the sad thing about it is that
GitHub does not seem to care much.
About 10 days ago, I searched for a repo on DuckDuckGo and stumbled upon a fake GitHub repo.
It mimics a legitimate repository, but instead of providing usual releases, it only provides malicious Windows binaries.
Linux/macOS binaries are not available, and the information on how to build the project was removed from the README file.
The description was also altered using LLMs, removing a lot of technical details.
I reported this repository to GitHub, explaining the problem and showing the report from VirusTotal.
To this day, the repository is still there, and the binaries are still available for download.
The repo has been active for two months. The README gets updated every hour so that the repo ranks higher in
GitHub search.

Today, I saw another case of this on X,[......]


Do not fall for complex technology
Thu, 22 Jan 2026 17:14:08 -0000
Fifteen years ago, I wanted to set up a note-taking system.
At the time, Evernote was the tool everyone was talking about, so choosing it seemed like the right and easy decision.
After storing around 500 notes for eight years, Evernote became a mess to use.
It was bloated, heavily monetized, and slow to work with. So I wanted to switch.
Around that time, Notion came along. Everyone was talking about it. I jumped on the bandwagon and migrated a few hundred of
my still-relevant notes to it. It did not even occur to me that switching from a bloated and slow app to
a web app would result in a similar outcome later. I followed a popular choice again.
After struggling for a year, I switched to Markdown notes and a plugin for an editor that renders inline images.
I'm still using this to this day. It is simple, and I will be able to open my notes 10-20 years later.
I can edit them in any editor. It works offline and does not depend on commercial products.
I encrypt my notes locally so they can be stored safely on any cloud service.[......]


How ClickHouse handles strings
Fri, 16 Jan 2026 17:14:08 -0000
At my work, we use ClickHouse to process billions of records and hundreds of terabytes of data.
ClickHouse is fast, and its speed got me curious to learn some of its internals.
Let's look at a few queries:
SELECT count(*)
FROM cluster.feed
WHERE state = 'unknown'

   ┌──────count()─┐
1. │ 129375618342 │ -- 129.38 billion[......]


You probably don't need Oh My Zsh
Fri, 09 Jan 2026 17:14:08 -0000
Oh My Zsh is still getting recommended a lot.
The main problem with Oh My Zsh is that it adds a lot of unnecessary bloat that affects shell startup time.
Since OMZ is written in shell scripts, every time you open a new terminal tab, it has to interpret all those scripts.
Most likely, you don't need OMZ at all.
Here are the timings from the default setup with a few plugins (git, zsh-autosuggestions, zsh-autocomplete) that are usually recommended:
➜  ~ /usr/bin/time -f "%e seconds" zsh -i -c exit
0.38 seconds


And that's only for the prompt and a new shell instance, without measuring the git and virtual env plugins (which are often used for Python).[......]


Recent optimizations in Python's Reference Counting
Sun, 04 Jan 2026 17:14:08 -0000
It's been a while since I've written about CPython internals and its optimizations.
My last article on garbage collection was written 8 years ago.
A lot of small optimizations have been added since then. In this article, I will highlight a new optimization for
reference counting that uses a static lifetime analysis.
Background on reference counting in CPython
Reference counting is the primary memory management technique used in CPython.
In short, every Python object (the actual value behind a variable) has a reference counter field that tracks how many references point to it.
When an object's reference count drops to zero, the memory occupied by that object is immediately deallocated.
For hot loops, this can lead to significant overhead due to the frequent incrementing and decrementing of reference counts.
The counter must be updated whenever an object is referenced or dereferenced, which hurts performance and thrashes CPU caches.
So, when you read a variable in Python, you actually write to the memory as well.[......]
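The counter updates described above can be observed from Python itself. The following is a minimal sketch using `sys.getrefcount` from the standard library; the helper name `refcount_delta` is made up for this example. Absolute counts vary between CPython versions and builds, so the sketch only compares differences.

```python
import sys


def refcount_delta():
    obj = object()                    # a fresh object created just for this demo
    before = sys.getrefcount(obj)     # absolute value is version-dependent
    holder = [obj]                    # the list stores one more strong reference
    after = sys.getrefcount(obj)
    del holder                        # freeing the list releases that reference
    released = sys.getrefcount(obj)
    return before, after, released


before, after, released = refcount_delta()
# Print differences rather than absolute counts.
print(after - before, released - before)
```

Storing the object in a list raises its counter by one, and deleting the list brings it back down, without the object itself being modified.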


Hash tables in Go and advantage of self-hosted compilers
Sun, 14 Dec 2025 17:14:08 -0000
Recently, I was looking at a Go codebase that was using map[int]bool to track unique values.
As some of you may know, Go has no set data structure. Instead, developers usually use hash maps (key-value data structure).
The first idea that comes to mind is to use map[int]bool, where keys are integers and values are booleans.
But experienced Go developers know that Go has an empty-struct type that can be used for maps: map[int]struct{}.
The benefit of such a type is that it occupies zero bytes in memory; it's a so-called zero-sized type.
The compiler knows this and uses it to your advantage when it can. In this case, it should omit storing values and
keep only keys.
So when I saw the map[int]bool, my first thought was to switch to map[int]struct{} to save some memory.
In theory, that map can hold more than 100 000 integers.
To my surprise, this change had no effect on memory consumption when running in production.[......]