A ClickHouse segfault in query planning (and the 5-line fix)
tl;dr
A dev added isNotNull() to a filter on the right side of an ANY LEFT JOIN, and ClickHouse started segfaulting during join optimization with allow_experimental_analyzer=1. The root cause was operator[] default-constructing ColumnWithTypeAndName with type == nullptr. The fix was to use find() and a typed stub for missing arguments, plus a regression test.
We had just migrated a KPI dashboard from Postgres to ClickHouse. About 30 minutes into the day, a dev pinged me on Slack: they had added a WHERE clause and the query now crashed with ECONNRESET. ClickHouse was segfaulting and getting restarted by Docker.
Setup
This happened with:
- ClickHouse 25.12 in Docker on bare metal
- Native protocol on port 9000 from the TypeScript client
- allow_experimental_analyzer = 1 enabled intentionally because we needed a feature behind it
- A single dashboard route in dev that reliably triggered the crash
The repro uses HTTP on port 8123 for convenience; the crash also happens over the native protocol.
I later rebuilt ClickHouse from source on 26.1.1.1 to get a stack trace with file and line numbers. I used gdb with gef for the backtrace, and the code quality stood out: after about 37 minutes of compiling Rust, C, and C++ code, it finished on the first try with zero warnings.
Minimal reproduction
I reduced the original query to a minimal repro: an ANY LEFT JOIN, a filter on the right side, and isNotNull(...) with the analyzer enabled.
Start ClickHouse:
docker run --rm -p 8123:8123 -p 9000:9000 \
    -e CLICKHOUSE_USER=default \
    -e CLICKHOUSE_PASSWORD=default \
    clickhouse/clickhouse-server:25.12
Create a minimal schema and data set:
CREATE TABLE Session
(
    id String,
    site Enum8('STORE_A' = 1, 'STORE_B' = 2),
    device Enum8('DESKTOP' = 1, 'MOBILE' = 2)
)
ENGINE = MergeTree
ORDER BY id;

CREATE TABLE AddedToCart
(
    sessionId String,
    order Int32,
    top Nullable(Int32),
    screenHeight Nullable(Int32),
    screenWidth Nullable(Int32),
    isPromotion UInt8,
    date DateTime64(3)
)
ENGINE = MergeTree
ORDER BY (sessionId, date);

INSERT INTO Session (id, site, device) VALUES
    ('s1', 'STORE_A', 'DESKTOP'),
    ('s2', 'STORE_B', 'MOBILE');

INSERT INTO AddedToCart (sessionId, order, top, screenHeight, screenWidth, isPromotion, date) VALUES
    ('s1', 1, 100, 400, 1024, 1, parseDateTime64BestEffort('2026-01-19T12:00:00.000Z', 3)),
    ('s2', 2, 100, 400, 1024, 1, parseDateTime64BestEffort('2026-01-19T12:00:01.000Z', 3));
Enable the experimental analyzer and run the query:
SET allow_experimental_analyzer = 1;
SELECT
    s.site AS site,
    if(
        (a.order IS NULL) OR (a.order <= 0) OR (a.order > 30),
        NULL,
        accurateCastOrNull(a.order, 'Int32')
    ) AS page_level,
    count() AS count
FROM AddedToCart AS a
ANY LEFT JOIN Session AS s ON a.sessionId = s.id
WHERE (a.top IS NOT NULL)
    AND (a.screenHeight IS NOT NULL)
    AND (a.screenHeight > 0)
    AND (a.isPromotion = _CAST(1, 'UInt8'))
    AND (s.device = 'DESKTOP')
    AND isNotNull(s.site)
GROUP BY site, page_level
ORDER BY site ASC, page_level ASC
FORMAT JSONEachRow;
Trigger conditions were obvious:
- If I replaced ANY LEFT JOIN with ANY INNER JOIN, the crash disappeared.
- If I removed isNotNull(s.site), the crash disappeared.
- If I disabled allow_experimental_analyzer, the crash disappeared.
So the crash was not random and it was not about data size. It was a specific interaction between join optimization, partial evaluation, and a type-aware function (isNotNull) running at planning time.
What the stack trace pointed to
Once I rebuilt ClickHouse locally (26.1.1.1) to get symbols, the trace consistently passed through this chain:
QueryPlanOptimizations::tryConvertAnyOuterJoinToInnerJoin
-> filterResultForNotMatchedRows (src/Processors/QueryPlan/Optimizations/Utils.cpp)
-> ActionsDAG::evaluatePartialResult (src/Interpreters/ActionsDAG.cpp)
-> executeActionForPartialResult
-> FunctionIsNotNull::getConstantResultForNonConstArguments (src/Functions/isNotNull.cpp:61)
-> elem.type->onlyNull() deref when type == nullptr
ClickHouse was crashing while trying to optimize the query plan, not while executing the join itself. The optimizer evaluated a filter to see if the outer join could be simplified and ended up calling isNotNull() with an invalid argument.
Why it only happens with the analyzer
In FunctionIsNotNull::getConstantResultForNonConstArguments, ClickHouse short-circuits when the old analyzer is used:
if (!use_analyzer)
    return nullptr;
use_analyzer comes from allow_experimental_analyzer (renamed to enable_analyzer, with the alias preserved), so the crash path only executes when the new analyzer is enabled.
Why ClickHouse evaluates filters during join optimization
ClickHouse represents filters and projections as an expression graph, an ActionsDAG. During planning, the optimizer tries to turn outer joins into inner joins when the WHERE clause would discard not-matched rows anyway. That requires evaluating the filter under a not-matched-rows scenario, in which some right-side expressions are unknown. The crash happened because “unknown” became “invalid”.
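To make that concrete, here is a conceptual sketch of the conversion test, not ClickHouse's actual code: the filter is modeled as a hypothetical tri-valued predicate over an optional right-side value, and the question is whether it can ever pass for a row whose right side is all NULLs.

#include <functional>
#include <iostream>
#include <optional>

// std::nullopt stands in for "the right side did not match".
using RightSideValue = std::optional<int>;

// Hypothetical helper: can the WHERE filter pass for a not-matched row?
// If it never can, every not-matched row is discarded anyway, and the
// ANY LEFT JOIN can be converted to an ANY INNER JOIN.
bool filterPassesForNotMatchedRow(const std::function<std::optional<bool>(RightSideValue)> & filter)
{
    // Simulate the not-matched scenario: the right side is all NULLs.
    std::optional<bool> result = filter(std::nullopt);
    // NULL and false both mean WHERE discards the row.
    return result.value_or(false);
}

int main()
{
    // WHERE ... AND isNotNull(s.site) is always false when s did not match,
    // so the conversion is safe.
    auto is_not_null = [](RightSideValue v) -> std::optional<bool> { return v.has_value(); };
    std::cout << std::boolalpha << filterPassesForNotMatchedRow(is_not_null) << '\n'; // false
}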
The root cause: operator[] inserted a poisoned default
Inside ActionsDAG::evaluatePartialResult, there is a map from DAG nodes to partially-evaluated columns, roughly:
- key: ActionsDAG::Node const *
- value: ColumnWithTypeAndName
ColumnWithTypeAndName is defined as:
struct ColumnWithTypeAndName
{
    ColumnPtr column;  // may be nullptr for headers
    DataTypePtr type;  // expected to be set
    String name;
};
The comment allows column == nullptr to represent headers, but it never suggests type can be null. Default construction (ColumnWithTypeAndName()) leaves both column and type null.
Not every child node has a value at partial evaluation time, especially when the expression references the right side of an outer join and the optimizer is simulating not-matched rows. filterResultForNotMatchedRows runs partial evaluation with:
allow_unknown_function_arguments = true
throw_on_error = false
Some right-side columns are expected to be unknown, as long as their type metadata survives.
The original code built function arguments like this:
arguments[i] = node_to_column[node->children[i]];
That looks like a read, but std::unordered_map::operator[] is not a read. If the key is missing, it inserts a default-constructed value, so missing children became column == nullptr, type == nullptr, name == "".
So a missing argument (which was supposed to be unknown but still well-formed) became an argument whose type pointer was literally null.
isNotNull() reads that type pointer during constant folding:
if (elem.type->onlyNull()) ...
if (canContainNull(*elem.type)) ...
So it eventually dereferenced the null type pointer in FunctionIsNotNull::getConstantResultForNonConstArguments.
The crash comes from a subtle C++ detail: using operator[] on a map in a context where missing keys are expected.
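The footgun is easy to reproduce in isolation. Here is a minimal, self-contained sketch with simplified stand-in types (an int key instead of a node pointer, plain shared_ptr instead of ClickHouse's pointer types):

#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

struct DataType { bool onlyNull() const { return false; } };

struct ColumnWithTypeAndName
{
    std::shared_ptr<void>     column; // may be nullptr for headers
    std::shared_ptr<DataType> type;   // expected to be set
    std::string               name;
};

int main()
{
    std::unordered_map<int, ColumnWithTypeAndName> node_to_column;

    // Looks like a read, but operator[] inserts a default-constructed
    // value for the missing key 42.
    ColumnWithTypeAndName arg = node_to_column[42];

    // Both pointers are null and the map silently grew.
    assert(arg.column == nullptr);
    assert(arg.type == nullptr);
    assert(node_to_column.size() == 1);

    // Any type-based check now dereferences a null pointer:
    // arg.type->onlyNull(); // undefined behavior, a segfault in practice
}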
The fix: represent “unknown” as typed, not as null
I did not try to redefine ClickHouse’s partial evaluation rules. The pass explicitly allows unknown arguments, and that is necessary for the optimization to exist at all.
The fix was to stop turning a missing entry into a default-constructed, invalid one: remove the operator[] access and use find() instead. When a child was missing from the map, I created a typed stub from the node metadata: {nullptr, child->result_type, child->result_name}.
That keeps the column pointer null (the value is still unknown), but it preserves the type so type-based functions can reason about it without crashing:
const auto * child = node->children[i];
if (auto it = node_to_column.find(child); it != node_to_column.end())
    arguments[i] = it->second;
else
    arguments[i] = ColumnWithTypeAndName{nullptr, child->result_type, child->result_name};
With this in place, isNotNull() can fold constants safely even when some arguments are missing, and the optimizer can decide whether the outer join conversion is valid.
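As a sanity check, the invariant the fix guarantees can be expressed as a small unit-style sketch, again with hypothetical stand-in types rather than ClickHouse's real ones: whether or not a child has been evaluated, the argument it produces always carries a non-null type.

#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

struct DataType {};

struct ColumnWithTypeAndName
{
    std::shared_ptr<void>     column;
    std::shared_ptr<DataType> type;
    std::string               name;
};

// Stand-in for ActionsDAG::Node: just the metadata the stub needs.
struct Node
{
    std::shared_ptr<DataType> result_type;
    std::string               result_name;
};

ColumnWithTypeAndName lookupArgument(
    const std::unordered_map<const Node *, ColumnWithTypeAndName> & node_to_column,
    const Node * child)
{
    if (auto it = node_to_column.find(child); it != node_to_column.end())
        return it->second;
    // Missing child: the value stays unknown (null column),
    // but the type metadata survives.
    return ColumnWithTypeAndName{nullptr, child->result_type, child->result_name};
}

int main()
{
    Node site{std::make_shared<DataType>(), "s.site"};
    std::unordered_map<const Node *, ColumnWithTypeAndName> node_to_column;

    ColumnWithTypeAndName arg = lookupArgument(node_to_column, &site);
    assert(arg.column == nullptr);  // still unknown
    assert(arg.type != nullptr);    // but typed, so isNotNull() can reason about it
    assert(node_to_column.empty()); // and the map did not silently grow
}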
Regression test
I added a regression test that mirrors the reproduction steps and asserts that the query executes and the server remains alive.
Impact
The patch does not change query semantics and it does not introduce measurable overhead. It replaces an accidental insertion into an unordered map with a lookup and a stub object.
Takeaways
If you are using ClickHouse with allow_experimental_analyzer = 1, keep in mind that optimizer paths can be exercised more aggressively. If you see an ECONNRESET from a single query and nothing else looks wrong, check whether the server actually survived the request.
Also, in C++, map[key] is a surprisingly expensive way to learn a life lesson.