Burak Karakan’s Post

you deserve a better data platform

Databases are hard. As the number of workloads Bruin runs increases, we keep running into unusual issues with the platforms we use, as well as gaps in our own knowledge. This time it was Postgres.

One of the databases we run for a new internal orchestration system started hitting 100% CPU usage as we onboarded more workloads onto it. We immediately started digging in: query insights showed a very simple select on an indexed column in a large table, executed as a prepared statement: `... WHERE status = $1`. The query plan showed nothing wrong: no sequential scan, plan and execution costs super low, negligible I/O. Everything pointed towards a sequential scan, yet nothing in the query plan showed one.

We dug deeper: checked autovacuum, WAL, replication lag, index bloat. Everything looked normal. The queries were fast when we ran them in isolation, but the CPU stayed at 100%.

Finally, pg_stat_statements gave it away: the query running in production was executing with a different plan than the one we were testing manually. When executed directly (`status = 'queued'`), it used the index and returned instantly. But when executed as a prepared statement (`status = $1`), Postgres switched to a generic plan, which ignored the partial index and fell back to scanning the entire table by primary key every single time.

The fix was incredibly simple: we replaced the `$1` with `'queued'` and that was it. CPU usage immediately dropped to single digits.

Moral of the story: databases are hard.
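The generic-plan trap described above can be reproduced in psql. This is a minimal sketch, not the actual schema from the post: the `jobs` table, its columns, and the index name are all hypothetical, standing in for whatever the orchestration system actually uses.

```sql
-- Hypothetical setup mirroring the post: a large table with a
-- partial index covering only the rows the hot query asks for.
CREATE TABLE jobs (id bigint PRIMARY KEY, status text, payload jsonb);
CREATE INDEX jobs_queued_idx ON jobs (status) WHERE status = 'queued';

-- With a literal, the planner sees the constant and can use the
-- partial index.
EXPLAIN SELECT * FROM jobs WHERE status = 'queued';

-- With a prepared statement, Postgres plans for an unknown $1.
-- After five executions it may switch to a cached generic plan,
-- and a generic plan must be valid for ANY $1, so it cannot use
-- an index that only covers status = 'queued'.
PREPARE q(text) AS SELECT * FROM jobs WHERE status = $1;
EXPLAIN EXECUTE q('queued');  -- run 6+ times and watch the plan change

-- A mitigation besides inlining the literal (Postgres 12+):
-- always re-plan with the actual parameter value.
SET plan_cache_mode = force_custom_plan;
```

Inlining the literal, as the post did, is the simplest fix when the value is genuinely constant; `plan_cache_mode` is the knob to reach for when the query has to stay parameterized.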
