Lock-free clustering of large PostgreSQL data sets

Since 2012-09-27, I have been collecting overstock data from TF2WH every five minutes and storing a snapshot of this data in my database. This enables me to do some really cool things, such as chart past data as well as make projections about future stock levels. For this amount of time, data collection has been proceeding largely uninterrupted and today I have just shy of 40 million individual records.

However, the optimal physical layout of this data is to have the records for each item type grouped together, sorted by timestamp. Due to the nature of time, these records wind up being grouped first by timestamp, and then by item type. This can make lookups of data for a particular item slow, since many database pages need to be visited in order to collect all of the data about a particular item.

PostgreSQL (among other databases, of course) has two primary solutions to deal with this. The simplest one is the CLUSTER statement, which will reorganize all of the data in a table based on an index, such that lookups making use of that index are quick. Since the table’s primary key is composite around the item type and timestamp, this is convenient and does the right thing. Unfortunately, clustering requires an exclusive lock on the table. Sorting and writing out 40 million rows is not a fast operation, and while this is underway no new data can be added to the table, and nobody can query the table for information. This is not a good situation when new data is coming in every five minutes, and users are frequently requesting data of a web application.

The more complex solution is to use table partitioning. This requires that one empty parent table be created, and many child tables be created, each one for a specific item type. The parent table will allow querying across the entire data set, but since each item’s data is stored in a different table, clustering is no longer required — each table will be ordered by timestamp (since new rows are appended) and the separate tables keep the data sets for each item type physically separate on disk. This is a nice solution in theory, but PostgreSQL does not currently provide any mechanism for automatically partitioning a table based on a set of columns; the child tables have to be maintained either by hand, or by script. I like to avoid this kind of complexity when possible.

A compromise solution I’d thought of would allow collection of data to proceed during a cluster operation, but bring down the web application for the duration of the cluster. Before the cluster operation begins, I would have the data collection script retarget insertions into a different table. Once the cluster operation finishes, the script would return to inserting data into the primary table, and the contents of the other table would be transferred over. This is simple, but of course we want to keep the web application responsive if at all possible.

After asking in #postgresql I received some suggestions, and wound up implementing a modified version of one of them. This one allows me to effectively cluster the data set while allowing data collection to proceed and keeping the web application up (though with degraded performance, as the clustering operation and web application will both be waiting on each others’ disk I/O). Further, I was able to implement this change to the production database schema without interrupting data collection, nor taking the web application offline. (Although the application was unreachable for a few seconds due to an oversight on my part regarding table permissions. Theoretically, if I’d foreseen this requirement then there would have been no downtime.)

The table is partitioned into three child tables: static1, static2, and latest. The parent table has a rule that redirects all inserts into the “latest” partition. This serves as the data collection point, where new rows sit until they are migrated into one of the other two partitions. The “static1” and “static2” partitions serve as front and back buffers for the cluster operation. Only one partition will actually have any rows at any given time (the front buffer); the other will be empty (the back buffer).

When I want to perform a cluster operation, the rows from the “latest” partition are moved into the front buffer. Then, the rows from the front buffer are dumped into the back buffer, ordered by the primary key (this is what actually causes the data to be clustered). Finally, the front buffer is truncated to remove the dead rows, and the front and back buffers are swapped. Next time the operation happens, rows will be moved in the opposite direction between the two partitions. This is what the data flow looks like during one cluster operation:

latest --[ rows moved ]--> static1 --[ rows moved and sorted ]--> static2

The next time, it would look like this:

latest --[ rows moved ]--> static2 --[ rows moved and sorted ]--> static1

The magic queries that let all of this happen without interrupting the web application are:

WITH new_rows AS (DELETE FROM some_data_latest RETURNING *)
INSERT INTO some_data_static1
    SELECT * FROM new_rows;

WITH new_rows AS (DELETE FROM some_data_static1 RETURNING *)
INSERT INTO some_data_static2
    SELECT * FROM new_rows
    ORDER BY ...;

TRUNCATE TABLE some_data_static1;

ANALYZE some_data_latest;
ANALYZE some_data_static1;
ANALYZE some_data_static2;

The next time, references to static1 and static2 would be swapped.

There are other options besides TRUNCATE TABLE:

TRUNCATE TABLE static1; will very quickly dispose of the deleted records (usually in less than one second). However, it is not transaction-safe; concurrent transactions may wind up seeing no data in the parent table at all.
VACUUM static1; will slowly dispose of the deleted records, and will not cause concurrency issues. However, the time it takes to complete this operation is not small, and in my experience it causes sub-optimal layout of rows inserted into this partition in the future.
VACUUM FULL static1; will slowly dispose of the deleted records and will ensure that future inserts into this table are done as optimally as possible. However, it requires an exclusive lock on the table. This will prevent the web application from processing requests, and the whole point of this approach is to avoid this.

Based on this pro/con list, TRUNCATE TABLE seems to be the way to go. The window of time during which concurrency issues may result is not very large. However, because of these potential issues, I recommend executing the data-clustering query outside of a transaction — or, at least, only put the first two statements in a transaction. Commit the transaction before truncating the empty table.

(Update 2013-03-04: It’s possible to properly vacuum the table while minimizing the time during which an exclusive lock is held by first doing a standard vacuum and then doing a full one: VACUUM static1; VACUUM FULL static1;. The first vacuum takes a while, but the second one then usually completes in seconds. Of course, any deleted rows that are visible to ongoing transactions will not be removed as a part of this process — and that’s really the only reason why you would vacuum instead of truncate. If you need to reclaim disk space underneath an open transaction, vacuum is the way to go. Otherwise, there is no reason you can’t truncate the table.)

Here is the SQL that creates the tables enabling this approach. (Column names have been replaced with ... so that you may easily tailor this SQL to your own environment.)

-- Parent table.  This is the logical table against which queries
-- are executed.
CREATE TABLE some_data ( ... );

-- Partition for new records.
CREATE TABLE some_data_latest ()
INHERITS (some_data);

-- One of the buffers.
CREATE TABLE some_data_static1 ()
INHERITS (some_data);

-- The other buffer.
CREATE TABLE some_data_static2 ()
INHERITS (some_data);

-- Primary keys.  These will not enforce uniqueness across the
-- entire logical table, but they will serve to index the data.
ALTER TABLE ONLY some_data_latest
    ADD PRIMARY KEY (...);

ALTER TABLE ONLY some_data_static1
    ADD PRIMARY KEY (...);

ALTER TABLE ONLY some_data_static2
    ADD PRIMARY KEY (...);

-- Rule directing inserts on the main table to the latest table.
CREATE RULE some_data_insert AS
    ON INSERT TO some_data
    DO INSTEAD
        INSERT INTO some_data_latest
        VALUES (NEW.*);

This is the script that was used to apply this change to a production database. Thankfully, PostgreSQL implements transactional DDL, so we can rename tables out from under the web application and data collections scripts! They will use the old table up until the transaction is committed, at which point they will immediately begin using the new partitioned table.

BEGIN;

ALTER TABLE some_data RENAME TO some_data_latest;

CREATE TABLE some_data ( ... );

ALTER TABLE some_data_latest INHERITS (some_data);

CREATE TABLE some_data_static1 ()
INHERITS (some_data);

CREATE TABLE some_data_static2 ()
INHERITS (some_data);

ALTER TABLE ONLY some_data_static1
    ADD PRIMARY KEY (...);

ALTER TABLE ONLY some_data_static2
    ADD PRIMARY KEY (...);

CREATE RULE some_data_insert AS
    ON INSERT TO some_data
    DO INSTEAD
        INSERT INTO some_data_latest
        VALUES (NEW.*);

COMMIT;

Don’t forget any GRANT statements required to make the new tables accessible by the web application! This is the step I forgot, and it took down the web application for a minute or so.

Once this is complete, run the clustering script given above to perform the initial cluster of your data. While it’s running, you should notice that the web application will continue to be responsive, and new collected data will be immediately visible — just as before! Once the operation concludes, you should notice an increase in performance as lookups for past data will be using a mostly-clustered data set.