A first look at pg_repack

I got time to have a look at pg_repack, an interesting extension that helps removing bloat from tables, indexes and databases with the promise of minimal locking.
Since this is a first look, I could be wrong on some aspects, so please apologize me.

The main idea behind pg_repack is to perform an on-line copy of a source (bloated) table, then switching the original table with the new one. In short, something like:

BEGIN;
CREATE TABLE not_bloated AS SELECT * FROM my_table;
ALTER TABLE my_table RENAME TO old_bloated;
ALTER TABLE not_bloated RENAME TO my_table;
DROP TABLE old_bloated;
COMMIT;


Of course things are a lot more complex than the above description, but I think that could be a good summary of what happens.

Why such a workflow would remove bloating?
Well, the idea is that the copy of tuples from the bloated table will, of course, copy only visible tuples (i.e., those that would left after a VACUUM). In other words, dead tuples are not going to hit the new table and therefore the last will not be bloated.

What about locking?
Since the copy is done on-line, the source table (the bloated one) can be used as usual, that is DML queries can be executed against such table. This of course creates a kind of race-condition, since changes are not propagated automatically to the new table.
To solve the problem, pg_repack installs a trigger that will fire for every DML statement and will log changes to a repack.log table, so that pg_repack will be able to replay changes at the end of the copy, that is just before switching the tables.
This is important, according to me, because this means that running pg_repack is not the same as running VACUUM since the new table could have a small fraction of bloating. Why? Well, if during the copy the original table is subjected to a workload that can cause bloating (i.e., UPDATE and DELETE), such bloating will be propagated to the new table as well. <br/<
What about disk space?
Doing a copy of the original table, pg_repack is going to require at least double the size of the original table on disk.

Installing pg_repack

pg_repack is an extension, and therefore can be installed via pgxn (as well as manually, of course):

% sudo  pgxn install pg_repack


Then of course, you need to install the extension into the database you are going to use (or repack):

% psql -U postgres -c "CREATE EXTENSION pg_repack;" testdb
CREATE EXTENSION


Using pg_repack

pg_repack must be invoked from the command line as an external utility. The command accepts pretty much all the usual arguments from libpq:

%  pg_repack -t "luca.wa" -U postgres testdb
INFO: repacking table "luca.wa"


The above will repack a single table, but it is possible to repack all tables in a schema, all tables in a database and so on.

The repack schema

pg_repack installs a repack schema in the database where the extension lives. In such schema there are different tables, mainly temporary for repacking objects. An interesting table is repack.tables that contains all the details for every table that can be repacked. Querying such tables you can see some tricks used in the workflow of pg_repack:

testdb=# select create_log, create_trigger, lock_table 
         from repack.tables 
         where relname = 'luca.wa';
...
create_log     | CREATE TABLE repack.log_16553
                 (id bigserial PRIMARY KEY, pk repack.pk_16553, row luca.wa)
create_trigger | CREATE TRIGGER repack_trigger
                 AFTER INSERT OR DELETE OR UPDATE ON luca.wa
                 FOR EACH ROW EXECUTE PROCEDURE
                 repack.repack_trigger('INSERT INTO repack.log_16553(pk, row)
                                       VALUES( CASE WHEN $1 IS NULL THEN NULL
                                       ELSE (ROW($1.pk)::repack.pk_16553) END, $2)')
lock_table     | LOCK TABLE luca.wa IN ACCESS EXCLUSIVE MODE


As you can see, there are SQL instructions to create a log_xxx table where changed tuples will be logged, as well as the definition of the trigger to attach to the table.
The repack_trigger is a C function that accepts an SQL string (as you can see) and that will execute an insert into the log_xxx table so that:
  • in case of an INSERT the new tuple will be inserted as (null, row);
  • in case of an UPDATE both the new and old tuples will be inserted as (old, new);
  • in case of a DELETE the old tuple only will be inserted as (old, null).


The lock_table is used to lock the table during the initial and final steps, that is at the time the trigger is attached and when the tables are swapped.

Conclusions

pg_repack is surely an interesting extension to keep into the bag. In the future I’m going to spend some time using this extension to see how it performs, but I already know there are happy people using it, so I’m expecting positive results!

The article A first look at pg_repack has been posted by Luca Ferrari on March 25, 2021