PostgreSQL Extended Statistics

PostgreSQL 10 defines a set of extented statistics, mainly for intra-column dependencies and distinct values. New commands have been added to create and rop such extended statistics, and this post just covers the surface of this new feature directly as long as I was experimenting with that (and as usual, comments are welcome!).

A sample data set

Assume you have a table defined as follows:
CREATE TABLE expenses(
   pk int GENERATED ALWAYS AS IDENTITY,
   value money,      
   day date,
   quarter int,
   year int,
   incoming boolean DEFAULT false, 
   account  text DEFAULT 'cash',
   PRIMARY KEY ( pk ),
   CHECK( value <> 0::money ),
   CHECK( quarter = EXTRACT( quarter FROM day ) ),
   CHECK( year = EXTRACT( year FROM day ) )
);
clearly there are a few dependencies:
  • column quarter depends on the value of column day;
  • column year depends on the value of day too,
  • column incoming is true when the value is greater than zero, false otherwise.
And since there are these dependencies, there must be something that ensures us the columns move together, so for instance there is a simple trigger as follows:
CREATE OR REPLACE FUNCTION f_expenses()
RETURNS TRIGGER
AS $code$
BEGIN
  IF NEW.day IS NULL THEN
     NEW.day := CURRENT_DATE;
  END IF;
  
  IF NEW.value = 0::money THEN
     NEW.value := 0.01::money;
  END IF;

  NEW.quarter  := EXTRACT( quarter FROM NEW.day );
  NEW.year     := EXTRACT( year FROM NEW.day );
  IF NEW.value > 0::money THEN
     NEW.incoming := true;
  ELSE
    NEW.incoming := false;
  END IF;

  RETURN NEW;
END
$code$
LANGUAGE plpgsql;



CREATE TRIGGER tr_expenses BEFORE INSERT OR UPDATE ON expenses
FOR EACH ROW
EXECUTE PROCEDURE f_expenses();
In a /normal/ situation the planner cannot know such dependencies and, consequently, cannot take advantage of them. In order to see how this can change in PostgreSQL 10, let’s first insert some records:
WITH v(value, counter, account) AS (
   SELECT ( random() * 100 )::numeric + 1 * CASE v % 2 WHEN 0 THEN 1 ELSE -1 END,
          v,
          CASE v % 3 WHEN 0 THEN 'credit card'
                     WHEN 1 THEN 'bank'
                     ELSE 'cash'
         END
  FROM generate_series( 1, 2000000 ) v
)  
          
INSERT INTO expenses( value, day, account )
SELECT value, CURRENT_DATE - ( counter % 1000 ), account
FROM v;
which inserts 2 millions rows (roughly 130 MB of data), 2000 tuples per day. It is possible to check how many values are there, for instance in the year 2016:
SELECT                                     
 count(*) FILTER( WHERE year = 2016 ) AS total_2016_by_year, 
 count(*) FILTER( WHERE EXTRACT( year FROM day ) = 2016 ) AS total_2016_by_day,
 count(*) FILTER( WHERE year = 2016 AND incoming = true AND day = '2016-7-19'::date ) AS incoming_by_year, 
 count(*) FILTER( WHERE EXTRACT( year FROM day ) = 2016 AND incoming = true AND day = '2016-7-19'::date ) AS incoming_2016_by_day,
 pg_size_pretty( pg_relation_size( 'expenses' ) ) AS table_size
 FROM expenses;

total_2016_by_year   | 732000
total_2016_by_day    | 732000
incoming_by_year     | 1982
incoming_2016_by_day | 1982
table_size           | 136 MB

Without the extended statistics

What happens when we search for all the expenses of a particular year?
EXPLAIN  
SELECT *
FROM expenses
WHERE incoming = true
AND   value > 0.0::money
AND   day = '2016-7-19'::date
AND   year = 2016;
                                               QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..35128.37 rows=697 width=32)
   Workers Planned: 2
   ->  Parallel Seq Scan on expenses  (cost=0.00..34058.67 rows=290 width=32)
         Filter: (incoming AND (day = '2016-07-19'::date) AND (year = 2016) AND (value > (0.0)::money))

The planner estimates 697 tuples while we know that the above filtering predicates will provide 1982 tuples.

With the extended statistics

Let’s inform the system about the column dependencies:
CREATE STATISTICS stat_day_year ( dependencies )
 ON day, year
 FROM expenses;

CREATE STATISTICS stat_day_quarter ( dependencies )
 ON day, quarter
 FROM expenses;
 
CREATE STATISTICS stat_value_incoming( dependencies )
 ON value, incoming
 FROM expenses;
There is a special view pg_statistic_ext that holds data about the created statistics, but it gets updated only by an ANALYZE command. Each row in the view corresponds to a created statistics and includes the type and relationship:
SELECT * FROM pg_statistic_ext;

-[ RECORD 1 ]---|---------------------
stxrelid        | 51015
stxname         | stat_value_incoming
stxnamespace    | 2200
stxowner        | 16384
stxkeys         | 2 6
stxkind         | {f}
stxndistinct    | 
stxdependencies | {"2 => 6": 1.000000}
-[ RECORD 2 ]---|---------------------
stxrelid        | 51015
stxname         | stat_day_quarter
stxnamespace    | 2200
stxowner        | 16384
stxkeys         | 3 4
stxkind         | {f}
stxndistinct    | 
stxdependencies | {"3 => 4": 1.000000}
-[ RECORD 3 ]---|---------------------
stxrelid        | 51015
stxname         | stat_day_year
stxnamespace    | 2200
stxowner        | 16384
stxkeys         | 3 5
stxkind         | {f}
stxndistinct    | 
stxdependencies | {"3 => 5": 1.000000}
Therefore column 3 determines both columns 4 and 5 (see stxkeys) by a functional {f} dependency, and in particular target columns are fully computed by their dependendant column (100%) as reported in the stxdependencies column. The same applies for column 2 (value) that determines column 6 (incoming). What is now the query plan?
 EXPLAIN 
SELECT *
FROM expenses
WHERE incoming = true
AND   value > 0.0::money
AND   day = '2016-7-19'::date
AND   year = 2016;
                                               QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..35251.67 rows=1930 width=32)
   Workers Planned: 2
   ->  Parallel Seq Scan on expenses  (cost=0.00..34058.67 rows=804 width=32)
         Filter: (incoming AND (day = '2016-07-19'::date) AND (year = 2016) AND (value > (0.0)::money))
As you can see the planner now can estimates almost exactly the number of tuples that will be returned (1930 exstimated against the 1982 actual). The fact is that now the planner knows the predicate pairs on incoming and value, as well as day and year are now tied due to a functional dependency and therefore it will not multiplicate each individual ratio to get the final filtering ratio.

What about indexes?

Initially I thought that the new extended statistics could automagically understand also index depdencies, so that an index built on an expression of day can be used to retrieve also year. Clearly, I was wrong, and that proves that I have to do a better job learning this new feaure!.

The article PostgreSQL Extended Statistics has been posted by Luca Ferrari on June 28, 2018