Using pg_badplan to CREATE STATISTICS is a good plan

Thanks to Tomas Vondra’s presentation CREATE STATISTICS – What is it for? at Nordic PGDay 2018, my Trustly colleague Claes Jakobsson came up with the idea for a new extension.

The presentation explained how to manually compare the planner’s estimated row count with the actual row count by executing EXPLAIN ANALYZE.
Claes came up with the idea to do this comparison automatically, right after a query has been executed.
During the conference, Claes implemented the idea and emailed me some code to test.

Below is a demo of this extension, using the same data set and queries as in the presentation.

git clone https://github.com/tvondra/create-statistics-talk.git
psql -X -f create-statistics-talk/create.sql

Let’s first load the data and try the query without pg_badplan, using EXPLAIN ANALYZE:

\copy zip_codes from create-statistics-talk/no_postal_codes_utf.csv with csv header;
COPY 4574

EXPLAIN (ANALYZE, TIMING off)
SELECT * FROM zip_codes WHERE city = 'Oslo' AND county = 'Oslo';
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Seq Scan on zip_codes  (cost=0.00..108.61 rows=90 width=36) (actual rows=642 loops=1)
   Filter: ((city = 'Oslo'::text) AND (county = 'Oslo'::text))
   Rows Removed by Filter: 3932
 Planning time: 0.357 ms
 Execution time: 0.679 ms
(5 rows)

As you can see, the ratio between the actual and estimated rows is approx 7.133 (= 642/90).

Next, let’s see how we can automate this manual process using pg_badplan.

git clone https://github.com/trustly/pg_badplan.git
cd pg_badplan
make install
echo "shared_preload_libraries = 'pg_badplan'" >> /usr/local/var/postgres/postgresql.conf
echo "pg_badplan.min_row_threshold = 10" >> /usr/local/var/postgres/postgresql.conf
brew services restart postgresql

The pg_badplan.min_row_threshold has been set to 10 just for testing;
the default is 1000, but the query in the example returns fewer rows than that.

We can now run the same query without EXPLAIN ANALYZE,
and rely on pg_badplan to notify us via the normal log file.

SELECT * FROM zip_codes WHERE city = 'Oslo' AND county = 'Oslo';
tail -f /usr/local/var/postgres/log/postgresql.csv | grep pg_badplan
2018-03-19 11:46:54.154 CET,"joel","joel",91666,"[local]",5aaf951a.16612,5,"SELECT",2018-03-19 11:46:50 CET,4/34,0,LOG,00000,"pg_badplan: rows expected/actual ratio 7.133 exceeded for query SELECT * FROM zip_codes WHERE city = 'Oslo' AND county = 'Oslo';",,,,,,"SELECT * FROM zip_codes WHERE city = 'Oslo' AND county = 'Oslo';",,,"psql"

Finally, after having analyzed the pg_badplan log output,
we will hopefully realise it would be a good plan
to CREATE STATISTICS, helping the planner make
better estimates, which in turn helps it select
better, faster plans.

CREATE STATISTICS s (dependencies) ON city, state, county FROM zip_codes;
ANALYZE zip_codes;

Run the same query again, and watch how pg_badplan now stays silent, since the ratio between the estimated and actual rows is below the pg_badplan.ratio threshold value.
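For example, re-running the EXPLAIN from before should now show a much better estimate:

EXPLAIN (ANALYZE, TIMING off)
SELECT * FROM zip_codes WHERE city = 'Oslo' AND county = 'Oslo';
-- The Seq Scan's rows= estimate should now be close to the actual 642,
-- thanks to the functional dependency statistics on the columns.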

The code is available here: https://github.com/trustly/pg_badplan

pg_catalog visualized

I couldn’t find any graph showing all the relations between all the pg_catalog tables,
so just for fun I wrote a little script to parse the SGML and generate a graph using GraphViz.

#!/usr/bin/perl
use strict;
use warnings;

open my $fh, "<:encoding(utf8)", './doc/src/sgml/catalogs.sgml' or die "$!";

my $table;
my $column;
my $references;
my $pg_catalog_fk_map = {};
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ m!^\s+<title><structname>([^<>]+)</> Columns</title>$!) {
        $table = $1;
    } elsif ($line =~ m!^\s+<entry><structfield>([^<>]+)</structfield></entry>$!) {
        $column = $1;
    } elsif ($line =~ m!^\s+<entry><type>(oid|regproc)</type></entry>$!) {
    } elsif (defined $column && $line =~ m!^\s+<entry><literal><link\ linkend="[^"]+"><structname>([^<>]+)</structname></link>\.oid</literal></entry>$!) {
        $references = $1;
        if (!defined($pg_catalog_fk_map->{$table}->{$column})) {
            $pg_catalog_fk_map->{$table}->{$column} = $references;
        } elsif ($pg_catalog_fk_map->{$table}->{$column} ne $references) {
            die "Inconsistent data, $table.$column references both $pg_catalog_fk_map->{$table}->{$column} and $references";
        }
        if (!defined($pg_catalog_fk_map->{$references})) {
            $pg_catalog_fk_map->{$references} = {};
        }
    } else {
        undef($column);
        undef($references);
    }
}

my $dot = qq!
    digraph g {
        graph [
            rankdir = "LR"
        ];
        node [
            fontsize = "16"
            shape = "ellipse"
        ];
        edge [
        ];
!;
foreach my $table (sort keys %{$pg_catalog_fk_map}) {
    $dot .= qq!
        "$table" [
            label = "$table | <oid> oid|!;
    foreach my $column (sort keys %{$pg_catalog_fk_map->{$table}}) {
        $dot .= "<$column> $column|";
    }
    chop($dot);
    $dot .= qq!"
            shape = "record"
        ];
    !;
}
my $id = 0;
foreach my $table (sort keys %{$pg_catalog_fk_map}) {
    foreach my $column (sort keys %{$pg_catalog_fk_map->{$table}}) {
        my $references = $pg_catalog_fk_map->{$table}->{$column};
        $dot .= qq!
        "$table":$column -> "$references":oid [
            id = $id
        ];
        !;
        $id++;
    }
}

$dot .= qq!
    }
!;

print $dot;
./pg_catalog_foreign_keys.pl > pg_catalog.dot 
dot -Tpng -o pg_catalog.png pg_catalog.dot

(Image: graph of the foreign-key relationships between the pg_catalog tables, generated by the script above.)

PgTerminator: Kill misbehaving db users if important processes are waiting

If you are a DBA and have ever had to manually call pg_terminate_backend() to kill some misbehaving backend process, then this tool might be of interest.

Long-running queries are not a problem, as long as they don’t force other important processes to wait.

Another classic is a human user forgetting to COMMIT, leaving a transaction open.
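For reference, a manual one-off version of what PgTerminator automates could look something like the sketch below (assuming PostgreSQL 9.6+ for pg_blocking_pids; PgTerminator itself applies its own configurable rules):

-- Terminate backends that are idle in transaction while blocking other backends
SELECT pg_terminate_backend(a.pid)
FROM pg_stat_activity AS a
WHERE a.state = 'idle in transaction'
AND EXISTS (
    SELECT 1
    FROM pg_stat_activity AS waiting
    WHERE a.pid = ANY (pg_blocking_pids(waiting.pid))
);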

Please see the Github page for more info:

https://github.com/trustly/pgterminator

Pgcronjob: Cron daemon to run user defined functions

I quite often find myself having to split up a large write operation, such as updating millions of rows in a heavily used table. This usually boils down to writing a small Perl script that runs a small one-off function, which performs the task by updating a limited number of rows per run, committing in between to avoid one long-running transaction.

The last time I needed to do this, I decided not to write yet another Perl script and instead do something to improve the situation. I haven’t been able to find any cron-ish tools at all for running database functions, except for pgAgent, but that looks more like a client-side daemon; I wanted something bare-minimum that runs server-side and piggy-backs on the OS cron.

A few hours and 382 lines of code later, the project was finished and pushed to Github:

https://github.com/trustly/pgcronjob

createuser pgcronjob
psql -f install.sql
crontab pgcronjob.crontab
psql

CREATE OR REPLACE FUNCTION public.CronJob_Function_Template_Skeleton()
RETURNS batchjobstate
LANGUAGE plpgsql
SET search_path TO public, pg_temp
AS $FUNC$
BEGIN
    RAISE NOTICE 'Hello world!';
    PERFORM pg_sleep(random());
    RAISE NOTICE 'Slept for a while.';
    IF random() < 0.5 THEN
        -- Tell CronJob() we have more work to do and we want it to run us again in due time
        RAISE NOTICE 'See you again!';
        RETURN 'AGAIN';
    ELSIF random() < 0.5 THEN
        -- Throw error to CronJob() to test errors
        RAISE EXCEPTION 'Simulate error in CronJob function';
    ELSE
        -- Tell CronJob() we're done and we don't want it to run us ever again
        RAISE NOTICE 'Bye world!';
        RETURN 'DONE';
    END IF;
END;
$FUNC$;

GRANT EXECUTE ON FUNCTION public.CronJob_Function_Template_Skeleton() TO pgcronjob;

SELECT CronJob_Register('public','cronjob_function_template_skeleton');
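To tie this back to the original batching use case, a real job could perform a bounded amount of work per run and rely on the commit between runs to keep transactions short. A hedged sketch, where the users table, its is_migrated column and the batch size are all made up for illustration:

CREATE OR REPLACE FUNCTION public.CronJob_Migrate_Users()
RETURNS batchjobstate
LANGUAGE plpgsql
SET search_path TO public, pg_temp
AS $FUNC$
BEGIN
    -- Hypothetical batched write: touch at most 10000 rows per run
    UPDATE users SET is_migrated = TRUE
    WHERE userid IN (
        SELECT userid FROM users WHERE NOT is_migrated LIMIT 10000
    );
    IF NOT FOUND THEN
        -- Nothing left to update; tell CronJob() we are done for good
        RETURN 'DONE';
    END IF;
    -- More rows remain; CronJob() commits and will run us again in due time
    RETURN 'AGAIN';
END;
$FUNC$;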


Hopefully this will be useful for others as well. It would be fun to get some feedback.

The initial commit even comes with a nice README and install and uninstall scripts.

Extract from the README:

The settings are conditions that must all be TRUE for the cronjob to run, i.e. they are AND'd together.

Always NOT NULL:
- Enabled boolean NOT NULL DEFAULT TRUE: Controls whether the cronjob is enabled or not.
- RunEvenIfOthersAreWaiting boolean NOT NULL DEFAULT FALSE: Controls whether to run the cronjob or not if there are other waiting db txns (pg_stat_activity.waiting).
- RetryOnError boolean NOT NULL DEFAULT FALSE: Controls whether to run the cronjob ever again if the user-defined function throws an error.

Can be NULL (which means setting is ignored):
- RunAfterTimestamp timestamptz: Run only after the specified timestamp.
- RunUntilTimestamp timestamptz: Run only until the specified timestamp.
- RunAfterTime time: Run only after the specified time of the day.
- RunBeforeTime time: Run only until the specified time of the day.
- RunInterval interval: Run only after having slept for this interval after the last run started.
- SleepInterval interval: Run only after having slept for this interval after the last run finished.

Finding missing foreign keys

By coincidence I stumbled upon a table where one of its columns didn’t have a foreign key. I found this strange, since our convention is to always add foreign keys when appropriate.

But humans are humans and will eventually forget to add a foreign key, so we’d better have both belt and suspenders and find an efficient way to deal with this inevitable problem.

It would be a tedious job to manually look for missing foreign keys in all tables and columns.
But if you’re lucky enough to have the pleasure of working with a system where all tables and columns have been given their names by following a strict naming convention, it might be possible to fully automate the task.

This is the naming convention we use:

  1. Table names are always in plural form with a trailing “s”, e.g. users.
  2. Primary key column names are always equal to the table name in singular form with a trailing “id”, e.g. userid.
  3. Foreign key columns are always equal to the primary key they are referencing, e.g. transactions.userid -> users.userid.

This means you always know, based on the column name alone, which table (if any) the column should reference.

Example: If the table transactions has a column named userid, and there is a table where userid is the primary key, but there isn’t any foreign key on transactions.userid, then either a foreign key is missing or someone has not followed the naming convention.

Thanks to PostgreSQL’s fantastic pg_catalog system tables, we can write a query which uses the rules from the naming convention and returns all the columns which appear to be missing foreign keys. It doesn’t support multi-column keys, but we don’t have many of those, so it’s not a problem in my case.

Thanks to the view below, I automatically found three more missing foreign keys of the same type, which saved me many hours of boring work today.

pg1:joel=#* SELECT * FROM view_missing_foreign_keys;
 nspname |     relname     | attname 
---------+-----------------+---------
 public  | transactions    | userid
 public  | someothertable1 | userid
 public  | someothertable2 | userid
(3 rows)

I’m posting the view I wrote here in the hope it might be useful for others with a similar naming convention, and/or to inspire others to include the table name in their column names used as primary/foreign keys.

CREATE OR REPLACE VIEW view_missing_foreign_keys AS
SELECT
    pg_catalog.pg_namespace.nspname,
    pg_catalog.pg_class.relname,
    pg_catalog.pg_attribute.attname
FROM pg_catalog.pg_namespace
INNER JOIN pg_catalog.pg_class ON (pg_catalog.pg_class.relnamespace = pg_catalog.pg_namespace.oid)
INNER JOIN pg_catalog.pg_attribute ON (pg_catalog.pg_attribute.attrelid = pg_catalog.pg_class.oid)
WHERE pg_catalog.pg_class.relkind = 'r'
AND pg_catalog.pg_attribute.attnum > 0
AND NOT pg_catalog.pg_attribute.attisdropped
AND pg_catalog.pg_namespace.nspname NOT IN ('pg_toast','information_schema','pg_catalog')
AND pg_catalog.pg_attribute.attname LIKE '%id'
AND EXISTS (
    -- The column is PRIMARY KEY in some table
    SELECT 1 FROM pg_catalog.pg_constraint
    WHERE pg_catalog.pg_constraint.contype = 'p'
    AND pg_catalog.pg_get_constraintdef(pg_catalog.pg_constraint.oid) = format('PRIMARY KEY (%s)',pg_catalog.pg_attribute.attname)
)
AND NOT EXISTS (
    -- There is no FOREIGN KEY on this column
    SELECT 1 FROM pg_catalog.pg_constraint
    WHERE pg_catalog.pg_constraint.contype = 'f'
    AND pg_catalog.pg_constraint.conrelid = pg_catalog.pg_class.oid
    AND pg_catalog.pg_get_constraintdef(pg_catalog.pg_constraint.oid) LIKE (format('FOREIGN KEY (%s)',pg_catalog.pg_attribute.attname) || '%')
)
AND NOT EXISTS (
    -- This column is not the PRIMARY KEY of its own table,
    -- since if it were, we wouldn't require a FOREIGN KEY on it
    SELECT 1 FROM pg_catalog.pg_constraint
    WHERE pg_catalog.pg_constraint.contype = 'p'
    AND pg_catalog.pg_constraint.conrelid = pg_catalog.pg_class.oid
    AND pg_catalog.pg_get_constraintdef(pg_catalog.pg_constraint.oid) = format('PRIMARY KEY (%s)',pg_catalog.pg_attribute.attname)
)
ORDER BY
pg_catalog.pg_namespace.nspname,
pg_catalog.pg_class.relname,
pg_catalog.pg_attribute.attnum;
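Since the naming convention also tells us which table each column should reference, the view’s output can even be turned into ready-to-review DDL. Here is a sketch that blindly trusts the convention, so inspect the generated statements before running them:

SELECT format(
    'ALTER TABLE %I.%I ADD FOREIGN KEY (%I) REFERENCES %I (%I);',
    nspname,
    relname,
    attname,
    regexp_replace(attname, 'id$', 's'), -- e.g. userid -> users
    attname
)
FROM view_missing_foreign_keys;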

“How we use PostgreSQL at Trustly” at PGConf.EU

Slides from my talk, How we use PostgreSQL at Trustly, are now available.

psql \watch 1400000000 epoch time countdown counter

SET TIMEZONE TO 'UTC';
\t
\a
\pset fieldsep ' '
SELECT
    (('epoch'::timestamptz + 14*10^8 * '1 s'::interval)-now())::interval(0),
    (14*10^8-extract(epoch from now()))::int,
    extract(epoch from now())::int
;
\watch 1

09:18:28 33508 1399966492
09:18:27 33507 1399966493
09:18:26 33506 1399966494
09:18:25 33505 1399966495
09:18:24 33504 1399966496
09:18:23 33503 1399966497

Garbage Collection of Unused PostgreSQL Tables and Columns

Over the last five years, our database at Trustly has kept growing in terms of tables, columns and functions, and some of those tables and columns are no longer used by any database functions. Getting rid of them is important, as otherwise people working with the database will be confused and annoyed. Database developers should always be able to rely on the data model being relevant and up to date.

In our system, no applications access the database tables directly, instead everything goes through stored procedures.

This means that if a table or column name is not present anywhere in any function’s source code, it’s very likely the table/column is not being used by anything. The only exception is dynamically crafted queries executed using EXECUTE, where the table/column names are constructed from different parts. In our system, we thankfully only have a few such cases.
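For example, a function like this hypothetical one would defeat the check, since the name of the table it actually touches never appears verbatim in prosrc:

CREATE FUNCTION purge_audit(_year int)
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
    -- prosrc only contains the pattern 'audit_%s', never e.g. 'audit_2017'
    EXECUTE format('TRUNCATE TABLE audit_%s', _year);
END;
$$;

With that caveat noted, the following query lists tables whose names never occur in any function source: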

SELECT
    Tables.TableName
FROM (
    SELECT DISTINCT
        regexp_replace(pg_catalog.pg_class.relname,'s$','')
    FROM pg_catalog.pg_class
    INNER JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace
    WHERE pg_catalog.pg_class.relkind = 'r'
    AND pg_catalog.pg_namespace.nspname NOT IN ('information_schema','pg_catalog')
) Tables(TableName)
WHERE NOT EXISTS (
    SELECT 1 FROM pg_catalog.pg_proc
    WHERE pg_catalog.pg_proc.prosrc ~* Tables.TableName
)
ORDER BY Tables.TableName;

This query returned quite a lot of table names, about half of them false positives,
but still a manageable list to go through manually.

50 minutes of manual work later:

92 files changed, 1114 deletions(-)
DROP TABLE: 16
DROP VIEW: 6
DROP SEQUENCES: 7
DROP FK_CONSTRAINTS: 5
DROP CONSTRAINTS: 17
False positives: 14

Then I moved on to the task of finding unused table columns.
The query below excludes any false positives found in the previous query.

SELECT DISTINCT
    Columns.ColumnName
FROM (
    SELECT
        regexp_replace(pg_catalog.pg_attribute.attname,'id$','')
    FROM pg_catalog.pg_class
    INNER JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace
    INNER JOIN pg_catalog.pg_attribute ON pg_catalog.pg_attribute.attrelid = pg_catalog.pg_class.oid
    WHERE pg_catalog.pg_class.relkind = 'r'
    AND pg_catalog.pg_attribute.attnum > 0
    AND NOT pg_catalog.pg_attribute.attisdropped
    AND pg_catalog.pg_namespace.nspname NOT IN ('information_schema','pg_catalog')
    -- Exclude columns in tables we know are unused by the stored procedures, but we want to keep around anyway:
    AND pg_catalog.pg_class.relname !~* '^(alexacountrie|bankersworldonline|bindgroup|clearinghousecurrencie|dailystat|geoiporganization|hourlystat|usercommitment|polishbank|polishbanknumber|swedishpostalcode|testbasedata|testperlmodule|useraccesslogarchive)'
) Columns(ColumnName)
WHERE NOT EXISTS (
    SELECT 1 FROM pg_catalog.pg_proc
    WHERE pg_catalog.pg_proc.prosrc ~* Columns.ColumnName
)
ORDER BY Columns.ColumnName;

It took me two hours to go through all the code. It was an interesting journey back in time, with lots of memories.

50 files changed, 5396 insertions(+), 5917 deletions(-)
ALTER TABLE DROP COLUMN: 30
False positives: 87

The reason there were so many insertions and deletions is that the dropped columns affected some of the base tables with reference data, which had to be regenerated, thus affecting all lines in those files.

In summary, the ROI on those three hours of time invested is enormous. Developers can now feel confident all tables and columns fulfill a purpose in the system. This exercise will of course need to be repeated in the future though.

SET search_path for all SECURITY DEFINER functions

As pointed out by Peter Eisentraut in a blog post named
Schema Search Paths Considered Pain in the Butt, you need to make sure the search_path is explicitly set for all SECURITY DEFINER functions in PostgreSQL.

Fixing this manually for, in my case, 2106 functions is indeed a “pain in the butt”, so I crafted a little query to automate the job:

\t
\pset format unaligned
\o /tmp/fix_search_path_for_security_definer_functions.sql
select
    array_to_string(
        array_agg(
            -- inject SET search_path in-between LANGUAGE and SECURITY DEFINER in the declaration
            regexp_replace(
                pg_get_functiondef(oid),
                E'(LANGUAGE [a-z]+)\\s+(SECURITY DEFINER)',
                E'\\1\n SET search_path TO public, pg_temp\n \\2'
            )
        ),
        ';'
    ) || ';' -- also terminate the final statement, otherwise \i leaves it unexecuted in the query buffer
from pg_proc
where prosecdef is true -- SECURITY DEFINER functions
-- don't include functions for which we have already specified a search_path
and not (coalesce(array_to_string(proconfig,''),'') like '%search_path%')
-- public schema
and pronamespace = 2200
;
\t
\o
\i /tmp/fix_search_path_for_security_definer_functions.sql
-- If all goes well you should see a lot of CREATE FUNCTION being spammed on the screen
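Afterwards, the same predicates can be reused to verify that nothing was missed; this should return zero:

select count(*)
from pg_proc
where prosecdef is true
and not (coalesce(array_to_string(proconfig,''),'') like '%search_path%')
and pronamespace = 2200;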

Fun with pg_catalog.pg_depend

Learning PostgreSQL, and SQL in general, probably begins with the concept of TABLES, then VIEWS, INDEXES and maybe TRIGGERS.
Some users might never go any further, which is sad, because there is so much more to explore!

I thought it would be cool to automatically generate a graph showing the dependencies between objects and their types.
This shows the order in which the different types of objects can be created;
it is perhaps useful to think in terms of “after we have created a TABLE we can create an INDEX”.

Something like this would be nice to include in the PostgreSQL documentation online.
I think it would be helpful when learning about PostgreSQL’s different object types.

The graph below was produced using the GraphViz dot command and live data from pg_catalog.pg_depend:

(Image: object dependency graph generated from pg_catalog.pg_depend.)
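A minimal sketch of how such edges can be extracted (not the exact query behind the graph above, which also mapped catalog names like pg_class to friendlier labels like TABLE):

-- Emit one DOT edge per distinct (referenced, dependent) pair of object types,
-- approximated by the catalog each object lives in; wrap the output in
-- "digraph g { ... }" before feeding it to the dot command.
SELECT DISTINCT format('"%s" -> "%s";', refclassid::regclass, classid::regclass)
FROM pg_catalog.pg_depend
WHERE deptype <> 'p'; -- skip rows that merely pin built-in objects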

As we can see, before anything else, we need a SCHEMA, which is the root node.
Once we have a SCHEMA, we can create TABLES, TYPES, VIEWS, SEQUENCES and FUNCTIONS.
Some users might not even know about SCHEMAs, as the schema “public” is pre-installed.
To create an INDEX, we first need a TABLE.
Etc, etc, etc…

You might be surprised that FUNCTION and LANGUAGE have arrows pointing in both directions.
It turns out you need some functions, such as plperl_call_handler, before you can create a language like plperl.
The self-referencing arrow from/to FUNCTION is less surprising, as some functions can of course call other functions.

(Not all object types are included in this graph as I’m not using them all in my system.)