5 Novelties in PostgreSQL 10 That Will Make You a Happier Developer

31.10.2017|18 min read
Krzysztof Godlewski
Krzysztof GodlewskiHead of Web
ShareLinkedInTwitterFacebook
5 Novelties in PostgreSQL 10 That Will Make You a Happier Developer

In the past few years, we’ve heard a lot about NoSQL databases: Redis, MongoDB, Cassandra, Memchached, or DynamoDB, to name a few. Even though they gained a lot of deserved hype, the good old relational databases are mostly irreplaceable. PostgreSQL, an object-relational (we will get to it later) database management system, is still gaining new fans.

In this post, I will look at some examples of features introduced in the newest version of Postgres, which was released on 5 October. Throughout this post, I will present most of the queries on Postgres 10 on Ubuntu 16. Only the first part about identity columns will show how it used to be done in “the ol’ days” on Postgres 9.6 (on the left-hand side, as below).

Environment Setup

The setup process is rather easy. First, you need to install Docker CE according to your system preferences.

PostgreSQL 9.6

$ docker pull postgres:9.6 $ docker run --name old-postgres -d postgres:9.6 $ docker run -it --rm --link old-postgres:postgres postgres:9.6 psql -h postgres -U postgres 

PostgreSQL 10

$ docker pull postgres:10 $ docker run --name new-postgres -d postgres:10 $ docker run -it --rm --link new-postgres:postgres postgres:10 psql -h postgres -U postgres 

After pulling and starting the images, you should see prompts with the Postgres version you are using:

$ docker pull postgres:9.6 $ docker run --name old-postgres -d postgres:9.6 $ docker run -it --rm --link old-postgres:postgres postgres:9.6 psql -h postgres -U postgres 
$ docker pull postgres:10 $ docker run --name new-postgres -d postgres:10 $ docker run -it --rm --link new-postgres:postgres postgres:10 psql -h postgres -U postgres 

Now we’re ready to go!

1. Identity columns

After switching from an MS SQL Server, this was something I could not get used to: identity columns. In short, these are the columns that store the IDs of your rows – they should be unique and increment automatically.

CREATE TABLE foo (id SERIAL PRIMARY KEY, val1 INTEGER); CREATE TABLE \d List of relations Schema | Name | Type | Owner --------+------------+----------+---------- public | foo | table | postgres public | foo_id_seq | sequence | postgres 
CREATE TABLE foo (id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY, val1 INTEGER); CREATE TABLE \d List of relations Schema | Name | Type | Owner --------+------------+----------+---------- public | foo | table | postgres public | foo_id_seq | sequence | postgres 

The syntax is quite similar on both sides and underneath there still lies a sequence. So what’s the big deal? First of all, the new syntax is SQL-compliant, so it should now be easier to run code on different databases. What is more, you can explicitly decide on the behavior while inserting by choosing either ALWAYS or DEFAULT.

Another convenient thing is changing the next value of a sequence, which I used a lot while migrating databases. In the new version, you don’t have to pay attention to the sequence your table is using, because all the magic is done by PG:

ALTER SEQUENCE foo_id_seq RESTART WITH 1000; ALTER SEQUENCE 
ALTER TABLE foo ALTER COLUMN id RESTART WITH 1000; ALTER TABLE 

If this still does not convince you, there is also one more thing really exciting that I was not aware of before. Suppose we would like to create a copy of the table foo:

CREATE TABLE bar (LIKE foo INCLUDING ALL); CREATE TABLE \d List of relations Schema | Name | Type | Owner --------+------------+----------+---------- public | bar | table | postgres public | foo | table | postgres public | foo_id_seq | sequence | postgres 
CREATE TABLE foo (id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY, val1 INTEGER); CREATE TABLE \d List of relations Schema | Name | Type | Owner --------+------------+----------+---------- public | foo | table | postgres public | foo_id_seq | sequence | postgres 

Both operations resulted in creating a new table, but only in pg10, a new sequence is created as well. So if you wanted to insert some new rows into the table bar, in version 9.6 you would get id 1000 (it uses foo_id_seq), and in version 10, you will start from 1, which in my opinion is the desired result. Now, if you wanted to drop foo, you could come across some problems:

DROP TABLE foo; ERROR: cannot drop table foo because other objects depend on it DETAIL: default for table bar column id depends on sequence foo_id_seq HINT: Use DROP ... CASCADE to drop the dependent objects too. 
DROP TABLE foo; DROP TABLE 

You cannot drop foo, since its sequence is being used elsewhere. If you decide to do as the hint suggests, you are in for a hard time inserting values into the bar, though. And what if you created a lot of tables by copying some existing ones? You would end up with a lot of mess…

If you want to see some other monstrosities of the old approach, I encourage you to check out this article.

2. Native partitioning

This is a feature that I’m extremely hyped about because of its power and possible applications. I think that we’re still some way from home, but, in comparison to 9.6, it’s a big step forward for Postgres when it comes to ease of use and performance.

What is partitioning all about? Let me give you an example.

Some time ago, my colleagues at 10Clouds wrote about their attempt to predict smog levels by using machine learning. Consider a table that stores all the measurements. A minimal model table would have a timestamp of a sample’s acquisition, site id and type of pollutant (CO2, PM10, PM2.5, and so on). After some time, the amount of data in the table would grow considerably, and the performance would suffer.

» Will Vue.js Become a Giant Like Angular or React? Read our article to find out!

Depending on the type of calculations you perform most frequently, we could try splitting the data. If we decide that we want to have a separate table, e.g. for measurements from each month, this is where partitioning comes in handy. If the application usually queries the database for the records from the last two weeks, then we would have to scan at most two partitions instead of all the records. Of course, we could specify in the query which tables we should scan, but that would be troublesome and ugly, to be honest.

What we want to achieve is:

  • Create another level of abstraction – we want to query only one (master) table
  • The data themselves should be dispatched to different child tables based on the sample’s timestamp

Here’s how to accomplish this task in Postgres:

  1. Create a master table.
  2. Create as many child tables with datetime constraints as needed.
  3. Create indices, keys, and other constraints on child tables.
  4. Create a trigger on the master table that will dispatch rows to proper child tables before insert.

As of Postgres 10, point 4 is no longer needed – DBMS will handle it, and, thus, the syntax becomes greatly simplified. A sample implementation might look like this:

-- 1. Create master table, specify partitioning rule CREATE TABLE measurement( id INTEGER GENERATED ALWAYS AS IDENTITY, datetime TIMESTAMPTZ, site_id INTEGER, pollutant_id INTEGER, value FLOAT) PARTITION BY RANGE (datetime); -- 2. Create a couple of child tables. Define data range limits they should store CREATE TABLE measurement_201708 PARTITION OF measurement(datetime) FOR VALUES FROM ('2017-08-01') TO ('2017-09-01'); CREATE TABLE measurement_201709 PARTITION OF measurement(datetime) FOR VALUES FROM ('2017-09-01') TO ('2017-10-01'); CREATE TABLE measurement_201710 PARTITION OF measurement(datetime) FOR VALUES FROM ('2017-10-01') TO ('2017-11-01'); -- 3. Add all needed keys and indices, for each child table ALTER TABLE measurement_201708 ADD PRIMARY KEY (id); ALTER TABLE measurement_201708 ADD CONSTRAINT fk_measurement_201708_site FOREIGN KEY (site_id) REFERENCES site(id); CREATE INDEX idx_measurement_201708_datetime ON measurement_201708(datetime); 

Assuming that we created tables with sites and pollutants, this is how we might feed the measurement table and then query it:

-- Inserting looks the same as in normal tables INSERT INTO measurement(datetime, site_id, pollutant_id, value) SELECT '2017-08-01'::TIMESTAMPTZ + ((random()*90)::int) * INTERVAL '1 day', (1 + random()*(SELECT max(id)-1 FROM site))::int, (1 + random()*(SELECT max(id)-1 FROM pollutant))::int, random() FROM generate_series(1,1000000); -- Selecting data as well SELECT * FROM measurement WHERE datetime BETWEEN '2017-09-20' AND '2017-09-27'; 

The new syntax provides easier partitioning of data in specific ranges or belonging to specific categories. This means that we could also use a different approach and create a separate table for every measurement site. If one of them happens to be deactivated, we can easily detach this partition from the master table and archive it. What is probably the most important part is that the performance of native partitioning improved greatly. I recommend reading depesz’s blog post, where he explores it in depth.

On the other hand, there are still many issues with partitions: you have to set keys and indices for each child table, you can’t set a unique key across different child tables, etc. But we can see that partitioning is being worked on hard, so I hope to see more in the coming releases.

3. Multicolumn statistics

This feature might be a relief for many of you who know your data and happen to see a lot of underestimated execution plans. A hypothetical example: a business has a class of 1000 children that like counting rhymes. I mean, they really LOVE it. For another crazy reason, they want to store the info about which child said which word. You know your craft, so you don’t even ask why and just get down to coding stuff. Here’s how you might want to store and populate it:

CREATE TABLE counting_log (id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, datetime TIMESTAMP WITH TIME ZONE, child_id INTEGER, word TEXT); CREATE TABLE INSERT INTO counting_log(datetime, child_id, word) SELECT current_timestamp, i%1000, CASE WHEN i%4=1 THEN 'eeny' WHEN i%4=2 THEN 'meeny' WHEN i%4=3 THEN 'miny' WHEN i%4=0 THEN 'moe' ELSE 'nope' END FROM generate_series(1, 1000000) i; INSERT 0 1000000 CREATE INDEX idx_counting_log_child_id on counting_log(child_id); CREATE INDEX CREATE INDEX idx_counting_log_datetime on counting_log(datetime); CREATE INDEX 

As soon as you report you are ready, the business comes back to you and asks you to retrieve when the child under the id 123 that said ‘miny’… This is what Postgres would normally do:

  1. Estimate what fraction of total rows (p1) would be returned with the predicate child_id=123
  2. Estimate what fraction of total rows (p2) would be returned with the predicate word = ‘miny’
  3. Assume that the total number of rows using both predicates would be equal to: total_rows*p1*p2

This way of calculating the returned row number might be accurate only when columns in predicates are not correlated. In our case, they quite obviously are, so the estimate would be way lower than the actual result. But with the introduction of correlated statistics, things might change for better. Let’s see how it’s done now:

CREATE STATISTICS st_counting_log_child_id_word ON child_id, word FROM counting_log; CREATE STATISTICS ANALYZE counting_log; ANALYZE 

In these examples, I will be showing things in reverse order. First, let’s see what the estimate is with the help of the newly created statistic:

EXPLAIN SELECT datetime FROM counting_log WHERE child_id=123 AND word='miny'; QUERY PLAN ------------------------------------------------------------------------------------------- Bitmap Heap Scan on counting_log (cost=19.92..2696.34 rows=967 width=8) Recheck Cond: (child_id = 123) Filter: (word = 'miny'::text) -> Bitmap Index Scan on idx_counting_log_child_id (cost=0.00..19.68 rows=967 width=0) Index Cond: (child_id = 123) 

We expect 967 rows to be returned. Ok. How would we do it in the old-fashioned way?

DROP STATISTICS st_counting_log_child_id_word; DROP STATISTICS EXPLAIN SELECT datetime FROM counting_log WHERE child_id=123; QUERY PLAN ------------------------------------------------------------------------------------------- Bitmap Heap Scan on counting_log (cost=19.92..2693.92 rows=967 width=8) Recheck Cond: (child_id = 123) -> Bitmap Index Scan on idx_counting_log_child_id (cost=0.00..19.68 rows=967 width=0) Index Cond: (child_id = 123) EXPLAIN SELECT datetime FROM counting_log WHERE word='miny'; QUERY PLAN --------------------------------------------------------------------- Seq Scan on counting_log (cost=0.00..19695.00 rows=202133 width=8) Filter: (word = 'miny'::text) 

Now, let’s do some math:

returned rows=total rows * p1*p2=1000000*(967/1000000) * (202133/1000000)=195,46 This is what we would expect. And what would Postgres say?

EXPLAIN SELECT datetime FROM counting_log WHERE child_id=123 AND word='miny'; QUERY PLAN ------------------------------------------------------------------------------------------- Bitmap Heap Scan on counting_log (cost=19.73..2696.14 rows=195 width=8) Recheck Cond: (child_id = 123) Filter: (word = 'miny'::text) -> Bitmap Index Scan on idx_counting_log_child_id (cost=0.00..19.68 rows=967 width=0) Index Cond: (child_id = 123) 

Jackpot!

Now there is also one more thing – we are only talking about estimates. What’s the real number of returned rows?

SELECT count(datetime) FROM counting_log WHERE child_id=123 AND word='miny'; count ------- 1000 

To sum up: this example shows that we might achieve a lot if we know our data well. Ignoring the correlation led to a big underestimation – about one order of magnitude! Here it doesn’t look so serious, but consider joining another table that stores data about the children. With such a plan, a nested loop might be considered to be the cheapest, when, in fact, a hash join should be used. If you are interested in seeing what data are stored when collecting multicolumn statistics, just type:

SELECT * FROM pg_statistic_ext WHERE stxname = 'st_counting_log_child_id_word' \gx -[ RECORD 1 ]---+------------------------------ stxrelid | 16555 stxname | st_counting_log_child_id_word stxnamespace | 2200 stxowner | 16385 stxkeys | 3 4 stxkind | {d,f} stxndistinct | {"3, 4": 1000} stxdependencies | {"3 => 4": 1.000000} 

We will not delve into it any deeper, just note that stxkeys correspond to column numbers (you can specify more than two), and two kinds of statistics are collected: stxdistinct and stxdependencies. Oh, and by the way, did you notice that I ended the previous query with \gx instead of a semicolon? Yup, that’s also a new feature 😉

4. More parallelism!

The first attempts to introduce parallel queries had been made in Postgres 9.6. Since then you could use parallel sequential scans, hash and nested loop join strategies and aggregates. From now on, you will also be able to use parallel merge joins, bitmap heap scans and maybe, most importantly, index and index-only scans.

That’s why you can find the new settings for parallel query usage configuration in the Postgresql.conf file:

  • min_parallel_table_scan_size – the minimal size of a table for which parallelism can be triggered, by default 8MB
  • min_parallel_index_scan_size – same as the above but applied to indices, by default 512 kB
  • max_parallel_workers – the maximum number of parallel workers to be used, defaults to 8. Note that in the previous version of Postgres, max_parallel_workers_per_gather was introduced.

What are these settings for? Well, everything comes at a cost. Setting up a parallel worker sometimes might not be worth anything, because it would use more resources than simply running the query without any parallel workers. Again, a query by default might be considered by a planner to be executed with more than one worker, if the estimated cost is higher than 1000, which corresponds to the setting parallel_setup_cost.

From my short experience with Postgres 10, I have a feeling that the new parallel workers are rather shy creatures. Let’s see them in action. It isn’t really an interesting example, but let’s store values of trigonometric functions, just like that:

CREATE TABLE trigonometry AS SELECT i AS arg, sin(i) AS sine, cos(i) AS cosine, tan(i) AS tangent FROM generate_series(0, 100000, 0.01) i; SELECT 10000001 CREATE INDEX idx_trigonometry_arg on trigonometry(arg); CREATE INDEX create index idx_trigonometry_sine on trigonometry(sine); CREATE INDEX create index idx_trigonometry_cosine on trigonometry(cosine); CREATE INDEX 

First, some aggregations that were introduced in version 9.6:

EXPLAIN SELECT count(arg) FROM trigonometry WHERE arg > 50000; QUERY PLAN -------------------------------------------------------------------------------------------------- Finalize Aggregate (cost=140598.88..140598.89 rows=1 width=8) -> Gather (cost=140598.67..140598.88 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=139598.67..139598.68 rows=1 width=8) -> Parallel Seq Scan on trigonometry (cost=0.00..134437.55 rows=2064449 width=8) Filter: (arg > '50000'::numeric) 

Easy. The estimated cost is quite big, so the planner decided to activate two additional workers. Now let’s try to use a parallel index scan.

EXPLAIN SELECT * FROM trigonometry WHERE arg > 50000; QUERY PLAN ----------------------------------------------------------------------------------------------------- Index Scan using idx_trigonometry_arg on trigonometry (cost=0.43..201369.28 rows=4954677 width=32) Index Cond: (arg > '50000'::numeric) 

Nope. What if you helped it a little bit by lowering the cost of adding another worker?

SET parallel_setup_cost=100; SET EXPLAIN SELECT * FROM trigonometry WHERE arg > 50000; QUERY PLAN ----------------------------------------------------------------------------------------------------- Index Scan using idx_trigonometry_arg on trigonometry (cost=0.43..201367.27 rows=4954562 width=32) Index Cond: (arg > '50000'::numeric) 

Still nothing, which is a bit surprising for me. The estimated cost is high, the size of the table and the index is well above the threshold… Let’s try a different query, then:

SET parallel_setup_cost=1000; SET EXPLAIN SELECT arg FROM trigonometry WHERE sine > 0.999 AND arg >100 AND arg < 10000; QUERY PLAN --------------------------------------------------------------------------------------------------------------- Gather (cost=1000.43..40801.85 rows=13403 width=8) Workers Planned: 2 -> Parallel Index Scan using idx_trigonometry_arg on trigonometry (cost=0.43..38461.55 rows=5585 width=8) Index Cond: ((arg > '100'::numeric) AND (arg < '10000'::numeric)) Filter: (sine > '0.999'::double precision) 

Finally, there you are! Now you can see the Parallel Index Scan in all its glory. I ran this query a few of times and the results were retrieved in 162 ms on average. Now let’s play some more. Let’s try to forcefully use the parallelism and limit additional workers to 0.

SET max_parallel_workers =0; SET SET force_parallel_mode=on; SET EXPLAIN ANALYZE SELECT arg FROM trigonometry WHERE sine > 0.999 AND arg >100 AND arg < 10000; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------- Gather (cost=1000.43..40801.85 rows=13403 width=8) (actual time=0.265..263.796 rows=14097 loops=1) Workers Planned: 2 Workers Launched: 0 -> Parallel Index Scan using idx_trigonometry_arg on trigonometry (cost=0.43..38461.55 rows=5585 width=8) (actual time=0.072..262.133 rows=14097 loops=1) Index Cond: ((arg > '100'::numeric) AND (arg < '10000'::numeric)) Filter: (sine > '0.999'::double precision) Rows Removed by Filter: 975902 Planning time: 0.164 ms Execution time: 264.473 ms 

First of all, you see that you can execute this query. I want to emphasize it here, because it wasn’t that obvious to me that the number of launched workers might be equal to 0, which means that there are no additional workers.

Secondly, if you limit the number of available processes, the planner might want to use more than it actually can. Finally, the results are faster when you use a parallel index scan, because the results for the above query were returned on average after 260ms. All in all, that’s awesome, even though many of us might not even be aware that this feature exists.

5. Full-text search for JSON and JSONB

Last but not least, full-text search support was added for JSON and JSONB type columns. If you’re familiar with how it works on text columns, this won’t be a breakthrough, but it still is a nice feature that comes out-of-the-box in Postgres 10.

Imagine that you are using an external system for payment management. You store most of the data in normal columns but, just to be safe, you keep responses from that system also in JSON columns. This is what it might look like:

CREATE TABLE transaction(id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, transaction_id VARCHAR(10), user_id INTEGER, created_datetime TIMESTAMP WITH TIME ZONE, result BOOL, amount INT ,response_data JSON); CREATE INSERT INTO transaction(transaction_id, user_id, created_datetime, result, amount, response_data) SELECT tran.id, ceil(random()*100), tran.datetime, tran.result, tran.amount, ('{"transaction": {"id": "'|| tran.id ||'", "transaction_datetime": "'|| tran.datetime || '","amount": '|| tran.amount::text || ',"is_success": "'|| tran.result || '", "message": "'|| tran.msg || '"}}')::json FROM ( SELECT substring(md5(random()::text), 1, 10) as id, current_timestamp + (ceil(random()*1000)-500) * INTERVAL '1 minute' as datetime, ceil(random()*1000) as amount, NOT (i%3=1) as result, CASE WHEN i%9=1 THEN 'insufficient funds' WHEN i%9=4 THEN 'blocked account' WHEN i%9=7 THEN 'fraud detected' ELSE 'accepted' END as msg FROM generate_series(1,1000) i) tran; INSERT 0 1000 

This is a small example, but if the table grows considerably, it might be a good idea to create an index for the JSON column. Here’s how we do it:

CREATE INDEX idx_transaction_response_data ON transaction USING GIN (to_tsvector('english', response_data)); CREATE INDEX 

Now that we have some data in the table, let’s take a look how we can query it, based on, for instance, insufficient funds response message. Do we use our new index?

SELECT transaction_id FROM transaction WHERE to_tsvector('english', response_data) @@ to_tsquery('english', 'insufficient') LIMIT 5; transaction_id ---------------- b114b0927f 28613b40c6 649b9d3285 ce9c16ec6f 24f69de12e EXPLAIN SELECT transaction_id FROM transaction WHERE to_tsvector('english', response_data) @@ to_tsquery('english', 'insufficient') LIMIT 5; QUERY PLAN ---------------------------------------------------------------------------------------------------------- Limit (cost=8.04..23.12 rows=5 width=11) -> Bitmap Heap Scan on transaction (cost=8.04..23.12 rows=5 width=11) Recheck Cond: (to_tsvector('english'::regconfig, response_data) @@ '''insuffici'''::tsquery) -> Bitmap Index Scan on idx_transaction_response_data (cost=0.00..8.04 rows=5 width=0) Index Cond: (to_tsvector('english'::regconfig, response_data) @@ '''insuffici'''::tsquery) 

Cool, really cool! Note that we needed to convert the word ‘insufficient’ to lexeme ‘insuffici’ to query efficiently.

Ugh… Is this it?

Nope, not even close. As I wrote at the beginning, there is a big hype for Postgres10, as this version introduced a lot of new features. The five novelties above are just the tip of the iceberg, but I chose them, because I believe that I will be using them most often (consciously or not).

So what did I skip?

  • logical replication
  • quorum commit for synchronous replication
  • XML tables
  • SCRAM authentication
  • renaming of some functions
  • \if \elif \else statements in psql

… and many, many more!

For a full list, visit the Postgres wiki. And if you can’t wait for even newer software, visit depesz’s blog, where he has already started describing features from upcoming Postgres 11.

Krzysztof Godlewski
Krzysztof GodlewskiHead of Web

Start a project with 10Clouds

Hire us