Chapter 7. SQL: The PostgreSQL Way

PostgreSQL surpasses other database products in ANSI SQL compliance. It cements its lead by adding constructs that range from convenient syntax shorthands to avant-garde features that break the bounds of traditional SQL. In this chapter, we’ll cover some SQL tidbits not often found in other databases. For this chapter, you should have a working knowledge of SQL; otherwise, you may not appreciate the labor-saving amuse-bouche that PostgreSQL brings to the table.

Views

Well-designed relational databases store data in normalized form. To access this data across scattered tables, you write queries to join underlying tables. When you find yourself writing the same query over and over again, create a view. Simply put, a view is nothing more than a query permanently stored in the database.

Some purists have argued that one should always query a view, never tables. This means you must create a view for every table that you intend to query directly. The added layer of indirection eases management of permissions and facilitates abstraction of table data. We find this to be sound advice, but laziness gets the better of us.

Views in PostgreSQL have evolved over the years. Version 9.3 unveiled automatically updatable views. If your view draws from a single table and you include the primary key as an output column, you can issue an update command directly against your view. Data in the underlying table will follow suit.

Version 9.3 also introduced materialized views. When you mark a view as materialized, it will requery the data only when you issue the REFRESH command. The upside is that you’re not wasting resources running complex queries repeatedly; the downside is that you might not have the most up-to-date data when you use the view. Furthermore, under some circumstances you are barred from access to the view during a refresh.

Version 9.4 allows users to access materialized views during refreshes. It also introduced the WITH CHECK OPTION modifier, which prevents inserts and updates outside the scope of the view.

Single Table Views

The simplest view draws from a single table. Always include the primary key if you intend to write data back to the table, as shown in Example 7-1.

Example 7-1. Single table view

CREATE OR REPLACE VIEW census.vw_facts_2011 AS
SELECT fact_type_id, val, yr, tract_id FROM census.facts WHERE yr = 2011;

As of version 9.3, you can alter the data in this view by using INSERT, UPDATE, or DELETE commands. Updates and deletes will abide by any WHERE condition you have as part of your view. For example, the following query will delete only records whose value is 0:

DELETE FROM census.vw_facts_2011 WHERE val = 0;

And the following will not update any records, because the view explicitly includes only records for 2011:

UPDATE census.vw_facts_2011 SET val = 1 WHERE yr = 2012;

Be aware that you can insert data that places it outside of the view’s WHERE or update data so it is no longer visible from the view as shown in Example 7-2.

Example 7-2. View update that results in data no longer visible in view

UPDATE census.vw_facts_2011 SET yr = 2012 WHERE yr = 2011;

The update of Example 7-2 does not violate the WHERE condition. But, once executed, you would have emptied your view. For the sake of sanity, you may find it desirable to prevent updates or inserts that leave data invisible to further queries. Version 9.4 introduced the WITH CHECK OPTION to accomplish this. Include this modifier when creating the view and PostgreSQL will forever balk at any attempts to add records outside the view and to update records that will put them outside the view. In our example view, our goal is to limit vw_facts_2011 to allow inserts only of 2011 data and disallow updates of the yr to something other than 2011. To add this restriction, we revise our view definition as shown in Example 7-3.

Example 7-3. Single table view WITH CHECK OPTION

CREATE OR REPLACE VIEW census.vw_facts_2011 AS
SELECT fact_type_id, val, yr, tract_id FROM census.facts 
WHERE yr = 2011 WITH CHECK OPTION;

Now try to run an update such as:

UPDATE census.vw_facts_2011 SET yr = 2012 WHERE val > 2942;

You’ll get an error:

ERROR: New row violates WITH CHECK OPTION for view "vw_facts_2011"
DETAIL: Failing row contains (1, 25001010500, 2012, 2985.000, 100.00).

Using Triggers to Update Views

Views can encapsulate joins among tables. When a view draws from more than one table, updating the underlying data with a simple command is no longer possible. Drawing data from more than one table introduces inherent ambiguity when you’re trying to update the underlying data, and PostgreSQL is not about to make an arbitrary decision for you. For instance, if you have a view that joins a table of countries with a table of provinces, and then decide to delete one of the rows, PostgreSQL won’t know whether you intend to delete only a country, a province, or a particular country-province pairing. Nonetheless, you can still modify the underlying data through the view using triggers.

Let’s start by creating a view that pulls rows from the facts table and a lookup table, as shown in Example 7-4.

Example 7-4. Creating view vw_facts

CREATE OR REPLACE VIEW census.vw_facts AS
SELECT
	y.fact_type_id, y.category, y.fact_subcats, y.short_name,
	x.tract_id, x.yr, x.val, x.perc
FROM census.facts As x INNER JOIN census.lu_fact_types As y
ON x.fact_type_id = y.fact_type_id;

To make this view updatable with a trigger, you can define one or more INSTEAD OF triggers. We first define the trigger function to handle the trifecta: INSERT, UPDATE, DELETE. In addition, PostgreSQL supports triggers on the TRUNCATE event. You can use any language to write the function except SQL, and you’re free to name it whatever you like. We chose PL/pgSQL in Example 7-5.

Example 7-5. Trigger function for vw_facts to insert, update, delete

CREATE OR REPLACE FUNCTION census.trig_vw_facts_ins_upd_del() RETURNS trigger AS
$$
BEGIN
    IF (TG_OP = 'DELETE') THEN 
        DELETE FROM census.facts AS f
        WHERE
            f.tract_id = OLD.tract_id AND f.yr = OLD.yr AND
            f.fact_type_id = OLD.fact_type_id;
        RETURN OLD;
    END IF;
    IF (TG_OP = 'INSERT') THEN 
        INSERT INTO census.facts(tract_id, yr, fact_type_id, val, perc)
        SELECT NEW.tract_id, NEW.yr, NEW.fact_type_id, NEW.val, NEW.perc;
        RETURN NEW;
    END IF;
    IF (TG_OP = 'UPDATE') THEN 
        IF
            ROW(OLD.fact_type_id, OLD.tract_id, OLD.yr, OLD.val, OLD.perc) !=
            ROW(NEW.fact_type_id, NEW.tract_id, NEW.yr, NEW.val, NEW.perc)
        THEN 
            UPDATE census.facts AS f
            SET
                tract_id = NEW.tract_id,
                yr = NEW.yr,
                fact_type_id = NEW.fact_type_id,
                val = NEW.val,
                perc = NEW.perc
            WHERE
                f.tract_id = OLD.tract_id AND
                f.yr = OLD.yr AND
                f.fact_type_id = OLD.fact_type_id;
            RETURN NEW;
        ELSE
            RETURN NULL;
        END IF;
    END IF;
END;
$$
LANGUAGE plpgsql VOLATILE;

: Handles deletes. Delete only records with matching keys in the OLD record.
: Handles inserts.
: Handles updates. Use the OLD record to determine which records to update. NEW record has the new data.
: Update rows only if at least one of the columns from the facts table has changed.

Next, we bind the trigger function to the view, as shown in Example 7-6.

Example 7-6. Bind trigger function to view

CREATE TRIGGER trig_01_vw_facts_ins_upd_del
INSTEAD OF INSERT OR UPDATE OR DELETE ON census.vw_facts
FOR EACH ROW EXECUTE PROCEDURE census.trig_vw_facts_ins_upd_del();

The binding syntax is uncharacteristically English-like.

Now when we update, delete, or insert into our view, we update the underlying facts table instead:

UPDATE census.vw_facts SET yr = 2012 
WHERE yr = 2011 AND tract_id = '25027761200';

Upon a successful update, PostgreSQL returns the following message:

Query returned successfully: 56 rows affected, 40 ms execution time.

If we try to update a field not in our update row comparison, the update will not take place:

UPDATE census.vw_facts SET short_name = 'test';

With a message:

Query returned successfully: 0 rows affected, 931 ms execution time.

Although this example created a single trigger function to handle multiple events, we could have just as easily created a separate trigger and trigger function for each event.

PostgreSQL has another approach for updating views called rules, which predates the introduction of INSTEAD OF triggers view support. You can see an example using rules in Database Abstraction with Updatable Views.

You can still use rules to update view data, but INSTEAD OF triggers are preferred now. Internally PostgreSQL still uses rules to define the view (a view is nothing but an INSTEAD OF SELECT rule on a virtual table) and to implement single table updatable views. The difference between using a trigger and a rule is that a rule rewrites the underlying query and a trigger gets called for each virtual row. As such, rules become overwhelmingly difficult to write (and understand) when many tables are involved. Rules are also limited because they can be written only in SQL, not in other procedural languages.

Materialized Views

Materialized views cache the fetched data. This happens when you first create the view as well as when you run the REFRESH MATERIALIZED VIEW command. To use materialized views, you need at least version 9.3.

The most convincing cases for using materialized views are when the underlying query takes a long time and when having timely data is not critical. You often encounter these scenarios when building online analytical processing (OLAP) applications.

Unlike nonmaterialized views, you can add indexes to materialized views to speed up the read.

Example 7-7 demonstrates how to make a materialized version of the view in Example 7-1.

Example 7-7. Materialized view

CREATE MATERIALIZED VIEW census.vw_facts_2011_materialized AS
SELECT fact_type_id, val, yr, tract_id FROM census.facts WHERE yr = 2011;

Create an index on a materialized view as you would do on a regular table, as shown in Example 7-8.

Example 7-8. Add index to materialized view

CREATE UNIQUE INDEX ix
ON census.vw_facts_2011_materialized (tract_id, fact_type_id, yr);

For speedier access to a materialized view with a large number of records, you may want to control the physical sort of the data. The easiest way is to include an ORDER BY when you create the view. Alternatively, you can add a cluster index to the view. First, create an index in the physical sort order you want to have. Then run the CLUSTER command, passing it the index, as shown in Example 7-9.

Example 7-9. Clustering and reclustering a view on an index

CLUSTER census.vw_facts_2011_materialized USING ix; 
CLUSTER census.vw_facts_2011_materialized;

: Name the index to cluster on. Needed only during view creation.
: Each time you refresh, you must recluster the data.

The advantage of using ORDER BY in the materialized view over using the CLUSTER approach is that the sort is maintained with each REFRESH MATERIALIZED VIEW call, alleviating the need to recluster. The downside is that ORDER BY generally adds more processing time to the REFRESH step of the view. You should test the effect of ORDER BY on performance of REFRESH before using it. One way to test is just to run the underlying query of the view with an ORDER BY clause.

To refresh the view in PostgreSQL 9.3, use:

REFRESH MATERIALIZED VIEW census.vw_facts_2011_materialized;

The view cannot be queried while the REFRESH MATERIALIZED VIEW step is running.

In PostgreSQL 9.4, to allow the view to be queried while it’s refreshing, you can use:

REFRESH MATERIALIZED VIEW CONCURRENTLY census.vw_facts_2011_materialized;

Current limitations of materialized views include:

You can’t use CREATE OR REPLACE to edit an existing materialized view. You must drop and re-create the view even for the most trivial of changes. Use DROP MATERIALIZED VIEW name_of_view. Annoyingly, you’ll lose all your indexes.
You need to run REFRESH MATERIALIZED VIEW to rebuild the cache. PostgreSQL doesn’t perform automatic recaching of any kind. You need to resort to mechanisms such as crontab, pgAgent jobs, or triggers to automate any kind of refresh. We have an example using triggers in Caching Data with Materialized Views and Statement-Level Triggers.
Refreshing materialized views in version 9.3 is a blocking operation, meaning that the view will not be accessible during the refresh process. In version 9.4 you can lift this quarantine by adding the CONCURRENTLY keyword to your REFRESH command, provided that you have established a unique index on your view. The trade-off is concurrent refreshes could take longer to complete.

Handy Constructions

In our many years of writing SQL, we have come to appreciate the little things that make better use of our typing. Only PostgreSQL offers some of the gems we present in this section. Often this means that the construction is not ANSI-compliant. If thy God demands strict observance to the ANSI SQL standards, abstain from the short-cuts that we’ll be showing.

DISTINCT ON

One of our favorites is DISTINCT ON. It behaves like DISTINCT, but with two enhancements: you can specify which columns to consider as distinct and to sort the remaining columns. One little word—ON—replaces numerous lines of additional code to achieve the same result.

In Example 7-10, we demonstrate how to get the details of the first tract for each county.

Example 7-10. DISTINCT ON

SELECT DISTINCT ON (left(tract_id, 5))
    left(tract_id, 5) As county, tract_id, tract_name
FROM census.lu_tracts
ORDER BY county, tract_id;

county |  tract_id   | tract_name
-------+-------------+---------------------------------------------------
25001  | 25001010100 | Census Tract 101, Barnstable County, Massachusetts
25003  | 25003900100 | Census Tract 9001, Berkshire County, Massachusetts
25005  | 25005600100 | Census Tract 6001, Bristol County, Massachusetts
25007  | 25007200100 | Census Tract 2001, Dukes County, Massachusetts
25009  | 25009201100 | Census Tract 2011, Essex County, Massachusetts
:
(14 rows)

The ON modifier accepts multiple columns, considering all of them to determine distinctness. The ORDER BY clause has to start with the set of columns in the DISTINCT ON; then you can follow with your preferred ordering.

LIMIT and OFFSET

LIMIT returns only the number of rows indicated; OFFSET indicates the number of rows to skip. You can use them in tandem or separately. You almost always use them in conjunction with an ORDER BY. In Example 7-11, we demonstrate use of a positive offset. Leaving out the offset yields the same result as setting the offset to zero.

Limits and offsets are not unique to PostgreSQL and are in fact copied from MySQL, although implementation differs widely among database products.

Example 7-11. First tract for counties 2 through 5

SELECT DISTINCT ON (left(tract_id, 5))
    left(tract_id, 5) As county, tract_id, tract_name
FROM census.lu_tracts
ORDER BY county, tract_id LIMIT 3 OFFSET 2;

county | tract_id    | tract_name
-------+-------------+-------------------------------------------------
25005  | 25005600100 | Census Tract 6001, Bristol County, Massachusetts
25007  | 25007200100 | Census Tract 2001, Dukes County, Massachusetts
25009  | 25009201100 | Census Tract 2011, Essex County, Massachusetts
(3 rows)

Shorthand Casting

ANSI SQL defines a construct called CAST that allows you to morph one data type to another. For example, CAST('2011-1-11' AS date) casts the text 2011-1-1 to a date. PostgreSQL has shorthand for doing this, using a pair of colons, as in '2011-1-1'::date. This syntax is shorter and easier to apply for cases in which you can’t directly cast from one type to another and have to intercede with one or more intermediary types, such as someXML::text::integer.

Multirow Insert

PostgreSQL supports the multirow constructor to insert more than one record at a time. Example 7-12 demonstrates how to use a multirow construction to insert data into the table we created in Example 6-3.

Example 7-12. Using a multirow constructor to insert data

INSERT INTO logs_2011 (user_name, description, log_ts)
VALUES
    ('robe', 'logged in', '2011-01-10 10:15 AM EST'),
    ('lhsu', 'logged out', '2011-01-11 10:20 AM EST');

The latter portion of the multirow constructor, starting with the VALUES keyword, is often referred to as a values list. A values list can stand alone and effectively creates a table on the fly, as in Example 7-13.

Example 7-13. Using a multirow constructor as a virtual table

SELECT *
FROM (
    VALUES
        ('robe', 'logged in', '2011-01-10 10:15 AM EST'::timestamptz),
        ('lhsu', 'logged out', '2011-01-11 10:20 AM EST'::timestamptz)
    ) AS l (user_name, description, log_ts);

When you use VALUES as a stand-in for a virtual table, you need to specify the names for the columns. You also need to explicitly cast the values to the data types in the table if the parser can’t infer the data type from the data. The multirow VALUES construct also exists in MySQL and SQL Server.

ILIKE for Case-Insensitive Search

PostgreSQL is case-sensitive. However, it does have mechanisms in place to ignore casing. You can apply the upper function to both sides of the ANSI LIKE operator, or you can simply use the ILIKE (~~*) operator:

SELECT tract_name FROM census.lu_tracts WHERE tract_name ILIKE '%duke%';

tract_name
------------------------------------------------
Census Tract 2001, Dukes County, Massachusetts
Census Tract 2002, Dukes County, Massachusetts
Census Tract 2003, Dukes County, Massachusetts
Census Tract 2004, Dukes County, Massachusetts
Census Tract 9900, Dukes County, Massachusetts

ANY Array Search

PostgreSQL has a construct c alled ANY that can be used in conjunction with arrays, combined with a comparator operator or comparator keyword. If any element of the array matches a row, that row is returned.

Here is an example:

SELECT tract_name
FROM census.lu_tracts
WHERE tract_name ILIKE ANY(ARRAY['%99%duke%','%06%Barnstable%']::text[]);

tract_name
-----------------------------------------------------
Census Tract 102.06, Barnstable County, Massachusetts
Census Tract 103.06, Barnstable County, Massachusetts
Census Tract 106, Barnstable County, Massachusetts
Census Tract 9900, Dukes County, Massachusetts
(4 rows)

The example just shown is a shorthand way of using multiple ILIKE OR clauses. You can use ANY with other comparators such as LIKE, =, and ~ (the regex like operator).

ANY can be used with any data types and comparison operators (operators that return a Boolean), including ones you built yourself or installed via extensions.

Set-Returning Functions in SELECT

A set-returning function is a function that could return more than one row.

PostgreSQL allows set-returning functions to appear in the SELECT clause of an SQL statement. This is not true of most other databases, in which only scalar functions can appear in the SELECT.

Interweaving some set-returning functions into an already complicated query could produce results beyond what you expect, because these functions usually result in the creation of new rows. You must anticipate this if you’ll be using the results as a subquery. In Example 7-14, we demonstrate row creation resulting from using a temporal version of generate_series. The example uses a table that we construct with:

CREATE TABLE interval_periods (i_type interval);
INSERT INTO interval_periods (i_type)
VALUES ('5 months'), ('132 days'), ('4862 hours');

Example 7-14. Set-returning function in SELECT

SELECT i_type,
    generate_series('2012-01-01'::date,'2012-12-31'::date,i_type) As dt
FROM interval_periods;

i_type     | dt
-----------+-----------------------
5 months   | 2012-01-01 00:00:00-05
5 months   | 2012-06-01 00:00:00-04
5 months   | 2012-11-01 00:00:00-04
132 days   | 2012-01-01 00:00:00-05
132 days   | 2012-05-12 00:00:00-04
132 days   | 2012-09-21 00:00:00-04
4862 hours | 2012-01-01 00:00:00-05
4862 hours | 2012-07-21 15:00:00-04

Restricting DELETE, UPDATE, and SELECT from Inherited Tables

When you query from a table that has child tables, the query automatically drills down into the children, creating a union of all the child records satisfying the query condition. DELETE and UPDATE work the same way, drilling down the hierarchy for victims. Sometimes this is not desirable because you want data to come only from the table you specified, without the kids tagging along.

This is where the ONLY keyword comes in handy. We show an example of its use in Example 7-37, where we want to delete only those records from the production table that haven’t migrated to the log table. Without the ONLY modifier, we’d end up deleting records from the child table that might have already been moved previously.

DELETE USING

Often, when you delete data from a table, you’ll want to delete the data based on its presence in another set of data. Specify this additional set with the USING predicate. Then, in the WHERE clause, you can use both datasets in the USING and in the FROM to define conditions for deletion. Multiple tables can be included in USING, separated by commas. Example 7-15 deletes all records from census.facts that correspond to a fact type of short_name = 's01'.

Example 7-15. DELETE USING

DELETE FROM census.facts
USING census.lu_fact_types As ft
WHERE facts.fact_type_id = ft.fact_type_id AND ft.short_name = 's01';

The standards-compliant way would be to use a clunkier IN expression in the WHERE.

Returning Affected Records to the User

The RETURNING predicate is supported by ANSI SQL standards but not commonly found in other relational databases. We show an example in Example 7-37, where we return the records deleted. RETURNING can also be used for inserts and updates. For inserts into tables with a serial key, RETURNING is invaluable because it returns the key value of the new rows—something you wouldn’t know prior to the query execution. Although RETURNING is often accompanied by * for all fields, you can limit the fields as we do in Example 7-16.

Example 7-16. Returning changed records of an UPDATE with RETURNING

UPDATE census.lu_fact_types AS f
SET short_name = replace(replace(lower(f.fact_subcats[4]),' ','_'),':','')
WHERE f.fact_subcats[3] = 'Hispanic or Latino:' AND f.fact_subcats[4] > ''
RETURNING fact_type_id, short_name;

fact_type_id | short_name
-------------+-------------------------------------------------
96           | white_alone
97           | black_or_african_american_alone
98           | american_indian_and_alaska_native_alone
99           | asian_alone
100          | native_hawaiian_and_other_pacific_islander_alone
101          | some_other_race_alone
102          | two_or_more_races

UPSERTs: INSERT ON CONFLICT UPDATE

New in version 9.5 is the INSERT ON CONFLICT construct, which is often referred to as an UPSERT. This feature is useful if you don’t know a record already exists in a table and rather than having the insert fail, you want it to either update the existing record or do nothing.

This feature requires a unique key, primary key, unique index, or exclusion constraint in place, that when violated, you’d want different behavior like updating the existing record or not doing anything. To demonstrate, imagine we have a table of colors to create:

CREATE TABLE colors(color varchar(50) PRIMARY KEY, hex varchar(6));
    INSERT INTO colors(color, hex) 
    VALUES('blue', '0000FF'), ('red', 'FF0000');

We then get a new batch of colors to add to our table, but some may be present already. If we do a regular insert, we’d get a primary key violation when we tried to add colors already in the table. When we run Example 7-17, we get only one record inserted, the green that is not already in our table, and each subsequent run would result in no records being inserted.

Example 7-17. ON CONFLICT DO NOTHING

INSERT INTO colors(color, hex) 
   VALUES('blue', '0000FF'), ('red', 'FF0000'), ('green', '00FF00') 
ON CONFLICT DO NOTHING ;

Someone could come and put in a different case 'Blue' in our system, and we’d then have two different cased blues. To remedy this, we can put a unique index on our table:

CREATE UNIQUE INDEX uidx_colors_lcolor ON colors USING btree(lower(color));

As before, if we tried to insert a 'Blue', we’d be prevented from doing so and the ON CONFLICT DO NOTHING would result in nothing happening. If we really wanted to spell the colors as given to us, we could use code like that given in Example 7-18.

Example 7-18. ON CONFLICT DO UPDATE

INSERT INTO colors(color, hex) 
  VALUES('Blue', '0000FF'), ('Red', 'FF0000'), ('Green', '00FF00') 
ON CONFLICT(lower(color))  
    DO UPDATE SET color = EXCLUDED.color, hex = EXCLUDED.hex;

In Example 7-18 we specified the conflict, which matches the expression of a constraint or unique index, so using something like upper(color) would not work since the colors table has no matching index for that expression.

In the case of INSERT ON CONFLICT DO UPDATE, you need to specify the conflicting condition or CONSTRAINT name. If using a constraint, you’d use ON CONFLICT ON CONSTRAINT constraint_name_here as shown in Example 7-19.

Example 7-19. ON CONFLICT DO UPDATE

INSERT INTO colors(color, hex)
        VALUES('Blue', '0000FF'), ('Red', 'FF0000'), ('Green', '00FF00')
ON CONFLICT ON CONSTRAINT colors_pkey
DO UPDATE SET color = EXCLUDED.color, hex = EXCLUDED.hex;;

The DO part of the INSERT construct will only happen if there is a primary key, unique index, or unique key constraint error triggered. However, errors such as data type ones or check constraints will fail and never be processed by DO UPDATE.

Composite Types in Queries

PostgreSQL automatically creates data types of all tables. Because data types derived from tables contain other data types, they are often called composite data types, or just composites. The first time you see a query with composites, you might be surprised. In fact, you might come across their versatility by accident when making a typo in an SQL statement. Try the following query:

SELECT x FROM census.lu_fact_types As x LIMIT 2;

At first glance, you might think that we left out a .* by accident, but check out the result:

x
------------------------------------------------------------------
(86,Population,"{D001,Total:}",d001)
(87,Population,"{D002,Total:,""Not Hispanic or Latino:""}",d002)

Instead of erroring out, the preceding example returns the canonical representation of a lu_fact_type data type. Composites can serve as input to several useful functions, among which are array_agg and hstore (a function packaged with the hstore extension that converts a row into a key-value pair object).

If you are building web applications, you can take advantage of the built-in JSON and JSONB support we covered in “JSON” and use a combination of array_agg and array_to_json to output a query as a single JSON object as shown in Example 7-20. In PostgreSQL 9.4, you can use json_agg. See Example 7-21.

Example 7-20. Query to JSON output

SELECT array_to_json(array_agg(f)) As cat 
FROM (
    SELECT MAX(fact_type_id) As max_type, category 
    FROM census.lu_fact_types
    GROUP BY category
) As f;

This will give you an output of:

cats
----------------------------------------------------
[{"max_type":102,"category":"Population"},
{"max_type":153,"category":"Housing"}]

: Defines a subquery with name f. f can then be used to reference each row in the subquery.
: Aggregate each row of subquerying using array_agg and then convert the array to json with array_to_json.

In version 9.3, the json_agg function replaces the chain of array_to_json and array_agg, offering both convenience and speed. In Example 7-21, we repeat Example 7-20 using json_agg, and both examples will have the same output.

Example 7-21. Query to JSON using json_agg

SELECT json_agg(f) As cats
FROM (
    SELECT MAX(fact_type_id) As max_type, category
    FROM census.lu_fact_types
    GROUP BY category
) As f;

Dollar Quoting

In standard ANSI SQL, single quotes (') surround string literals. Should you have a single quote in the string itself, such as last names like O’Nan, possesives like mon’s place, or contractions like can’t, you need to escape it with another. The escape character is another single quote placed in front of the single quote you’re trying to escape. Say you’re writing an insert statement where you copied a large passage from a novel. Affixing yet another single quote to all existing single quotes is both tedious to add and challenging to read. After all, two single quotes look awfully like one double quote, which is another character entirely.

PostgreSQL lets you escape single quotes in strings of any length by surrounding them with two sequential dollar signs ($$), hence the name dollar quoting.

Dollar quoting is also useful in situations where you’re trying to execute a piece of SQL dynamically, such as exec(some sql). In Example 7-5, we enclosed the body of a trigger using dollar quoting.

If you are writing an SQL statement that glues two sentences with many single quotes, the ANSI standard way would be to escape as in the following:

SELECT 'It''s O''Neil''s play. ' || 'It''ll start at two o''clock.'

With dollar quoting:

SELECT $$It's O'Neil's play. $$ || $$It'll start at two o'clock.$$

The pair of dollar signs replaces the single quote and escapes all single quotes within.

A variant of dollar quoting is named dollar quoting. We cover this in the following section.

DO

The DO command allows you to inject a piece of procedural code into your SQL on the fly. You can think of it as a one-time anonymous function. As an example, we’ll load the data collected in Example 3-10 into production tables from our staging table. We’ll use PL/pgSQL for our procedural snippet, but you’re free to use other languages.

First, we’ll create the table:

set search_path=census;
DROP TABLE IF EXISTS lu_fact_types CASCADE;
CREATE TABLE lu_fact_types (
    fact_type_id serial,
    category varchar(100),
    fact_subcats varchar(255)[],
    short_name varchar(50),
    CONSTRAINT pk_lu_fact_types PRIMARY KEY (fact_type_id)
);

Then we’ll use DO to populate it as shown in Example 7-22. CASCADE will force the drop of any related objects such as foreign key constraints and views, so be cautious when using CASCADE.

Example 7-22 generates a series of INSERT INTO SELECT statements. The SQL also performs an unpivot operation to convert columnar data into rows.

Warning

Example 7-22 is only a partial listing of the code needed to build lu_fact_types. For the full code, refer to the building_census_tables.sql file that is part of the book code and data download.

Example 7-22. Using DO to generate dynamic SQL

DO language plpgsql
$$
DECLARE var_sql text;
BEGIN
    var_sql := string_agg(
        $sql$ 
        INSERT INTO lu_fact_types(category, fact_subcats, short_name)
        SELECT
            'Housing',
            array_agg(s$sql$ || lpad(i::text,2,'0')
              || ') As fact_subcats,'
              || quote_literal('s' || lpad(i::text,2,'0')) || ' As short_name
        FROM staging.factfinder_import
        WHERE s' || lpad(I::text,2,'0') || $sql$ ~ '^[a-zA-Z]+' $sql$, ';'
    )
    FROM generate_series(1,51) As I; 
    EXECUTE var_sql; 
END
$$;

: Use of dollar quoting, so we don’t need to escape ' in Housing. Since the DO command is also wrapped in dollars, we need to use a named $ delimiter inside. We chose $sql$ .
: Use string_agg to form a set of SQL statements as a single string of the form INSERT INTO lu_fact_type(...) SELECT ... WHERE s01 ~ '[a-zA-Z]+';
: Execute the SQL.

In Example 7-22, we are using the dollar-quoting syntax covered in “Dollar Quoting” for the body of the DO function and some fragments of the SQL statements inside the function. Since we use dollar quoting to define the whole body of the DO as well as internally, we need to use named dollar quoting for at least one part. The same dollar-quoting nested approach can be used for functon definitions as well.

FILTER Clause for Aggregates

New in version 9.4 is the FILTER clause for aggregates, recently standardized in ANSI SQL. This replaces the standard CASE WHEN clause for reducing the number of rows included in an aggregation. For example, suppose you used CASE WHEN to break out average test scores by student, as shown in Example 7-23.

Example 7-23. CASE WHEN used in AVG

SELECT student,
    AVG(CASE WHEN subject ='algebra' THEN score ELSE NULL END) As algebra,
    AVG(CASE WHEN subject ='physics' THEN score ELSE NULL END) As physics
FROM test_scores
GROUP BY student;

The FILTER clause equivalent for Example 7-23 is shown in Example 7-24.

Example 7-24. FILTER used with AVG aggregate

SELECT student,
    AVG(score) FILTER (WHERE subject ='algebra') As algebra,
    AVG(score) FILTER (WHERE subject ='physics') As physics
FROM test_scores
GROUP BY student;

In the case of averages and sums and many other aggregates, the CASE and FILTER are equivalent. The benefit is that FILTER is a little clearer in purpose and for large datasets is faster. However, there are some aggregates—such as array_agg, which considers NULL fields—where the CASE statement gives you extra NULL values you don’t want. In Example 7-25 we try to get the list of scores for each subject of interest for each student using the CASE .. WHEN.. approach.

Example 7-25. CASE WHEN used in array_agg

SELECT student,
    array_agg(CASE WHEN subject ='algebra' THEN score ELSE NULL END) As algebra,
    array_agg(CASE WHEN subject ='physics' THEN score ELSE NULL END) As physics
FROM test_scores
GROUP BY student;

student | algebra                   | physics
--------+---------------------------+-------------------------------
jojo    | {74,NULL,NULL,NULL,74,..} | {NULL,83,NULL,NULL,NULL,79,..}
jdoe    | {75,NULL,NULL,NULL,78,..} | {NULL,72,NULL,NULL,NULL,72..}
robe    | {68,NULL,NULL,NULL,77,..} | {NULL,83,NULL,NULL,NULL,85,..}
lhsu    | {84,NULL,NULL,NULL,80,..} | {NULL,72,NULL,NULL,NULL,72,..}
(4 rows)

Observe that in Example 7-25 we get a bunch of NULL fields in our arrays. We could work around this issue with some clever use of subselects, but most of those will be more verbose and slower than the FILTER alternative shown in Example 7-26.

Example 7-26. FILTER used with array_agg

SELECT student,
    array_agg(score) FILTER (WHERE subject ='algebra') As algebra,
    array_agg(score) FILTER (WHERE subject ='physics') As physics
FROM test_scores
GROUP BY student;

student | algebra | physics
--------+---------+--------
jojo    | {74,74} | {83,79}
jdoe    | {75,78} | {72,72}
robe    | {68,77} | {83,85}
lhsu    | {84,80} | {72,72}

FILTER works for all aggregate functions, not just aggregate functions built into PostgreSQL.

Percentiles and Mode

New in PostgreSQL 9.4 are statistical functions for computing percentile, median (aka .5 percentile), and mode. These functions are percentile_disc (percentile discrete), percentile_cont (percentile continuous), and mode.

The two percentile functions differ in how they handle even counts. For the discrete function, the first value encountered is taken, so the ordering of the data matters. For the continuous case, values within the same percentile are averaged.

Median is merely the .5 percentile; therefore, it does not deserve a separate function of its own. The mode function finds the most common value. Should there be more than one mode, the first one encountered is returned; therefore, ordering matters, as shown in Example 7-27.

Example 7-27. Compute median and mode scores

SELECT
    student,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY score) As cont_median,
    percentile_disc(0.5) WITHIN GROUP (ORDER BY score) AS disc_median,
    mode() WITHIN GROUP (ORDER BY score) AS mode,
    COUNT(*) As num_scores
FROM test_scores
GROUP BY student
ORDER BY student;

student | cont_median | disc_median | mode | num_scores
--------+-------------+-------------+------+------------
alex    |          78 |          77 |   74 |          8
leo     |          72 |          72 |   72 |          8
regina  |          76 |          76 |   68 |          9
sonia   |        73.5 |          72 |   72 |          8
(4 rows)

Example 7-27 computes both the discrete and the continuous median score, which could differ when students have an even number of scores.

The inputs of these functions differ from other aggregate functions. The column being aggregated is the column in the ORDER BY clauses of the WITHIN GROUP modifiers. The column is not direct input to the function, as we’re used to seeing.

The percentile functions have another variant that accepts an array of percentiles, letting you retrieve multiple percentiles all in one call. Example 7-28 computes the median, the 60 percentile, and the highest score.

Example 7-28. Compute multiple percentiles

SELECT
    student,
    percentile_cont('{0.5,0.60,1}'::float[]) 
WITHIN GROUP (ORDER BY score) AS cont_median,
    percentile_disc('{0.5,0.60,1}'::float[]) 
WITHIN GROUP (ORDER BY score) AS disc_median,
COUNT(*) As num_scores
FROM test_scores
GROUP BY student
ORDER BY student;

student | cont_median    | disc_median | num_scores
--------+----------------+-------------+------------
alex    | {78,79.2,84}   | {77,79,84}  |          8
leo     | {72,73.6,84}   | {72,72,84}  |          8
regina  | {76,76.8,90}   | {76,77,90}  |          9
sonia   | {73.5,75.6,86} | {72,75,86}  |          8
(4 rows)

As with all aggregates, you can combine these functions with modifiers. Example 7-29 combines WITHIN GROUP with FILTER.

Example 7-29. Compute median score for two subjects

SELECT
    student,
    percentile_disc(0.5) WITHIN GROUP (ORDER BY score) 
        FILTER (WHERE subject = 'algebra') AS algebra,
    percentile_disc(0.5) WITHIN GROUP (ORDER BY score) 
        FILTER (WHERE subject = 'physics') AS physics
FROM test_scores
GROUP BY student
ORDER BY student;

student | algebra | physics
--------+---------+--------
alex    |      74 |      79
leo     |      80 |      72
regina  |      68 |      83
sonia   |      75 |      72
(4 rows)

Window Functions

Window functions are a common ANSI SQL feature. A window function has the prescience to see and use data beyond the current row; hence the term window. A window defines which other rows need to be considered in addition to the current row. Windows let you add aggregate information to each row of your output where the aggregation involves other rows in the same window. Window functions such as row_number and rank are useful for ordering your data in sophisticated ways that use rows outside the selected results but within a window.

Without window functions, you’d have to resort to using joins and subqueries to poll neighboring rows. On the surface, window functions violate the set-based principle of SQL, but we mollify the purist by claiming that they are merely shorthand. You can find more details and examples in Window Functions.

Example 7-30 gives you a quick start. Using a window function, we can obtain both the detail data and the average value for all records with fact_type_id of 86 in one single SELECT. Note that the WHERE clause is always evaluated before the window function.

Example 7-30. The basic window

SELECT tract_id, val, AVG(val) OVER () as val_avg
FROM census.facts
WHERE fact_type_id = 86;

tract_id    | val      | val_avg
------------+----------+----------------------
25001010100 | 2942.000 | 4430.0602165087956698
25001010206 | 2750.000 | 4430.0602165087956698
25001010208 | 2003.000 | 4430.0602165087956698
25001010304 | 2421.000 | 4430.0602165087956698
:

The OVER sets the boundary of the window. In this example, because the parentheses contain no constraint, the window covers all the rows in our WHERE. So the average is calculated across all rows with fact_type_id = 86. The clause also morphed our conventional AVG aggregate function into a window aggregate function. For each row, PostgreSQL submits all the rows in the window to the AVG aggregation and outputs the value as part of the row. Because our window has multiple rows, the result of the aggregation is repeated. Notice that with window functions, we were able to perform an aggregation without GROUP BY. Furthermore, we were able to rejoin the aggregated result back with the other variables without using a formal join.

You can use all SQL aggregate functions as window functions. In addition, you’ll find ROW, RANK, LEAD, and others listed in Window Functions.

PARTITION BY

You can run a window function over rows containing particular values instead of using the whole table. This requires the addition of a PARTITION BY clause, which instructs PostgreSQL to take the aggregate over the indicated rows. In Example 7-31, we repeat what we did in Example 7-30 but partition our window by county code, which is always the first five characters of the tract_id column. Thus, the rows in each county code are averaged separately.

Example 7-31. Partitioning our window by county code

SELECT tract_id, val, AVG(val) OVER (PARTITION BY left(tract_id,5)) As val_avg_county
FROM census.facts
WHERE fact_type_id = 2 ORDER BY tract_id;

tract_id    | val      | val_avg_county
------------+----------+----------------------
25001010100 | 1765.000 | 1709.9107142857142857
25001010206 | 1366.000 | 1709.9107142857142857
25001010208 |  984.000 | 1709.9107142857142857
:
25003900100 | 1920.000 | 1438.2307692307692308
25003900200 | 1968.000 | 1438.2307692307692308
25003900300 | 1211.000 | 1438.2307692307692308
:

ORDER BY

Window functions also allow an ORDER BY in the OVER clause. Without getting too abstruse, the best way to think about this is that all the rows in the window will be ordered as indicated by ORDER BY, and the window function will consider only rows that range from the first row in the window up to and including the current row in the window or partition. The classic example uses the ROW_NUMBER function to sequentially number rows. In Example 7-32, we demonstrate how to number our census tracts in alphabetical order. To arrive at the row number, ROW_NUMBER counts all rows up to and including the current row based on the order dictated by the ORDER BY.

Example 7-32. Numbering using the ROW_NUMBER window function

SELECT ROW_NUMBER() OVER (ORDER BY tract_name) As rnum, tract_name
FROM census.lu_tracts
ORDER BY rnum LIMIT 4;

rnum | tract_name
-----+-------------------------------------------------
1    | Census Tract 1, Suffolk County, Massachusetts
2    | Census Tract 1001, Suffolk County, Massachusetts
3    | Census Tract 1002, Suffolk County, Massachusetts
4    | Census Tract 1003, Suffolk County, Massachusetts

In Example 7-32, we also have an ORDER BY for the entire query. Don’t get confused between this and the ORDER BY that’s specific to the window function.

You can combine ORDER BY with PARTITION BY, restarting the ordering for each partition. Example 7-33 returns to our example of county codes.

Example 7-33. Combining PARTITION BY and ORDER BY

SELECT tract_id, val,
    SUM(val) OVER (PARTITION BY left(tract_id,5) ORDER BY val) As sum_county_ordered
FROM census.facts
WHERE fact_type_id = 2
ORDER BY left(tract_id,5), val;

tract_id    | val     | sum_county_ordered
-------------+----------+-----------------
25001014100 | 226.000 |            226.000
25001011700 | 971.000 |           1197.000
25001010208 | 984.000 |           2181.000
:
25003933200 | 564.000 |            564.000
25003934200 | 593.000 |           1157.000
25003931300 | 606.000 |           1763.000
:

The key observation to make in the output is how the sum changes from row to row. The ORDER BY clause means that the sum will be taken only from the beginning of the partition to the current row, giving you a running total, where the location of the current row in the list is dictated by the ORDER BY clause. For instance, if your row is in the fifth row in the third partition, the sum will cover only the first five rows in the third partition. We put an ORDER BY left(tract_id,5), val at the end of the query so you can easily see the pattern, but keep in mind that the ORDER BY of the query is independent of the ORDER BY in each OVER clause.

You can explicitly control the rows under consideration by adding a RANGE or ROWS clause: ROWS BETWEEN CURRENT ROW AND 5 FOLLOWING.

PostgreSQL also supports window naming, which is useful if you have the same window for each of your window columns. Example 7-34 demonstrates how to name windows as well as how to use the LEAD and LAG window functions to show a record value before and after for a given partition.

Example 7-34. Naming windows, demonstrating LEAD and LAG

SELECT * FROM (
    SELECT
        ROW_NUMBER() OVER( wt ) As rnum, 
        substring(tract_id,1, 5) As county_code,
        tract_id,
        LAG(tract_id,2) OVER wt As tract_2_before,
        LEAD(tract_id) OVER wt As tract_after
    FROM census.lu_tracts
    WINDOW wt AS (PARTITION BY substring(tract_id,1, 5) ORDER BY tract_id) 
) As x
WHERE rnum BETWEEN 2 and 3 AND county_code IN ('25007','25025')
ORDER BY county_code, rnum;

rnum | county_code |  tract_id   | tract_2_before | tract_after
-----+-------------+-------------+----------------+------------
2    | 25007       | 25007200200 |                | 25007200300
3    | 25007       | 25007200300 | 25007200100    | 25007200400
2    | 25025       | 25025000201 |                | 25025000202
3    | 25025       | 25025000202 | 25025000100    | 25025000301

: Naming our window wt window.
: Using our window name instead of retyping.

Both LEAD and LAG take an optional step argument that defines how many rows to skip forward or backward; the step can be positive or negative. LEAD and LAG return NULL when trying to retrieve rows outside the window partition. This is a possibility that you always have to account for.

In PostgreSQL, any aggregate function you create can be used as a window function. Other databases tend to limit window functions to using built-in aggregates such as AVG, SUM, MIN, and MAX.

Common Table Expressions

Essentially, common table expressions (CTEs) allow you to define a query that can be reused in a larger query. CTEs act as temporary tables defined within the scope of the statement; they’re gone once the enclosing statement has finished executing.

There are three ways to use CTEs:

Basic CTE: This is your plain-vanilla CTE, used to make your SQL more readable or to encourage the planner to materialize a costly intermediate result for better performance.
Writable CTE: This is an extension of the basic CTE with UPDATE, INSERT, and DELETE commands. A common final step in the CTE is to return changed rows.
Recursive CTE: This puts an entirely new whirl on standard CTE. The rows returned by a recursive CTE vary during the execution of the query.

PostgreSQL allows you to have a CTE that is both writable and recursive.

Basic CTEs

The basic CTE looks like Example 7-35. The WITH keyword introduces the CTE.

Example 7-35. Basic CTE

WITH cte AS (
    SELECT
        tract_id, substring(tract_id,1, 5) As county_code,
        COUNT(*) OVER(PARTITION BY substring(tract_id,1, 5)) As cnt_tracts
    FROM census.lu_tracts
)
SELECT MAX(tract_id) As last_tract, county_code, cnt_tracts
FROM cte
WHERE cnt_tracts > 100
GROUP BY county_code, cnt_tracts;

cte is the name of the CTE in Example 7-35, defined using a SELECT statement to contain three columns: tract_id, county_code, and cnt_tracts. The main SELECT refers to the CTE.

You can stuff as many CTEs as you like, separated by commas, into the WITH clause, as shown in Example 7-36. The order of the CTEs matters in that CTEs defined later can call CTEs defined earlier, but not vice versa.

Example 7-36. Multiple CTEs

WITH
    cte1 AS (
        SELECT
            tract_id,
            substring(tract_id,1, 5) As county_code,
            COUNT(*) OVER (PARTITION BY substring(tract_id,1,5)) As cnt_tracts
        FROM census.lu_tracts
    ),
    cte2 AS (
        SELECT
            MAX(tract_id) As last_tract,
            county_code,
            cnt_tracts
        FROM cte1
        WHERE cnt_tracts < 8 GROUP BY county_code, cnt_tracts
    )
SELECT c.last_tract, f.fact_type_id, f.val
FROM census.facts As f INNER JOIN cte2 c ON f.tract_id = c.last_tract;

Writable CTEs

The writable CTE extends the CTE to allow for update, delete, and insert statements. We’ll revisit our logs tables that we created in Example 6-3, adding another child table and populating it:

CREATE TABLE logs_2011_01_02 (
    PRIMARY KEY (log_id),
    CONSTRAINT chk
        CHECK (log_ts >= '2011-01-01' AND log_ts < '2011-03-01')
)
INHERITS (logs_2011);

In Example 7-37, we move data from our parent 2011 table to our new child Jan-Feb 2011 table. The ONLY keyword is described in “Restricting DELETE, UPDATE, and SELECT from Inherited Tables” and the RETURNING keyword in “Returning Affected Records to the User”.

Example 7-37. Writable CTE moving data from one branch to another

WITH t AS (
    DELETE FROM ONLY logs_2011 WHERE log_ts < '2011-03-01' RETURNING *
)
INSERT INTO logs_2011_01_02 SELECT * FROM t;

Recursive CTE

The official documentation for PostgreSQL describes it best: “The optional RECURSIVE modifier changes CTE from a mere syntactic convenience into a feature that accomplishes things not otherwise possible in standard SQL.” A more interesting CTE is one that uses a recursively defining construct to build an expression. PostgreSQL recursive CTEs utilize UNION ALL to combine tables, a kind of combination that can be done repeatedly as the query adds the tables over and over.

To turn a basic CTE to a recursive one, add the RECURSIVE modifier after the WITH. WITH RECURSIVE can contain a mix of recursive and nonrecursive table expressions. In most other databases, the RECURSIVE keyword is not necessary to denote recursion.

A common use of recursive CTEs is to represent message threads and other tree-like structures. We have an example of this in Recursive CTE to Display Tree Structures.

In Example 7-38, we query the system catalog to list the cascading table relationships we have in our database.

Example 7-38. Recursive CTE

WITH RECURSIVE tbls AS (
    SELECT
        c.oid As tableoid,
        n.nspname AS schemaname,
        c.relname AS tablename 
    FROM
        pg_class c LEFT JOIN
        pg_namespace n ON n.oid = c.relnamespace LEFT JOIN
        pg_tablespace t ON t.oid = c.reltablespace LEFT JOIN
        pg_inherits As th ON th.inhrelid = c.oid
    WHERE
        th.inhrelid IS NULL AND
        c.relkind = 'r'::"char" AND c.relhassubclass
    UNION ALL
    SELECT
        c.oid As tableoid,
        n.nspname AS schemaname,
        tbls.tablename || '->' || c.relname AS tablename   
    FROM
        tbls INNER JOIN
        pg_inherits As th ON th.inhparent = tbls.tableoid INNER JOIN
        pg_class c ON th.inhrelid = c.oid LEFT JOIN
        pg_namespace n ON n.oid = c.relnamespace LEFT JOIN
    pg_tablespace t ON t.oid = c.reltablespace
)
SELECT * FROM tbls ORDER BY tablename;

tableoid | schemaname | tablename
---------+------------+---------------------------------------
3152249  | public     | logs
3152260  | public     | logs->logs_2011
3152272  | public     | logs->logs_2011->logs_2011_01_02

: Get a list of all tables that have child tables but no parent table.
: This is the recursive part; it gets all children of tables in tbls.
: The names of the child tables start with the parental name.
: Return parents and all child tables. Because we sort by table name, which prepends the parent name, all child tables will follow their parents in their output.

Lateral Joins

LATERAL is a new ANSI SQL construction in version 9.3. Here’s the motivation behind it: suppose you perform joins on two tables or subqueries; normally, the pair participating in the join are independent units and can’t read data from each other. For example, the following interaction would generate an error because l.yr = 2011 is not a column on the righthand side of the join:

SELECT *
    FROM
        census.facts L
        INNER JOIN
        (
            SELECT *
            FROM census.lu_fact_types
            WHERE category = CASE WHEN L.yr = 2011 
THEN 'Housing' ELSE category END
        ) R
    ON L.fact_type_id = R.fact_type_id;

Now add the LATERAL keyword, and the error is gone:

SELECT *
    FROM
        census.facts L INNER JOIN LATERAL
        (
            SELECT *
            FROM census.lu_fact_types
            WHERE category = CASE WHEN L.yr = 2011 
THEN 'Housing' ELSE category END
        ) R
        ON L.fact_type_id = R.fact_type_id;

LATERAL lets you share data in columns across two tables in a FROM clause. However, it works only in one direction: the righthand side can draw from the lefthand side, but not vice versa.

There are situations when you should avail yourself of LATERAL to avoid extremely convoluted syntax. In Example 7-39, a column on the left serves as a parameter in the generate_series function on the right:

CREATE TABLE interval_periods(i_type interval);
INSERT INTO interval_periods (i_type)
VALUES ('5 months'), ('132 days'), ('4862 hours');

Example 7-39. Using LATERAL with generate_series

SELECT i_type, dt
FROM
    interval_periods CROSS JOIN LATERAL
    generate_series('2012-01-01'::date, '2012-12-31'::date, i_type) AS dt
WHERE NOT (dt = '2012-01-01' AND i_type = '132 days'::interval);

i_type      | dt
------------+-----------------------
 5 mons     | 2012-01-01 00:00:00-05
 5 mons     | 2012-06-01 00:00:00-04
 5 mons     | 2012-11-01 00:00:00-04
 132 days   | 2012-05-12 00:00:00-04
 132 days   | 2012-09-21 00:00:00-04
 4862:00:00 | 2012-01-01 00:00:00-05
 4862:00:00 | 2012-07-21 15:00:00-04

Lateral is also helpful for using values from the lefthand side to limit the number of rows returned from the righthand side. Example 7-40 uses LATERAL to return, for each superuser who has used our site within the last 100 days, the last five logins and what they were up to. Tables used in this example were created in “TYPE OF” and “Basic Table Creation”.

Example 7-40. Using LATERAL to limit rows from a joined table

SELECT u.user_name, l.description, l.log_ts
FROM
    super_users AS u CROSS JOIN LATERAL (
    SELECT description, log_ts
    FROM logs
    WHERE
        log_ts > CURRENT_TIMESTAMP - interval '100 days' AND
        logs.user_name = u.user_name
    ORDER BY log_ts DESC LIMIT 5
    ) AS l;

Although you can achieve the same results by using window functions, lateral joins yield faster results with more succinct syntax.

You can use multiple lateral joins in your SQL and even chain them in sequence as you would when joining more than two subqueries. You can sometimes get away with omitting the LATERAL keyword; the query parser is smart enough to figure out a lateral join if you have a correlated expression. But we advise that you always include the keyword for the sake of clarity. Also, you’ll get an error if you write your statement assuming the use of a lateral join but run the statement on a prelateral version PostgreSQL. Without the keyword, PostgreSQL might end up performing a join with unintended results.

Other database products also offer lateral joins, although they don’t abide by the ANSI moniker. In Oracle, you’d use a table pipeline construct. In SQL Server, you’d use CROSS APPLY or OUTER APPLY.

WITH ORDINALITY

Introduced in version 9.4, the WITH ORDINALITY clause is an SQL ANSI standard construct. WITH ORDINALITY adds a sequential number column to a set-returning function result.

Note

Although you can’t use WITH ORDINALITY with tables and subqueries, you can achieve the same result for those by using the window function ROW_NUMBER.

You’ll find WITH ORDINALITY often used with functions like generate_series, unnest, and other functions that expand out composite types and arrays. It can be used with any set-returning function, including ones you create yourself.

Example 7-41 demonstrates WITH ORDINALITY used in conjunction with the temporal variant of the generate_series function.

Example 7-41. Numbering results from set-returning functions

SELECT dt.*
FROM generate_series('2016-01-01'::date,'2016-12-31'::date,interval '1 month')
WITH ORDINALITY As dt;

dt                     | ordinality
-----------------------+-----------
2016-01-01 00:00:00-05 |          1
2016-02-01 00:00:00-05 |          2
2016-03-01 00:00:00-05 |          3
2016-04-01 00:00:00-04 |          4
2016-05-01 00:00:00-04 |          5
2016-06-01 00:00:00-04 |          6
2016-07-01 00:00:00-04 |          7
2016-08-01 00:00:00-04 |          8
2016-09-01 00:00:00-04 |          9
2016-10-01 00:00:00-04 |         10
2016-11-01 00:00:00-04 |         11
2016-12-01 00:00:00-05 |         12
(12 rows)

WITH ORDINALITY always adds an additional column at the end of the result called ordinality, and WITH ORDINALITY can only appear in the FROM clause of an SQL statement. You are free to rename the ordinality column.

You’ll often find WITH ORDINALITY paired with the LATERAL construct. In Example 7-42 we repeat the LATERAL in Example 7-39, but add on a sequential number to each set.

Example 7-42. Using WITH ORDINALITY with LATERAL

SELECT d.ord, i_type, d.dt
FROM
    interval_periods CROSS JOIN LATERAL
    generate_series('2012-01-01'::date, '2012-12-31'::date, i_type)
WITH ORDINALITY AS d(dt,ord)
WHERE NOT (dt = '2012-01-01' AND i_type = '132 days'::interval);

ord | i_type     | dt
----+------------+-----------------------
  1 | 5 mons     | 2012-01-01 00:00:00-05
  2 | 5 mons     | 2012-06-01 00:00:00-04
  3 | 5 mons     | 2012-11-01 00:00:00-04
  2 | 132 days   | 2012-05-12 00:00:00-04
  3 | 132 days   | 2012-09-21 00:00:00-04
  1 | 4862:00:00 | 2012-01-01 00:00:00-05
  2 | 4862:00:00 | 2012-07-21 15:00:00-04
(7 rows)

In Example 7-42, WITH ORDINALITY gets applied to the result of the set-returning function. It always gets applied before the WHERE condition. As a result, there is a gap in numbering in the final result (the number 1 is lacking for the 132 day interval), because the number was filtered out by our WHERE condition.

If we didn’t have the WHERE condition excluding the 2012-01-01, 132 day record, we would have 8 rows with the 4th row being 1 | 132 days | 2012-01-01 00:00:00-04

GROUPING SETS, CUBE, ROLLUP

If you’ve ever tried to create a summary report that includes both totals and subtotals, you’ll appreciate the capability to partition your data on the fly. Grouping sets let you do exactly that.

For our table of test scores, if we need to find both the overall average per student and the average per student by subject, we could write a query as shown in Example 7-43, taking advantage of grouping sets.

Example 7-43. Avg score for each student and student in subject

SELECT student, subject, AVG(score)::numeric(10,2)
FROM test_scores
WHERE student IN ('leo','regina')
GROUP BY GROUPING SETS ((student),(student,subject))
ORDER BY student, subject NULLS LAST;

 student |  subject  |  avg
---------+-----------+-------
 leo     | algebra   | 82.00
 leo     | calculus  | 65.50
 leo     | chemistry | 75.50
 leo     | physics   | 72.00
 leo     | NULL      | 73.75
 regina  | algebra   | 72.50
 regina  | calculus  | 64.50
 regina  | chemistry | 73.50
 regina  | economics | 90.00
 regina  | physics   | 84.00
 regina  | NULL      | 75.44
(11 rows)

In a single query, Example 7-43 gives us both the average of each student across all subjects and his or her average in each subject.

We can even include a total for each subject across all students by having multiple grouping sets as shown in Example 7-44.

Example 7-44. Avg score for each student, student in subject, and subject

SELECT student, subject, AVG(score)::numeric(10,2)
FROM test_scores
WHERE student IN ('leo','regina')
GROUP BY GROUPING SETS ((student,subject),(student),(subject))
ORDER BY student NULLS LAST, subject NULLS LAST;

 student |  subject  |  avg
---------+-----------+-------
 leo     | algebra   | 82.00
 leo     | calculus  | 65.50
 leo     | chemistry | 75.50
 leo     | physics   | 72.00
 leo     | NULL      | 73.75
 regina  | algebra   | 72.50
 regina  | calculus  | 64.50
 regina  | chemistry | 73.50
 regina  | economics | 90.00
 regina  | physics   | 84.00
 regina  | NULL      | 75.44
 NULL    | algebra   | 77.25
 NULL    | calculus  | 65.00
 NULL    | chemistry | 74.50
 NULL    | economics | 90.00
 NULL    | physics   | 78.00
(16 rows)

What if we wanted to have total breakdowns for student, student plus subject, and overall average? We could revise our query to add a universal grouping set GROUPING SETS ((student),(student, subject),()). This is equivalent to the shorthand ROLLUP (student, subject). See Example 7-45.

Example 7-45. Avg score for each student in subject, student, and overall

SELECT student, subject, AVG(score)::numeric(10,2)
FROM test_scores
WHERE student IN ('leo','regina')
GROUP BY ROLLUP (student,subject)
ORDER BY student NULLS LAST, subject NULLS LAST;

 student |  subject  |  avg
---------+-----------+-------
 leo     | algebra   | 82.00
 leo     | calculus  | 65.50
 leo     | chemistry | 75.50
 leo     | physics   | 72.00
 leo     | NULL      | 73.75
 regina  | algebra   | 72.50
 regina  | calculus  | 64.50
 regina  | chemistry | 73.50
 regina  | economics | 90.00
 regina  | physics   | 84.00
 regina  | NULL      | 75.44
 NULL    | NULL      | 74.65
(12 rows)

If we reverse the order of columns in ROLLUP, we get the score for each student/subject pair, average for each subject, and overall average as shown in Example 7-46.

Example 7-46. Avg score for each student in subject, subject, and overall

SELECT student, subject, AVG(score)::numeric(10,2)
FROM test_scores
WHERE student IN ('leo','regina')
GROUP BY ROLLUP (subject,student)
ORDER BY student NULLS LAST, subject NULLS LAST;

 student |  subject  |  avg
---------+-----------+-------
 leo     | algebra   | 82.00
 leo     | calculus  | 65.50
 leo     | chemistry | 75.50
 leo     | physics   | 72.00
 regina  | algebra   | 72.50
 regina  | calculus  | 64.50
 regina  | chemistry | 73.50
 regina  | economics | 90.00
 regina  | physics   | 84.00
 NULL    | algebra   | 77.25
 NULL    | calculus  | 65.00
 NULL    | chemistry | 74.50
 NULL    | economics | 90.00
 NULL    | physics   | 78.00
 NULL    | NULL      | 74.65
(15 rows)

If we also wanted to include subtotals for just the subject and just the student, we’d use GROUPING SETS ( (student), (student, subject), (subject), () ), or the shorthand CUBE (student, subject) in Example 7-47.

Example 7-47. Avg score for each student, student in subject, subject, and overall

SELECT student, subject, AVG(score)::numeric(10,2)
FROM test_scores
WHERE student IN ('leo','regina')
GROUP BY  CUBE (student, subject)
ORDER BY student NULLS LAST, subject NULLS LAST;

 student |  subject  |  avg
---------+-----------+-------
 leo     | algebra   | 82.00
 leo     | calculus  | 65.50
 leo     | chemistry | 75.50
 leo     | physics   | 72.00
 leo     | NULL      | 73.75
 regina  | algebra   | 72.50
 regina  | calculus  | 64.50
 regina  | chemistry | 73.50
 regina  | economics | 90.00
 regina  | physics   | 84.00
 regina  | NULL      | 75.44
 NULL    | algebra   | 77.25
 NULL    | calculus  | 65.00
 NULL    | chemistry | 74.50
 NULL    | economics | 90.00
 NULL    | physics   | 78.00
 NULL    | NULL      | 74.65
(17 rows)