Chapter 2. Database Administration

This chapter covers what we consider basic administration of a PostgreSQL server: managing roles and permissions, creating databases, installing extensions, and backing up and restoring data. Before continuing, you should have already installed PostgreSQL and have administration tools at your disposal.

Configuration Files

Three main configuration files control operations of a PostgreSQL server:

postgresql.conf

Controls general settings, such as memory allocation, default storage location for new databases, the IP addresses that PostgreSQL listens on, location of logs, and plenty more.

pg_hba.conf

Controls access to the server, dictating which users can log in to which databases, which IP addresses can connect, and which authentication scheme to accept.

pg_ident.conf

If present, this file maps an authenticated OS login to a PostgreSQL user. People sometimes map the OS root account to the PostgresSQL superuser account, postgres.

Note

PostgreSQL officially refers to users as roles. Not all roles need to have login privileges. For example, group roles often do not. We use the term user to refer to a role with login privileges.

If you accepted default installation options, you will find these configuration files in the main PostgreSQL data folder. You can edit them using any text editor or the Admin Pack in pgAdmin. Instructions for editing with pgAdmin are in “Editing postgresql.conf and pg_hba.conf from pgAdmin3”. If you are unable to find the physical location of these files, run the Example 2-1 query as a superuser while connected to any database.

Example 2-1. Location of configuration files
SELECT name, setting FROM pg_settings WHERE category = 'File Locations';
       name        |                 setting
-------------------+------------------------------------------
 config_file       | /etc/postgresql/9.6/main/postgresql.conf
 data_directory    | /var/lib/postgresql/9.6/main
 external_pid_file | /var/run/postgresql/9.6-main.pid
 hba_file          | /etc/postgresql/9.6/main/pg_hba.conf
 ident_file        | /etc/postgresql/9.6/main/pg_ident.conf
(5 rows)

Making Configurations Take Effect

Some configuration changes require a PostgreSQL service restart, which closes any active connections from clients. Other changes require just a reload. New users connecting after a reload will receive the new setting. Extant users with active connections will not be affected during a reload. If you’re not sure whether a configuration change requires a reload or restart, look under the context setting associated with a configuration. If the context is postmaster, you’ll need a restart. If the context is user, a reload will suffice.

The postgresql.conf File

postgresql.conf controls the life-sustaining settings of the PostgreSQL server. You can override many settings at the database, role, session, and even function levels. You’ll find many details on how to finetune your server by tweaking settings in the article Tuning Your PostgreSQL Server.

Version 9.4 introduced an important change: instead of editing postgresql.conf directly, you should override settings using an additional file called postgresql.auto.conf. We further recommend that you don’t touch the postgresql.conf and place any custom settings in postgresql.auto.conf.

Checking postgresql.conf settings

An easy way to read the current settings without opening the configuration files is to query the view named pg_settings. We demonstrate in Example 2-2.

Example 2-2. Key settings
SELECT
    name,
    context 1,
    unit 2,
    setting, boot_val, reset_val 3
FROM pg_settings
WHERE name IN ('listen_addresses','deadlock_timeout','shared_buffers',
    'effective_cache_size','work_mem','maintenance_work_mem')
ORDER BY context, name;
name                 | context    | unit | setting | boot_val  | reset_val
---------------------+------------+------+-------- +-----------+----------
listen_addresses     | postmaster |      | *       | localhost | *
shared_buffers       | postmaster | 8kB  | 131584  | 1024      | 131584
deadlock_timeout     | superuser  | ms   | 1000    | 1000      | 1000
effective_cache_size | user       | 8kB  | 16384   | 16384     | 16384
maintenance_work_mem | user       | kB   | 16384   | 16384     | 16384
work_mem             | user       | kB   | 5120    | 1024      | 5120
1

The context is the scope of the setting. Some settings have a wider effect than others, depending on their context.

User settings can be changed by each user to affect just that user’s sessions. If set by the superuser, the setting becomes a default for all users who connect after a reload.

Superuser settings can be changed only by a superuser, and will apply to all users who connect after a reload. Users cannot individually override the setting.

Postmaster settings affect the entire server (postmaster represents the PostgreSQL service) and take effect only after a restart.

Settings with user or superuser context can be set for a specific database, user, session, and function level. For example, you might want to set work_mem higher for an SQL guru-level user who writes mind-boggling queries. Similarly, if you have one function that is sort-intensive, you could raise work_mem just for it. Settings set at database, user, session, and function levels do not require a reload. Settings set at the database level take effect on the next connect to the database. Settings set for the session or function take effect right away.

2

Be careful checking the units of measurement used for memory. As you can see in Example 2-2, some are reported in 8-KB blocks and some just in kilobytes. Regardless of how a setting displays, you can use any unit of choice when setting; 128 MB is a versatile choice for most memory settings.

Showing units as 8 KB is annoying at best and is destabilizing at worst. The SHOW command in SQL offers display settings in labeled and more intuitive units. For example, running:

SHOW shared_buffers;

returns 1028MB. Similarly, running:

SHOW deadlock_timeout;

returns 1s. If you want to see the units for all settings, enter SHOW ALL.

3

setting is the current setting; boot_val is the default setting; reset_val is the new setting if you were to restart or reload the server. Make sure that setting and reset_val match after you make a change. If not, the server needs a restart or reload.

New in version 9.5 is a system view called pg_file_settings, which you can use to query settings. Its output lists the source file where the settings can be found. The applied tells you whether the setting is in effect; if the setting has an f in that column you need to reload or restart to make it take effect. In cases where a particular setting is present in both postgresql.conf and postgresql.auto.conf, the postgresql.auto.conf one will take precedent and you’ll see the other files with applied set to false (f). The applied is shown in Example 2-3.

Example 2-3. Querying pg_file_settings
SELECT name, sourcefile, sourceline, setting, applied
FROM pg_file_settings
WHERE name IN ('listen_addresses','deadlock_timeout','shared_buffers',
    'effective_cache_size','work_mem','maintenance_work_mem')
ORDER BY name;
name                 | sourcefile                    | sourceline | setting | applied
---------------------+-------------------------------+------------+---------+--------
effective_cache_size | E:/data96/postgresql.auto.conf|         11 | 8GB     | t
listen_addresses     | E:/data96/postgresql.conf     |         59 | *       | t
maintenance_work_mem | E:/data96/postgresql.auto.conf|          3 | 16MB    | t
shared_buffers       | E:/data96/postgresql.conf     |        115 | 128MB   | f
shared_buffers       | E:/data96/postgresql.auto.conf|          5 | 131584  | t

Pay special attention to the following network settings in postgresql.conf or postgresql.auto.conf, because an incorrect entry here will prevent clients from connecting. Changing their values requires a service restart:

listen_addresses

Informs PostgreSQL which IP addresses to listen on. This usually defaults to local (meaning a socket on the local system), or localhost, meaning the IPv6 or IPv4 localhost IP address. But many people change the setting to *, meaning all available IP addresses.

port

Defaults to 5432. You may wish to change this well-known port to something else for security or if you are running multiple PostgreSQL services on the same server.

max_connections

The maximum number of concurrent connections allowed.

log_destination

This setting is somewhat a misnomer. It specifies the format of the logfiles rather than their physical location. The default is stderr. If you intend to perform extensive analysis on your logs, we suggest changing it to csvlog, which is easier to export to third-party analytic tools. Make sure you have the logging_collection set to on if you want logging.

The following settings affect performance. Defaults are rarely the optimal value for your installation. As soon as you gain enough confidence to tweak configuration settings, you should tune these values:

shared_buffers

Allocated amount of memory shared among all connections to store recently accessed pages. This setting profoundly affects the speed of your queries. You want this setting to be fairly high, probably as much as 25% of your RAM. However, you’ll generally see diminishing returns after more than 8 GB. Changes require a restart.

effective_cache_size

An estimate of how much memory PostgreSQL expects the operating system to devote to it. This setting has no effect on actual allocation, but the query planner figures in this setting to guess whether intermediate steps and query output would fit in RAM. If you set this much lower than available RAM, the planner may forgo using indexes. With a dedicated server, setting the value to half of your RAM is a good starting point. Changes require a reload.

work_mem

Controls the maximum amount of memory allocated for each operation such as sorting, hash join, and table scans. The optimal setting depends on how you’re using the database, how much memory you have to spare, and whether your server is dedicated to PostgreSQL. If you have many users running simple queries, you want this setting to be relatively low to be democratic; otherwise, the first user may hog all the memory. How high you set this also depends on how much RAM you have to begin with. A good article to read for guidance is Understanding work_mem. Changes require a reload.

maintenance_work_mem

The total memory allocated for housekeeping activities such as vacuuming (pruning records marked for deletion). You shouldn’t set it higher than about 1 GB. Reload after changes.

max_parallel_workers_per_gather

This is a new setting introduced in 9.6 for parallelism. The setting determines the maximum parallel worker threads that can be spawned for each gather operation. The default setting is 0, which means parallelism is completely turned off. If you have more than one CPU core, you will want to elevate this. Parallel processing is new in version 9.6, so you may have to experiment with this setting to find what works best for your server. Also note that the number you have here should be less than max_worker_processes, which defaults to 8 because the parallel background worker processes are a subset of the maximum allowed processes.

In version 10, there is an additional setting called max_parallel_workers, which controls the subset of max_worker_processes allocated for parallelization.

The pg_hba.conf File

The pg_hba.conf file controls which IP addresses and users can connect to the database. Furthermore, it dictates the authentication protocol that the client must follow. Changes to the file require at least a reload to take effect. A typical pg_hba.conf looks like Example 2-4.

Example 2-4. Sample pg_hba.conf
# TYPE  DATABASE USER ADDRESS         METHOD
host    all      all  127.0.0.1/32    ident 1
host    all      all  ::1/128         trust 2
host    all      all  192.168.54.0/24 md5 3
hostssl all      all  0.0.0.0/0       md5 4

# TYPE DATABASE    USER     ADDRESS      METHOD
# Allow replication connections from localhost, 
# by a user with replication privilege. 5
#host  replication postgres 127.0.0.1/32 trust
#host  replication postgres ::1/128      trust
1

Authentication method. The usual choices are ident, trust, md5, peer, and password.

2

IPv6 syntax for defining network range. This applies only to servers with IPv6 support and may prevent pg_hba.conf from loading if you add this section without actually having IPv6 networking enabled on the server.

3

IPv4 syntax for defining network range. The first part is the network address followed by the bit mask; for instance: 192.168.54.0/24. PostgreSQL will accept connection requests from any IP address within the range.

4

SSL connection rule. In our example, we allow anyone to connect to our server outside of the allowed IP range as long as they can connect using SSL.

SSL configuration settings can be found in postgres.conf or postgres.auto.conf: ssl, ssl_cert_file, ssl_key_file. Once the server confirms that the client is able to support SSL, it will honor the connection request and all transmissions will be encrypted using the key information.

5

Range of IP addresses allowed to replicate with this server.

For each connection request, pg_hba.conf is checked from the top down. As soon as a rule granting access is encountered, a connection is allowed and the server reads no further in the file. As soon as a rule rejecting access is encountered, the connection is denied and the server reads no further in the file. If the end of the file is reached without any matching rules, the connection is denied. A common mistake people make is to put the rules in the wrong order. For example, if you added 0.0.0.0/0 reject before 127.0.0.1/32 trust, local users won’t be able to connect, even though a rule is in place allowing them to.

New in version 10 is the pg_hba_file_rules system view that lists all the contents of the pg_hba.conf file.

Authentication methods

PostgreSQL gives you many choices for authenticating users—probably more than any other database product. Most people are content with the popular ones: trust, peer, ident, md5, and password. And don’t forget about reject, which immediately denies access. Also keep in mind that pg_hba.conf offers settings at many other levels as the gatekeeper to the entire PostgreSQL server. Users or devices must still satisfy role and database access restrictions after being admitted by pg_hba.conf.

We describe the common authentication methods here:

trust

This is the least secure authentication, essentially no password is needed. As long as the user and database exist in the system and the request comes from an IP within the allowed range, the user can connect. You should implement trust only for local connections or private network connections. Even then it’s possible for someone to spoof IP addresses, so the more security-minded among us discourage its use entirely. Nevertheless, it’s the most common for PostgreSQL installed on a desktop for single-user local access where security is not a concern.

md5

Very common, requires an md5-encrypted password to connect.

password

Uses clear-text password authentication.

ident

Uses pg_ident.conf to check whether the OS account of the user trying to connect has a mapping to a PostgreSQL account. The password is not checked. ident is not available on Windows.

peer

Uses the OS name of the user from the kernel. It is available only for Linux, BSD, macOS, and Solaris, and only for local connections on these systems.

cert

Stipulates that connections use SSL. The client must have a registered certificate. cert uses an ident file such as pg_ident to map the certificate to a PostgreSQL user and is available on all platforms where SSL connection is enabled.

More esoteric options abound, such as gss, radius, ldap, and pam. Some may not always be installed by default.

You can elect more than one authentication method, even for the same database. Keep in mind that pg_hba.conf is processed from top to bottom.

Managing Connections

More often than not, someone else (never you, of course) will execute an inefficient query that ends up hogging resources. They could also run a query that’s taking much longer than what they have patience for. Cancelling the query, terminating the connection, or both will put an end to the offending query.

Cancelling and terminating are far from graceful and should be used sparingly. Your client application should prevent queries from going haywire in the first place. Out of politeness, you probably should alert the connected role that you’re about to terminate its connection, or wait until after hours to do the dirty deed.

There are few scenarios where you should cancel all active update queries: before backing up the database and before restoring the database.

To cancel running queries and terminate connections, follow these steps:

  1. Retrieve a listing of recent connections and process IDs (PIDs):

    SELECT * FROM pg_stat_activity;

    pg_stat_activity is a view that lists the last query running on each connection, the connected user (usename), the database (datname) in use, and the start times of the queries. Review the list to identify the PIDs of connections you wish to terminate.

  2. Cancel active queries on a connection with PID 1234:

    SELECT pg_cancel_backend(1234);

    This does not terminate the connection itself, though.

  3. Terminate the connection:

    SELECT pg_terminate_backend(1234);

    You may need to take the additional step of terminating the client connection. This is especially important prior to a database restore. If you don’t terminate the connection, the client may immediately reconnect after restore and run the offending query anew. If you did not already cancel the queries on the connection, terminating the connection will cancel all of its queries.

PostgreSQL lets you embed functions within a regular SELECT statement. Even though pg_terminate_backend and pg_cancel_backend act on only one connection at a time, you can kill multiple connections by wrapping them in a SELECT. For example, let’s suppose you want to kill all connections belonging to a role with a single blow. Run this SQL command:

SELECT pg_terminate_backend(pid) FROM pg_stat_activity 
WHERE usename = 'some_role';

You can set certain operational parameters at the server, database, user, session, or function level. Any queries that exceed the parameter will automatically be cancelled by the server. Setting a parameter to 0 disables the parameter:

deadlock_timeout

This is the amount of time a deadlocked query should wait before giving up. This defaults to 1000 ms. If your application performs a lot of updates, you may want to increase this value to minimize contention.

Instead of relying on this setting, you can include a NOWAIT clause in your update SQL: SELECT FOR UPDATE NOWAIT ... .

The query will be automatically cancelled upon encountering a deadlock.

In PostgreSQL 9.5, you have another choice: SELECT FOR UPDATE SKIP LOCKED will skip over locked rows.

statement_timeout

This is the amount of time a query can run before it is forced to cancel. This defaults to 0, meaning no time limit. If you have long-running functions that you want cancelled if they exceed a certain time, set this value in the definition of the function rather than globally. Cancelling a function cancels the query and the transaction that’s calling it.

lock_timeout

This is the amount of time a query should wait for a lock before giving up, and is most applicable to update queries. Before data updates, the query must obtain an exclusive lock on affected records. The default is 0, meaning that the query will wait infinitely. This setting is generally used at the function or session level. lock_timeout should be lower than statement_timeout, otherwise statement_timeout will always occur first, making lock_timeout irrelevant.

idle_in_transaction_session_timeout

This is the amount of time a transaction can stay in an idle state before it is terminated. This defaults to 0, meaning it can stay alive infinitely. This setting is new in PostgreSQL 9.6. It’s useful for preventing queries from holding on to locks on data indefinitely or eating up a connection.

Roles

PostgreSQL handles credentialing using roles. Roles that can log in are called login roles. Roles can also be members of other roles; the roles that contain other roles are called group roles. (And yes, group roles can be members of other group roles and so on, but don’t go there unless you have a knack for hierarchical thinking.) Group roles that can log in are called group login roles. However, for security, group roles generally cannot log in. A role can be designated as a superuser. These roles have unfettered access to the PostgreSQL service and should be assigned with discretion.

Warning

Recent versions of PostgreSQL no longer use the terms users and groups. You will still run into these terms; just know that they mean login roles and group roles, respectively. For backward compatibility, CREATE USER and CREATE GROUP still work in current versions, but shun them and use CREATE ROLE instead.

Creating Group Roles

Group roles generally cannot log in. Rather, they serve as containers for other roles. This is merely a best-practice suggestion. Nothing stops you from creating a role that can log in as well as contain other roles.

Create a group role using the following SQL:

CREATE ROLE royalty INHERIT;

Note the use of the modifier INHERIT. This means that any member of royalty will automatically inherit privileges of the royalty role, except for the superuser privilege. For security, PostgreSQL never passes down the superuser privilege. INHERIT is the default, but we recommend that you always include the modifier for clarity.

To refrain from passing privileges from the group to its members, create the role with the NOINHERIT modifier.

To add members to a group role, you would do:

GRANT royalty TO leo;
GRANT royalty TO regina;

Some privileges can’t be inherited. For example, although you can create a group role that you mark as superuser, this doesn’t make its member roles superusers. However, those users can impersonate their group role by using the SET ROLE command, thereby gaining superuser privileges for the duration of the session. For example:

Let’s give the royalty role superuser rights with the command:

ALTER ROLE royalty SUPERUSER;

Although leo is a member of the royalty group and he inherits most rights of royalty, when he logs in, he still will not have superuser rights. He can gain superuser rights by doing:

SET ROLE royalty;

His superuser rights will last only for his current session.

This feature, though peculiar, is useful if you want to prevent yourself from unintentionally doing superuser things while you are logged in.

SET ROLE is a command available to all users, but a more powerful command called SET SESSION AUTHORIZATION is available to people who log in as superusers. In order to understand the differences, we’ll first introduce two global variables that PostgreSQL has called: current_user and session_user. You can see these values when you log in by running the SQL statement:

SELECT session_user, current_user;

When you first log in, the values of these two variables are the same. SET ROLE changes the current_user, while SET SESSION AUTHORIZATION changes both the current_user and session_user variables.

Here are the salient properties of SET ROLE:

A more powerful command, SET SESSION AUTHORIZATION, is available as well. Key features of SET SESSION AUTHORIZATION are as follows:

We’ll do a set of exercises that illustrate the differences between SET ROLE and SET SESSION AUTHORIZATION by first logging in as leo and then running the code in Example 2-7.

Example 2-7. SET ROLE and SET AUTHORIZATION
SELECT session_user, current_user;
 session_user | current_user
--------------+--------------
 leo          | leo
(1 row)
SET SESSION AUTHORIZATION regina;
ERROR:  permission denied to set session authorization
SET ROLE regina;
ERROR:  permission denied to set role "regina"
ALTER ROLE leo SUPERUSER;
ERROR:  must be superuser to alter superusers
SET ROLE royalty;
SELECT session_user, current_user;
 session_user | current_user
--------------+--------------
 leo          | royalty
(1 row)
SET ROLE regina;
ERROR:  permission denied to set role "regina"
ALTER ROLE leo SUPERUSER;
SET ROLE regina;
SELECT session_user, current_user;
 session_user | current_user
--------------+--------------
 leo          | regina
(1 row)
SET SESSION AUTHORIZATION regina;
ERROR:  permission denied to set session authorization
-- After ending session and logging back in as leo
SELECT session_user, current_user;
SET SESSION AUTHORIZATION regina;
SELECT session_user, current_user;
session_user | current_user
--------------+--------------
leo | leo
(1 row)
SET SESSION AUTHORIZATION
session_user | current_user
--------------+--------------
regina | regina
(1 row)

In Example 2-7 leo was unable to use SET SESSION AUTHORIZATION because he’s not a superuser. He was also unable to SET ROLE to regina because he is not in the regina group. However, he was able to SET ROLE royalty since he is a member of the royalty group (he’s a king consort). Even though royalty has superuser rights, he still wasn’t able to impersonate the queen, regina, because his SET ROLE abilities are still based on being the powerless leo. Since royalty is a group that has superuser rights, he was able to promote his own account leo to be a superuser. Once leo is promoted to power, he can then impersonate regina. He is now able to completely take over her session_user and current_user persona with SET SESSION AUTHORIZATION.

Database Creation

The minimum SQL command to create a database is:

CREATE DATABASE mydb;

This creates a copy of the template1 database. Any role with CREATEDB privilege can create new databases.

Template Databases

A template database is, as the name suggests, a database that serves as a skeleton for new databases. When you create a new database, PostgreSQL copies all the database settings and data from the template database to the new database.

The default PostgreSQL installation comes with two template databases: template0 and template1. If you don’t specify a template database to follow when you create a database, template1 is used.

The basic syntax to create a database modeled after a specific template is:

CREATE DATABASE my_db TEMPLATE my_template_db;

You can pick any database to serve as the template. This could come in quite handy when making replicas. You can also mark any database as a template database. Once you do, the database is no longer editable and deletable. Any role with the CREATEDB privilege can use a template database. To make any database a template, run the following SQL as a superuser:

UPDATE pg_database SET datistemplate = TRUE WHERE datname = 'mydb';

If ever you need to edit or drop a template database, first set the datistemplate attribute to FALSE. Don’t forget to change the value back after you’re done with edits.

Using Schemas

Schemas organize your database into logical groups. If you have more than two dozen tables in your database, consider cubbyholing them into schemas. Objects must have unique names within a schema but need not be unique across the database. If you cram all your tables into the default public schema, you’ll run into name clashes sooner or later. It’s up to you how to organize your schemas. For example, if you are an airline, you can place all tables of planes you own and their maintenance records into a planes schema. Place all your crew and staff into an employees schema and place all passenger-related information into a passengers schema.

Another common way to organize schemas is by roles. We found this to be particularly handy with applications that serve multiple clients whose data must be kept separate.

Suppose that you started a dog beauty management business (doggie spa). You start with a table in public called dogs to track all the dogs you hope to groom. You convince your two best friends to become customers. Whimsical government privacy regulation passes, and now you have to put in iron-clad assurances that one customer cannot see dog information from another. To comply, you set up one schema per customer and create the same dogs table in each as follows:

CREATE SCHEMA customer1;
CREATE SCHEMA customer2;

You then move the dog records into the schema that corresponds with the client. The final touch is to create different login roles for each schema with the same name as the schema. Dogs are now completely isolated in their respective schemas. When customers log in to your database to make appointments, they will be able to access only information pertaining to their own dogs.

Wait, it gets better. Because we named our roles to match their respective schemas, we’re blessed with another useful technique. But we must first introduce the search_path database variable.

As we mentioned earlier, object names must be unique within a schema, but you can have same-named objects in different schemas. For example, you have the same table called dogs in all 12 of your schemas. When you execute something like SELECT * FROM dogs, how does PostgreSQL know which schema you’re referring to? The simple answer is to always prepend the schema name onto the table name with a dot, such as in SELECT * FROM customer1.dogs. Another method is to set the search_path variable to be something like customer1, public. When the query executes, the planner searches for the dogs table first in the customer1 schema. If not found, it continues to the public schema and stops there.

PostgreSQL has a little-known variable called user that retrieves the role currently logged in. SELECT user returns this name. user is just an alias for current_user, so you can use either.

Recall how we named our customers’ schemas to be the same as their login roles. We did this so that we can take advantage of the default search path set in postgresql.conf:

search_path = "$user", public;

Now, if role customer1 logs in, all queries will first look in the customer1 schema for the tables before moving to public. Most importantly, the SQL remains the same for all customers. Even if the business grows to have thousands or hundreds of thousands of dog owners, none of the SQL scripts need to change. Commonly shared tables such as common lookup tables can be put in the public schema.

Another practice that we strongly encourage is to create schemas to house extensions (“Step 2: Installing into a database”). When you install an extension, new tables, functions, data types, and plenty of other relics join your server. If they all swarm into the public schema, it gets cluttered. For example, the entire PostGIS suite of extensions will together add thousands of functions. If you’ve already created a few tables and functions of your own in the public schema, imagine how maddening it would be to scan a list of tables and functions trying to find your own among the thousands.

Before you install any extensions, create a new schema:

CREATE SCHEMA my_extensions;

Then add your new schema to the search path:

ALTER DATABASE mydb SET search_path='$user', public, my_extensions;

When you install extensions, be sure to indicate your new schema as their new home.

Privileges

Privileges (often called permissions) can be tricky to administer in PostgreSQL because of the granular control at your disposal. Security can bore down to the column and row level. Yes! You can assign different privileges to each data point of your table, if that ever becomes necessary.

Note

Row-level security (RLS) first appeared in PostgreSQL 9.5. Although RLS is available on all PostgreSQL installations, when used in SELinux, certain advanced features are enabled.

Teaching you all there is to know about privileges could take a few chapters. What we’ll aim for in this section instead is to give you enough information to get up and running and to guide you around some of the more nonintuitive land mines that could either lock you out completely or expose your server inappropriately.

Privilege management in PostgreSQL is no cakewalk. The pgAdmin graphical administration tool can ease some of the tasks or, at the very least, paint you a picture of your privilege settings. You can accomplish most, if not all, of your privilege assignment tasks in pgAdmin. If you’re saddled with the task of administering privileges and are new to PostgreSQL, start with pgAdmin. Jump to “Creating Database Assets and Setting Privileges” if you can’t wait.

GRANT

The GRANT command is the primary means to assign privileges. Basic usage is:

GRANT some_privilege TO some_role;

A few things to keep in mind when it comes to GRANT:

The GRANT command is covered in detail in GRANT. We strongly recommend that you take the time to study this document before you inadvertently knock a big hole in your security wall.

Some privileges are, by default, granted to PUBLIC. These are CONNECT and CREATE TEMP TABLE for databases and EXECUTE for functions. In many cases you might consider revoking some of the defaults with the REVOKE command, as in:

REVOKE EXECUTE ON ALL FUNCTIONS IN SCHEMA my_schema FROM PUBLIC;

Default Privileges

Default privileges ease privilege management by letting you set privileges before their creation.

Let’s suppose we want all users of our database to have EXECUTE and SELECT privileges access to any future tables and functions in a particular schema. We can define privileges as shown in Example 2-8. All roles of a PostgreSQL server are members of the group PUBLIC.

Example 2-8. Defining default privileges on a schema
GRANT USAGE ON SCHEMA my_schema TO PUBLIC; 1

ALTER DEFAULT PRIVILEGES IN SCHEMA my_schema
GRANT SELECT, REFERENCES ON TABLES TO PUBLIC; 2

ALTER DEFAULT PRIVILEGES IN SCHEMA my_schema
GRANT ALL ON TABLES TO mydb_admin WITH GRANT OPTION;  3

ALTER DEFAULT PRIVILEGES IN SCHEMA my_schema  4
GRANT SELECT, UPDATE ON SEQUENCES TO public;

ALTER DEFAULT PRIVILEGES IN SCHEMA my_schema 5
GRANT ALL ON FUNCTIONS TO mydb_admin WITH GRANT OPTION;

ALTER DEFAULT PRIVILEGES IN SCHEMA my_schema 6
GRANT USAGE ON TYPES TO PUBLIC;
1

Allows all users that can connect to the database to also be able to use and create objects in a schema if they have rights to those objects in the schema. GRANT USAGE on a schema is the first step to granting access to objects in the schema. If a user has rights to select from a table in a schema but no USAGE on the schema, then he will not be able to query the table.

2

Grant read and reference rights (the ability to create foreign key constraints against columns in a table) for all future tables created in a schema to all users that have USAGE of the schema.

3

GRANT ALL permissions on future tables to role mydb_admin. In addition, allow members in mydb_admin to be able to grant a subset or all privileges to other users to future tables in this schema. GRANT ALL gives permission to add/update/delete/truncate rows, add triggers, and create constraints on the tables.

456

GRANT permissions on future sequences, functions, and types.

To read more about default privileges, see ALTER DEFAULT PRIVILEGES.

Extensions

Extensions, formerly called contribs, are add-ons that you can install in a PostgreSQL database to extend functionality beyond the base offerings. They exemplify the best of open source software: people collaborating, building, and freely sharing new features. Since version 9.1, the new extension model has made adding extensions a cinch.

Tip

Older add-ons outside the extension model are still called contribs, but with an eye toward the future, we’ll call them all extensions.

Not all extensions need to be in all databases. You should install extensions to your individual database on an as-needed basis. If you want all your databases to have a certain set of extensions, you can develop a template database, as discussed in “Template Databases”, with all the extensions installed, and then beget future databases from that template.

Occasionally prune extensions that you no longer need to avoid bloat. Leaving old extensions you don’t need may cause problems during an in-place upgrade since all extensions you have installed must be also installed in the new PostgreSQL version you are upgrading to.

To see which extensions you have already installed in a database, connect to the database and run the query in Example 2-9. Your list could vary significantly from ours.

Example 2-9. Extensions installed in a database
SELECT name, default_version, installed_version, left(comment,30) As comment
FROM pg_available_extensions
WHERE installed_version IS NOT NULL
ORDER BY name;
     name      | default_version | installed_version |            comment
---------------+-----------------+-------------------+--------------------------------
 btree_gist    | 1.5             | 1.5               | support for indexing common da
 fuzzystrmatch | 1.1             | 1.1               | determine similarities and dis
 hstore        | 1.4             | 1.4               | data type for storing sets of
 ogr_fdw       | 1.0             | 1.0               | foreign-data wrapper for GIS d
 pgrouting     | 2.4.1           | 2.4.1             | pgRouting Extension
 plpgsql       | 1.0             | 1.0               | PL/pgSQL procedural language
 plv8          | 1.4.10          | 1.4.10            | PL/JavaScript (v8) trusted pro
 postgis       | 2.4.0dev        | 2.4.0dev          | PostGIS geometry, geography, a
(8 rows)

If you want to see all the extensions installed on the server, regardless of if they are installed in your current database, leave out the WHERE installed_version IS NOT NULL.

To get more details about a particular extension already installed in your database, enter the following command from psql:

\dx+ fuzzystrmatch

Alternatively, execute the following query:

SELECT pg_describe_object(D.classid,D.objid,0) AS description
FROM pg_catalog.pg_depend AS D INNER JOIN pg_catalog.pg_extension AS E
ON D.refobjid = E.oid
WHERE
D.refclassid = 'pg_catalog.pg_extension'::pg_catalog.regclass AND
deptype = 'e' AND
E.extname = 'fuzzystrmatch';

This shows what’s packaged in the extension:

description
---------------------------------------------------------------------------
function dmetaphone_alt(text)
function dmetaphone(text)
function difference(text,text)
function text_soundex(text)
function soundex(text)
function metaphone(text,integer)
function levenshtein_less_equal(text,text,integer,integer,integer,integer)
function levenshtein_less_equal(text,text,integer)
function levenshtein(text,text,integer,integer,integer)
function levenshtein(text,text)

Extensions can include database assets of all types: functions, tables, data types, casts, languages, operators, etc., but functions usually constitute the bulk of the payload.

Installing Extensions

Getting an extension into your database takes two installation steps. First, download the extension and install it onto your server. Second, install the extension into your database.

We cover both steps in this section as well as how to install on PostgreSQL versions prior to extension support.

Common Extensions

Many extensions come packaged with PostgreSQL but are not installed by default. Some past extensions have gained enough traction to become part of the PostgreSQL core. If you’re upgrading from an ancient version, you may gain functionality without needing any extensions.

Backup and Restore

PostgreSQL ships with three utilities for backup: pg_dump, pg_dumpall, and pg_basebackup. You’ll find all of them in the PostgreSQL bin folder.

Use pg_dump to back up specific databases. To back up all databases in plain text along with server globals, use pg_dumpall, which needs to run under a superuser account so that it back up all databases. Use pg_basebackup to do system-level disk backup of all databases.

For the rest of this section, we’ll focus our discussion on using pg_dump and pg_dumpall. pg_basebackup is the most efficient way of doing a full postgresql server cluster backup. If you have a reasonably sized database, as in 500 GB or more, you should be using pg_basebackup as part of your backup strategy. pg_basebackup, however, requires enabling of features that are often turned off, but that are also needed for replication, so we’ll save discussion of pg_basebackup for “Setting Up Full Server Replication”.

Most of the command-line options for these tools exist both in GNU style (two hyphens plus a word) and the traditional single-letter style (one hyphen plus an alphabetic character). You can use both styles interchangeably, even in the same command. We’ll be covering just the basics here; for a more in-depth discussion, see the PostgreSQL documentation Backup and Restore.

In this section we will not discuss third-party tools that are often used for PostgreSQL backup and restore. Two popular open source ones you might want to consider are pgBackRest and Barman. These offer additional features like backup scheduling, multiserver support, and restore shortcuts.

As you wade through this section, you’ll find that we often specify the port and host in our examples. This is because we often run backups for a different server as scheduled jobs using pgAgent, as discussed in “Job Scheduling with pgAgent”. We often have multiple instances of PostgreSQL running on the same machine, on different ports as well. Sometimes specifying the host can cause problems if your service is set to listen only on localhost. You can safely leave out the host if you are running the examples directly on the server.

You may also want to create a ~/.pgpass file to store all passwords. pg_dump and pg_dumpall don’t have password options. Alternatively, you can set a password in the PGPASSWORD environment variable.

Selective Backup Using pg_dump

For day-to-day backup, pg_dump is more expeditious than pg_dumpall because pg_dump can selectively back up tables, schemas, and databases. pg_dump can back up to plain SQL, as well as compressed, TAR, and directory formats. Compressed, TAR, and directory format backups can take advantage of the parallel restore feature of pg_restore. Directory backups allow parallel pg_dump of a large database. Because we believe you’ll be using pg_dump as part of your daily regimen, we have included a full dump of the help in “Database Backup Using pg_dump” so you can see the myriad switches in a single glance.

The next examples demonstrate a few common backup scenarios and corresponding pg_dump options. They should work for any version of PostgreSQL.

To create a compressed, single database backup:

pg_dump -h localhost -p 5432 -U someuser -F c -b -v -f mydb.backup mydb

To create a plain-text single database backup, including a -C option, which stands for CREATE DATABASE:

pg_dump -h localhost -p 5432 -U someuser -C -F p -b -v -f mydb.backup mydb

To create a compressed backup of tables whose names start with pay in any schema:

pg_dump -h localhost -p 5432 -U someuser -F c -b -v -t *.pay* -f pay.backup mydb

To create a compressed backup of all objects in the hr and payroll schemas:

pg_dump -h localhost -p 5432 -U someuser -F c -b -v \
-n hr -n payroll -f hr.backup mydb

To create a compressed backup of all objects in all schemas, excluding the public schema:

pg_dump -h localhost -p 5432 -U someuser -F c -b -v -N public \
-f all_sch_except_pub.backup mydb

To create a plain-text SQL backup of select tables, useful for porting structure and data to lower versions of PostgreSQL or non-PostgreSQL databases (plain text generates an SQL script that you can run on any system that speaks SQL):

pg_dump -h localhost -p 5432 -U someuser -F p --column-inserts \
-f select_tables.backup mydb

The directory format option was introduced in version PostgreSQL 9.1. This option backs up each table as a separate file in a folder and gets around file size limitations. This option is the only pg_dump backup format option that results in multiple files, as shown in Example 2-10. It creates a new directory and populates it with a gzipped file for each table; also included is a file listing the hierarchy. This backup command exits with an error if the directory already exists.

Example 2-10. Directory format backup
pg_dump -h localhost -p 5432 -U someuser -F d -f /somepath/a_directory mydb

A parallel backup option was introduced in version 9.3 using the --jobs or -j option and specifying the number of jobs. For example: --jobs=3 (-j 3) runs three backups in parallel. Parallel backup makes sense only with the directory format option, because it’s the only backup where multiple files are created. Example 2-11 demonstrates its use.

Systemwide Backup Using pg_dumpall

Use the pg_dumpall utility to back up all databases on a server into a single plain-text file. This comprehensive backup automatically includes server globals such as tablespace definitions and roles. See “Server Backup: pg_dumpall” for a listing of available pg_dumpall command options.

It’s a good idea to back up globals on a daily basis. Although you can use pg_dumpall to back up databases as well, we prefer backing up databases individually using pg_dump or using pg_basebackup to do a PostgreSQL service-level backup. Restoring from a huge plain-text backup tries our patience. Using pg_basebackup in conjunction with streaming replication is the fastest way to recover from major server failure.

To back up all globals and tablespace definitions only, use the following:

pg_dumpall -h localhost -U postgres --port=5432 -f myglobals.sql --globals-only

To back up specific global settings, use the following:

pg_dumpall -h localhost -U postgres --port=5432 -f myroles.sql --roles-only

Restoring Data

There are two ways to restore data in PostgreSQL from backups created with pg_dump or pg_dumpall:

Using pg_restore

If you backed up using pg_dump and chose a format such as TAR, custom, or directory, you have to use the pg_restore utility to restore. pg_restore provides a dizzying array of options, far surpassing the restore utility found in other database products we’ve used. Some of its outstanding features include:

See “Database Restore: pg_restore” for a listing of pg_restore options.

To perform a restore using pg_restore, first create the database anew using SQL:

CREATE DATABASE mydb;

Then restore:

pg_restore --dbname=mydb --jobs=4 --verbose mydb.backup

If the name of the database is the same as the one you backed up, you can create and restore the database in one step:

pg_restore --dbname=postgres --create --jobs=4 --verbose mydb.backup

When you use the --create option, the database name is always the name of the one you backed up. You can’t rename it. If you’re also using the --dbname option, that database name must be different from the name of the database being restored. We usually just specify the postgres database.

Normally, a restore will not re-create objects already present in a database. If you have data in the database, and you want to replace it with what’s in the backup, you need to add the --clean switch to the pg_restore command. This will cause objects to be dropped from the current database so that restore can re-create them.

With PostgreSQL 9.2 or later, you can take advantage of the --section option to restore just the structure without the data. This is useful if you want to use an existing database as a template for a new one. To do so, first create the target database:

CREATE DATABASE mydb2;

Then use pg_restore:

pg_restore --dbname=mydb2 --section=pre-data --jobs=4 mydb.backup

Managing Disk Storage with Tablespaces

PostgreSQL uses tablespaces to ascribe logical names to physical locations on disk. Initializing a PostgreSQL cluster automatically begets two tablespaces: pg_default, which stores all user data, and pg_global, which stores all system data. These are located in the same folder as your default data cluster. You’re free to create tablespaces at will and house them on any serverdisks. You can explicitly assign default tablespaces for new objects by database. You can also move existing database objects to new ones.

Verboten Practices

We have acted as first responders to many PostgreSQL accidents, so we thought it best to end this chapter by itemizing the most common mistakes.

For starters, if you don’t know what you did wrong, the logfile could provide clues. Look for the pg_log folder in your PostgreSQL data folder or the root of the PostgreSQL data folder. It’s also possible that your server shut down before a log entry could be written, in which case the log won’t help you. If your server fails to restart, try the following from the OS command line:

path/to/your/bin/pg_ctl -D your_postgresql_data_folder

Don’t Delete PostgreSQL Core System Files and Binaries

Perhaps this is stating the obvious, but when people run out of disk space, the first thing they do is start deleting files from the PostgreSQL data cluster folder because it’s so darn big. Part of the reason this mistake happens so frequently is that some folders sport innocuous names such as pg_log, pg_xlog, and pg_clog. Yes, there are some files you can safely delete, but unless you know precisely which ones, you could end up destroying your data.

The pg_log folder, often found in your data folder, is a folder that builds up quickly, especially if you have logging enabled. You can always purge files from this folder without harm. In fact, many people schedule jobs to remove logfiles on a regular basis.

Files in the other folders, except for pg_xlog, should never be deleted, even if they have log-sounding names. Don’t even think of touching pg_clog, the active commit log, unless you want to invite disaster.

pg_xlog stores transaction logs. Some systems we’ve seen are configured to move processed transaction logs into a subfolder called archive. You’ll often have an archive folder somewhere (not necessarily as a subfolder of pg_xlog) if you are running synchronous replication, doing continuous archiving, or just keeping logs around in case you need to revert to a different point in time. Deleting files in the root of pg_xlog will mess up the process. Deleting files in the archived folder will just prevent you from performing point-in-time recovery, or if a slave server hasn’t played back the logs, will prevent the slave from fetching them. If these scenarios don’t apply to you, it’s safe to remove files in the archive folder.

Be leery of overzealous antivirus programs, especially on Windows. We’ve seen cases in which antivirus software removed important binaries in the PostgreSQL bin folder. If PostgreSQL fails to start on a Windows system, the event viewer is the first place to look for clues as to why.

Don’t Try to Start PostgreSQL on a Port Already in Use

If you try to start PostgreSQL on a port that’s already in use, you’ll see errors in your pg_log files of the form: make sure PostgreSQL is not already running. Here are the common reasons why this happens: