Architecture of PostgreSQL
Understand the core architecture of PostgreSQL, including its client-server model, tablespaces, shared memory management, and utility processes. Learn how data flows within the system, how transactions ensure consistency, and how components like WAL and checkpointer maintain reliability and performance.
To understand how PostgreSQL stores and retrieves data, it’s important to understand its underlying architecture.
Key components of the PostgreSQL database
The architecture of PostgreSQL includes three major components: the client, the server process, and the database files.
Client: This refers to any program that requests data retrieval or storage with the PostgreSQL database.
Server: Server processes manage all the data stored in PostgreSQL databases. It interacts directly with database files to read and write their data.
Database files: These make up the physical storage for all the data in a PostgreSQL database. They contain information about each table, index, and other database components.
Tablespace
A tablespace is a location on a disk where the database system stores objects like tables, indexes, and other related files. When a database is created, a default tablespace is specified, and all database objects are created in this tablespace unless otherwise specified.
Tablespaces provide a way to organize database objects and manage disk space efficiently. For example, we can create a tablespace on a separate disk partition or drive to improve performance, or create a tablespace with a specific backup policy.
PostgreSQL also provides a special tablespace called pg_global, which is used to store shared system catalogs that are accessible to all the databases in a PostgreSQL cluster. By default, all system catalogs are stored in the pg_global tablespace.
Data flow
The use of transactions controls the flow of data in PostgreSQL. Transactions are a set of SQL commands executed as a unit; either all commands succeed, or none is executed. This makes it possible to ensure the consistency of the data in the database.
Each transaction has an associated transaction ID, which is used to identify it when it’s committed or rolled back. If a transaction is aborted (for example, because of a power failure), its effects are undone, and the database is restored to its state before the transaction began.
The flow of data in PostgreSQL can be summarized as follows:
The client connects to the server and sends a request for a connection.
The server authenticates the client and establishes a connection.
The client sends a request for a transaction.
The server determines whether or not to allow the transaction, and if so, starts a new process to execute it.
The client sends SQL commands to the server.
The server executes the commands and returns the results to the client.
When the client has finished sending commands, it commits or rolls back the transaction.
While transactions are an important part of the data flow in PostgreSQL, other factors also contribute to its stability and efficiency. Some of these include indexes, views, and the PostgreSQL query planner. Overall, the data flow in this powerful database management system is carefully controlled to ensure consistency and reliability.
Shared memory
Work mem is a portion of memory used to store data and indexes for currently accessed tables. When a table is accessed, its data is read from the work mem into memory, which helps to improve performance.
Maintenance work mem is a portion of memory used for maintenance operations such as checking for dead space, updating indexes, and reindexing data. When a maintenance operation is performed, its work mem is temporarily increased to help improve performance.
The shared buffer is a portion of memory used to store data and indexes for tables currently in use. When a table is accessed, its data is read from the shared buffer into memory, which helps to improve performance. The shared buffer size can be controlled using the shared_buffers parameter in the configuration file.
WAL buffers are used to store data and indexes for tables that are currently being updated. When a table is updated, its data is written to WAL buffers instead of the shared buffer. This helps to minimize the impact of updates on performance.
Utility processes
There are several processes involved in the flow of data in PostgreSQL.
Postmaster: This is a process that runs on the server and manages the data flow between the client and the database. It accepts client requests, assigns them to processes, and manages the communication between them and the database.
Postgres: This is the main process that manages data in the database. It receives and executes client queries and retrieves and updates data based on them.
Checkpointer: This process monitors the postgres process to ensure it works correctly. If a problem is detected, it will trigger a failover and cause another server process to take over.
Write-ahead logging (WAL): PostgreSQL uses the WAL system to manage the process of writing dirty buffers to disk. This ensures that data is always written consistently, even if the database crashes unexpectedly.
Dirty buffer: This stores modified or new data that hasn’t yet been written to the disk. It allows the database to continue operating even if the disk is full or unavailable. When the buffer is full, the oldest data is written to the disk by the background writer process.
Background writer: This process periodically writes changes to the disk, which helps prevent database corruption if there’s a power failure or other system problem.
WAL writer: This process writes data to the write-ahead logs (WALs), which helps ensure the consistency of the database by writing changes to the disk before they’re applied to the database. This helps to prevent database corruption in the event of a system failure. They allow the database to recover from a crash without losing any data, which helps minimize downtime. In addition, WALs also help to improve performance by reducing the amount of I/O that’s necessary.
Archiver: This process helps to manage backups.
Autovacuum: This process regularly examines tables for dead space and other inconsistencies.
While many different processes are involved in the flow of data in PostgreSQL, they all work together to ensure that the database remains stable, reliable, and efficient at all times.
Checkpointer
The checkpointer is a background process that periodically performs checkpoints to ensure that all the changes made to the database are safely written to disk. It’s an essential component of the PostgreSQL crash recovery mechanism, as it ensures that the database can be recovered consistently in case of a system failure. The checkpointer process runs in the background and doesn’t usually require any manual intervention. However, database administrators can configure various checkpoint-related parameters to optimize the process's performance and behavior.
Checkpoints can be triggered by various factors, such as a predefined time interval, a certain number of changes made, or a manual request by the administrator. Checkpoint frequency and tuning can significantly impact database performance and recovery time.
Data files
Data files are the physical files that store data for a database. They contain one or more
When a table is created, its associated data and index files are automatically created in the tablespace file. The size of these files depends on the amount of data stored in the table. As the table grows, its associated files also grow.
Data files are important because they contain all the information necessary to rebuild a table. If a table is deleted, its associated data and index files are also deleted. To restore a deleted table, we must first restore its associated data files.
In PostgreSQL, archived WAL files provide a consistent point-in-time view of the data in the database. For example, we can use them to create a replica of our database for testing or development purposes. The WAL writer process creates them and contains a copy of all the data written to the write-ahead log since the last
All components of the PostgreSQL database work together to ensure that the database is stable, reliable, and efficient. Data files also play an important role in storing all the data and indexes.
Logical structure of a database cluster
The logical structure of a PostgreSQL database cluster consists of databases, tables, and other database objects like indexes, views, functions, schemas, users, roles, privileges, etc. All of these components are managed by the PostgreSQL server software, which is responsible for processing queries, managing transactions, and maintaining the integrity of the data in the cluster. The logical structure of a database cluster can be modified using SQL commands, which are used to create, modify, and delete database objects and grant and revoke privileges to users and roles.
Database cluster
In PostgreSQL, a database cluster is a collection of databases managed by a single instance of the server software. The database cluster includes all the files, directories, and configuration settings required to run the PostgreSQL server and manage its databases.
When we install PostgreSQL, we’re actually creating a new database cluster. Each database cluster has a unique name and location on the disk specified during installation. A database cluster can contain multiple databases, each with its own set of tables, indexes, and other objects. The databases within a cluster share the same configuration settings and security permissions but are otherwise independent.
The pg_ctl utility manages the database cluster and is used to start, stop, and manage the PostgreSQL server. The pg_hba.conf file, which controls client authentication and access to the server, is also stored in the database cluster.
It’s important to note that each PostgreSQL instance can only manage a single database cluster simultaneously. If we want to manage multiple database clusters, we must install multiple instances of the PostgreSQL server, each with its configuration files and directories.