FACTORS AFFECTING CRASH RECOVERY MECHANISM IN POSTGRESQL

Our expectation from OLTP databases is that the system will be reopened without data loss and consistently after a sudden/unexpected shutdown, that is, with committed transactions completed and uncommitted transactions rolled back. We call this process crash recovery, and WAL (Write Ahead Log) is at the center of the crash recovery mechanism in PostgreSQL. PostgreSQL has all the necessary capabilities to meet the expectations we mentioned thanks to the WAL system, but in order to be able to say that this process will be completed without problems in a PostgreSQL database, and that the database will be opened consistently and without data loss at every shutdown, it is necessary to examine some configurations. In this article, we will examine the parameters, reasons for existence, and configuration options that will affect the crash recovery mechanism.

First, let’s remember the WAL writing system; information about a transaction in the database is first written to the memory area we call WAL buffers before being written to the WAL file. We can think of this as a cache application to speed up the WAL writing process. However, when the user commits the transaction, all information about this transaction must be written from the WAL buffers to the WAL files on disk, and the commit result is not returned to the user before this writing to disk is completed. This WAL writing process is what ensures that committed data is not lost in crash recovery, because even if the changes related to the transaction have not yet been transferred to the files of the tables on disk, there is no data loss thanks to the information being in the WAL files on disk. However, the WAL buffer is not the only cache factor in the WAL writing process in question. As seen below, this writing process goes through different cache mechanisms depending on the system configuration.

WAL Buffer

OS Cache

RAID Controller Cache

Disk Drive Cache

WAL Segment

To ensure that a WAL write operation has completed writing to disk by passing through the OS cache layer, PostgreSQL uses “fsync” or one of the similar system calls. (A system call can be defined as a program requesting a service from the operating system kernel.) However, this mechanism can be turned on or off in PostgreSQL with a parameter. The name of this parameter is again “fsync”. This mechanism, which is on by default with the value “on”, if it is turned off with “fsync=off”, PostgreSQL will send the WAL write operations to the operating system without providing “fsync” or similar validation and will not be able to guarantee that they have been written to disk. So why is this preferable? If the data in our database can be easily recreated and we want to perform very high-intensity operations in our current database with higher performance, we can consider the “fsync=off” configuration. However, we should keep in mind that our database may not be opened consistently in the event of a sudden shutdown, and that there is a possibility of data corruption and data loss. Let’s examine the small test below to see how much the database operation times will change in the “fsync=off” case. We see that the INSERT operation, which takes about 70 seconds with “fsync=on”, decreases to 38 seconds when “fsync=off” is set. The UPDATE operation decreases from 5 seconds to 3.6 seconds. (You can see that the effect changes under different conditions and different test methods.)

postgres=# CREATE TABLE Customers (
    ID  serial ,
    Name varchar(50) NOT NULL,
    Phone varchar(15) NOT NULL,
    PRIMARY KEY (ID)
);
CREATE TABLE

postgres=# \timing
Timing is on.


-- fsync=on

postgres=# INSERT INTO Customers( Name, Phone)
SELECT  left(md5(random()::text),10),
left(md5(random()::text),10)
FROM generate_series(1,10000000);
INSERT 0 10000000
Time: 69956.565 ms (01:09.957)

postgres=# UPDATE Customers set Phone = 1111111111 where ID<1000000;
UPDATE 999999
Time: 4942.934 ms (00:04.943)

postgres=# DELETE FROM Customers where ID>9000000;
DELETE 1000000
Time: 818.573 ms


-- Drop and recreate table
-- fsync=off

postgres=# INSERT INTO Customers( Name, Phone)
SELECT  left(md5(random()::text),10),
left(md5(random()::text),10)
FROM generate_series(1,10000000);
INSERT 0 10000000
Time: 37993.590 ms (00:37.994)

postgres=#  UPDATE Customers set Phone = 1111111111 where ID<1000000;
UPDATE 999999
Time: 3636.351 ms (00:03.636)

postgres=# DELETE FROM Customers where ID>9000000;
DELETE 1000000
Time: 1019.045 ms (00:01.019)

The parameter that determines the method we referred to above as “fsync or similar system calls” is the wal_sync_method parameter. You can access the values ​​that this parameter will take from this link. The default value changes depending on the operating system platform. For example, the default value is fdatasync in Linux and open_datasync in Windows. In order to change this value, it is necessary to understand how all these system calls work and to have a valid/conscious reason. Otherwise, we do not recommend changing the default value.

Another layer that we need to be aware of in the WAL write mechanism is the RAID controller cache layer. If we are using a RAID card with a cache on it in our system, all I/O operations pass through this cache. If “Write-Back Cache” is used as the cache method, the WAL write operation is written to the RAID controller cache and the information that the write operation is completed is returned to PostgreSQL before it is written to the disk drive. (At this point, we recommend that you read this article to get more information about RAID cache methods.) If Write-Back cache is to be used, it is very important to have batteries called BBU (Battery Backup Unit) on the RAID card. Thanks to these batteries, the data on the RAID controller cache is not lost immediately in the event of a power outage and the cache remains standing with the help of the BBU until the power is restored. The configuration in this layer is another factor that determines the WAL consistency of our PostgreSQL database, i.e. whether a successful “crash recovery” process is guaranteed. As in the RAID controller layer, cache usage may also be possible in the disk drive layer. We can query and close this cache configuration via operating systems. For example, in Linux, the commands “hdparm” for SATA disks and “sdparm” for SCSI disks can be used. You can access more detailed information from this link.

Another parameter that affects the WAL writing process is the “full_page_writes” parameter. In PostgreSQL, the pages (data blocks) that will be modified with a checkpoint trigger are completely written to the WAL before the change is made. In this way, the original version of the page is backed up on the WAL in case of a crash while writing to the page. If this backup is not made, it is likely that there will be a data set on the page that is partially updated and partially not updated, i.e. corruption. Although turning this parameter off with full_page_writes=off provides performance gains in operations on the database, it will also increase the risk of not performing a healthy crash recovery process, i.e. the possibility of data loss in the event of a crash.

At this point, let’s repeat our test and see roughly how much performance gain there will be in the “full_page_writes=off” case. When we repeat the operations, we see that the INSERT operation takes approximately 51 seconds with “fsync=on” and “full_page_writes=off”.

-- Drop and recreate table
-- fsync=on
-- full_page_writes=off

postgres=# INSERT INTO Customers( Name, Phone)
SELECT  left(md5(random()::text),10),
left(md5(random()::text),10)
FROM generate_series(1,10000000);
INSERT 0 10000000
Time: 51134.514 ms (00:51.135)

postgres=#  UPDATE Customers set Phone = 1111111111 where ID<1000000;
UPDATE 999999
Time: 3929.867 ms (00:03.930)

postgres=# DELETE FROM Customers where ID>7000000;
DELETE 1000000
Time: 933.309 ms

Finally, let’s look at another element that directly affects the WAL writing process, the “Asynchronous Commit” feature. This feature actually changes the mechanism that works synchronously by default, that is, does not return the commit result to the user before the writing process to the WAL file is completed. With the “Asynchronous Commit” configuration, which is activated by using “synchronous_commit=off”, it is not expected that the WAL records will be written with every commit. The WAL writing process can be delayed by a maximum of “wal_writer_delay” x 3ms. “Asynchronous Commit” is actually another option that allows the transactions to be completed faster, accepting the risk of losing the latest transactions in the event of a crash. There are also different values ​​that we can set for the “synchronous_commit” parameter other than on and off, and as seen below, the values ​​of the parameter also control the data consistency in the standby database, if any.

  • off = no consistent commits on primary
  • local = consistent commits on primary
  • remote_write = local + consistent commit on standby after PostgreSQL crash
  • on (default) = remote_write + consistent commit on standby after OS crash (if using synchronous replication)
  • remote_apply = on + standby query consistency

In conclusion, all the configuration options and parameters we have examined affect the crash recovery behavior of our PostgreSQL database. In this article, we have seen which factors we need to examine to ensure that our databases open consistently after a crash, and we have also examined which parameters we can use to increase performance, in cases where performance is more important than consistency.

Bir yanıt yazın

E-posta adresiniz yayınlanmayacak. Gerekli alanlar * ile işaretlenmişlerdir