From the PostgreSQL docs:
[full page writes are] needed because a page write that is in process during an operating system crash might be only partially completed, leading to an on-disk page that contains a mix of old and new data. The row-level change data normally stored in WAL will not be enough to completely restore such a page during post-crash recovery. Storing the full page image guarantees that the page can be correctly restored…
How come the WAL logs are not enough to do a full restore? My understanding is that they contain all page updates since (at least) the last checkpoint.
Can someone give an example of how simply replaying the WAL logs from the last checkpoint could still result in data loss?
Many (most?) redo records actually need the block to be in some reasonable state in order to replay into it.
Not all WAL records are row-level. Say you read and replay a WAL record that says “I defragmented this block according to the usual rule of defragmenting blocks”. This would be a deterministic operation given the correct starting state of the block, but given a mixture of block states it is not deterministic and so cannot be replayed safely. It might be possible to enumerate which WAL record types behave this way, and only have those types lead to FPW, but that seems brittle.
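To make that concrete, here is a toy sketch in Python. Nothing in it mirrors PostgreSQL internals: the 8-slot "page", the record tuples, and the functions `redo_insert`, `redo_compact` and `replay` are all made up for illustration. It also assumes, for simplicity, that the first record after the checkpoint carries an image of the page with that first change already applied.

```python
# Toy model only: a "page" is 8 slots, None marks a free slot.
# All names are hypothetical and do not mirror PostgreSQL internals.

def redo_insert(page, slot, value):
    """Row-level redo: store value in the given slot."""
    page = list(page)
    page[slot] = value
    return page

def redo_compact(page):
    """Redo for 'I defragmented this page': move live items to the front."""
    live = [v for v in page if v is not None]
    return live + [None] * (len(page) - len(live))

def replay(page, records):
    """Apply WAL records in order, starting from whatever is on disk."""
    for rec in records:
        if rec[0] == "insert":
            page = redo_insert(page, rec[1], rec[2])
        elif rec[0] == "compact":
            page = redo_compact(page)
        elif rec[0] == "fpi":
            # Full page image: ignore the on-disk contents entirely.
            page = list(rec[1])
    return page

# Page contents as of the last checkpoint.
checkpoint_page = ["a", None, "b", None, "c", None, None, None]

# Changes logged since the checkpoint, without full page images.
wal = [("insert", 1, "x"), ("compact",), ("insert", 4, "y")]

# What the page should look like after all three changes.
expected = replay(checkpoint_page, wal)

# Torn write: the first half of the page is still the checkpoint state,
# the second half already holds the final state.
torn_page = checkpoint_page[:4] + expected[4:]

# Replaying the same WAL into the torn page silently drops "c"
# and leaves "y" on the page twice.
recovered = replay(torn_page, wal)

# Same WAL, but the first record is replaced by a full page image.
wal_fpw = [("fpi", redo_insert(checkpoint_page, 1, "x"))] + wal[1:]
recovered_fpw = replay(torn_page, wal_fpw)

print("expected      :", expected)
print("no FPW        :", recovered)
print("with FPW      :", recovered_fpw)
```

In this toy, the row-level `insert` records replay fine into almost any page, but `compact` depends on the whole starting state, so the torn page ends up missing a row even though every WAL record was replayed. The full page image fixes that by overwriting the torn contents before any other record is applied.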
If a block did suffer a torn write, I for one would not want to rely on all the bytes in that block being exactly either the old state or new state, in precisely two contiguous chunks.
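As a rough illustration of why "two contiguous chunks" is too optimistic, the sketch below models a page as a run of sectors and lets an arbitrary subset of them reach disk. The 512-byte sector size and the helper `torn_write` are assumptions for the example; 8192 bytes is PostgreSQL's default block size.

```python
SECTOR = 512   # assumed sector size; real devices vary
PAGE = 8192    # PostgreSQL's default block size

def torn_write(old, new, persisted):
    """Return the on-disk page if only the sectors in `persisted` were written."""
    assert len(old) == len(new) == PAGE
    out = bytearray(old)
    for s in persisted:
        out[s * SECTOR:(s + 1) * SECTOR] = new[s * SECTOR:(s + 1) * SECTOR]
    return bytes(out)

old = bytes([0]) * PAGE   # old page: all zero bytes
new = bytes([1]) * PAGE   # new page: all one bytes

# Hypothetical crash: sectors 0, 3, 4 and 9 made it to disk, the rest did not.
on_disk = torn_write(old, new, persisted={0, 3, 4, 9})

# One marker per sector: 1 = new data, 0 = old data.
pattern = [on_disk[s * SECTOR] for s in range(PAGE // SECTOR)]
print(pattern)
```

Nothing forces the persisted sectors to form a prefix of the page, so recovery code that assumes "new up to some offset, old after it" would be guessing.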
If the WAL could not be read entirely to the end (and how would one ever know if it was?) then there may be some blocks that have “new” data which was in fact “future” data, for a future which no longer exists. In theory this should never happen, as the WAL was synced to disk before the block data was released. But in the nanoseconds when a crash is evolving but has not yet manifested, who knows what could happen? Better to lose a few transactions that were reported to have been committed than to corrupt the system entirely.