Hacker News

ciconia
SQLite's documentation about its durability properties is unclear (agwa.name)

liuliu a day ago

Marc Brooker said this: https://x.com/MarcJBrooker/status/1960809302333251876 which echoes my sentiment. I will just repost it here:

> More broadly, I don't think a single definition of 'durable' (as in ACID D) for transactions is particularly useful.

> Much more useful is to ask "what kinds of failures could cause committed transactions to be lost?"

All these articles talking about durability as a singular term should come with a warning. Writing to disk is not durable under certain circumstances. fsync is not durable under certain circumstances. Two-phase commit is not durable under certain circumstances. Even multi-data-center commit is not durable against the Death Star.

yndoendo 16 hours ago

I used SQLite in an embedded life-safety system for years. It was storing settings and history events on an industrial MicroSD card. Failures happened for the most common reasons they happen across all other products.

1) MicroSD card started producing bad sectors.
2) Power failure during write.
3) Fraudulent MicroSD card being used.

The settings section of the database was written so few times that it could be recovered. It was the journaled events for analysis that were the ones to be lost. Most of the time it was a page or two that was corrupted, and the database worked just fine until detailed reports were created.

Settings were also mirrored to a backup location. Unless the client wanted to pay more for history backup, it was only retained locally.

dang a day ago

Recent and related:

SQLite (with WAL) doesn't do `fsync` on each commit under default settings - https://news.ycombinator.com/item?id=45005071 - Aug 2025 (90 comments)

with a relevant comment by the creator of SQLite here: https://news.ycombinator.com/item?id=45014296

(via https://news.ycombinator.com/item?id=45068594 - thanks int_19h!)

sethev a day ago

The documentation seems pretty clear to me - it describes specifically what each option controls and the implications of using it. Setting aside the debate over whether the default behavior should be described as durable or not, this post's author seems to understand exactly what each option actually does.

Perhaps what's unclear is when to select which option?

avinassh a day ago

I wrote the first article, and I thought the documentation was clear, but then I saw a comment by Hipp which confused me:

> If you switch to WAL mode, the default behavior is that transactions are durable across application crashes (or SIGKILL or similar) but are not necessarily durable across OS crashes or power failures. Transactions are atomic across OS crashes and power failures. But if you commit a transaction in WAL mode and take a power loss shortly thereafter, the transaction might be rolled back after power is restored.

https://news.ycombinator.com/item?id=45014296

The documentation is in contradiction with this.

agwa a day ago

I found the documentation much harder to parse than the equivalent PostgreSQL docs (https://www.postgresql.org/docs/current/wal-async-commit.htm...).

Also, even if I've understood the docs correctly, a number of people in this thread and elsewhere have come to a different interpretation. I think that's much less likely to happen with the PostgreSQL docs.

And I'm sure you can understand why I began to doubt my own interpretation of the docs when SQLite's creator posted a comment saying the exact opposite of what I thought the docs said!

sethev a day ago

Fair! Fwiw, I enjoyed the post - hopefully my comment wasn't harsh.

I think it does come down to the definition of durable. The default in SQLite is that the data for a transaction will be written all the way to disk (with an fsync) and the rollback journal will be deleted (but without an fsync).

In this model, there is a chance of losing only the last transaction, and only if the whole system crashes (like a power failure) after deleting the journal but before your file system makes that durable. If your application crashes, you're still fine.

That's significantly more durable than an asynchronous commit in Postgres, though. In an asynchronous commit, the transaction will complete before any data is written to disk at all - the data is only in memory, and many transactions can queue up in memory during the wal_writer_delay. All of those will definitely be lost if Postgres stops uncleanly for any reason (for example a crash, oomkill, or power failure).

NewsaHackO 2 days ago

> By default, SQLite is not durable, because the default value of journal_mode is DELETE, and the default value of synchronous is FULL, which doesn't provide durability in DELETE mode.

From the documentation, it seems like synchronous being FULL does provide durability of the database in DELETE mode, as FULL means it calls fsync after the transaction is completed. I think you may be confusing durability of the journal file with durability of the database. I don't think WAL can ever really have a durable transaction; it is essentially based on leaving a transaction open until it gets "check-pointed" or actually committed to the database file.
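For anyone who wants to verify their own build's defaults rather than trust a thread, a minimal sketch using Python's stdlib sqlite3 module (the exact defaults can vary with compile-time options such as SQLITE_DEFAULT_SYNCHRONOUS, so this checks rather than assumes):

```python
import os
import sqlite3
import tempfile

# Open a brand-new database and ask what settings it actually got.
path = os.path.join(tempfile.mkdtemp(), "fresh.db")
conn = sqlite3.connect(path)

journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
# synchronous levels: 0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA
synchronous = conn.execute("PRAGMA synchronous").fetchone()[0]

print(journal_mode, synchronous)  # typically: delete 2
conn.close()
```

On stock builds this prints `delete 2`, i.e. journal_mode=DELETE with synchronous=FULL, the combination the article is discussing.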

mlyle 2 days ago

> I don't think WAL can ever really have a durable transaction; it is essentially based on leaving a transaction open until it gets "check-pointed" or actually committed to the database file.

In general: WAL means you write the transaction to the WAL, fsync (in SQLite, this depends upon the sync mode), and then report to the application that it's done. The transaction is then durable: even if the database crashes, the contents of the WAL will be applied to the database file.

Checkpointing later just lets you throw away that part of the WAL and not have to replay as much to the database file.
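A minimal sketch of that lifecycle, using Python's stdlib sqlite3 module (the `-wal` file name and the checkpoint-on-close behavior are standard SQLite defaults, though a second open connection can prevent the final checkpoint):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wal.db")
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=FULL")   # fsync the WAL file on every commit
conn.execute("CREATE TABLE t (x INT)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()

# The committed pages live in the -wal file until a checkpoint copies them
# into the main database file; the commit is durable either way.
assert os.path.exists(path + "-wal")

# Closing the last connection checkpoints and removes the WAL file.
conn.close()
```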

agwa 2 days ago

My understanding of DELETE mode is that the transaction is not committed until the rollback journal file is deleted - if the rollback journal is present when sqlite opens a database, it applies the rollback journal to undo the changes that the transaction made. See https://www.sqlite.org/atomiccommit.html#1_deleting_the_roll...

If the directory containing the rollback journal is not fsynced after the journal file is deleted, then the journal file might rematerialize after a power failure, causing sqlite to roll back a committed transaction. And fsyncing the directory doesn't seem to happen unless you set synchronous to EXTRA, per the docs cited in the blog post.

kbaker a day ago

> If the directory containing the rollback journal is not fsynced after the journal file is deleted, then the journal file might rematerialize after a power failure, causing sqlite to roll back a committed transaction. And fsyncing the directory doesn't seem to happen unless you set synchronous to EXTRA, per the docs cited in the blog post.

I think this is the part that is confusing.

The fsyncing of the directory is supposed to be done by the filesystem/OS itself, not the application.

From man fsync,

    As well as flushing the file data, fsync() also flushes the metadata information associated with the file (see inode(7)).
So from sqlite's perspective on DELETE it is either: before the fsync call, and not committed, or after the fsync call, and committed (or partially written somehow and needing rollback.)

Unfortunately it seems like this has traditionally been broken on many systems, requiring workarounds, like SYNCHRONOUS = EXTRA.

agwa a day ago

No, the metadata is information like the modification time and permissions, not the directory entry.

The next paragraph in the man page explains this:

> Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

https://man7.org/linux/man-pages/man2/fsync.2.html

Edit to add: I don't think there's a single Unix-like OS on which fsync would also fsync the directory, since a file can appear in an arbitrary number of directories, and the kernel doesn't know all the directories in which an open file appears.

This is a moot point anyways, because in DELETE mode, the operation that needs to be durably persisted is the unlinking of the journal file - what would you fsync for that besides the directory itself?
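To illustrate the distinction, here is a minimal sketch (Linux/POSIX assumption: a directory can be opened read-only and fsynced; the journal file name is hypothetical):

```python
import os
import tempfile

dirpath = tempfile.mkdtemp()
journal = os.path.join(dirpath, "testdb-journal")
open(journal, "wb").close()

# Deleting the journal changes the *directory*, not the file's own data or
# metadata, so durably persisting the unlink means fsyncing the directory.
os.unlink(journal)

dfd = os.open(dirpath, os.O_RDONLY)   # directories can be opened read-only
os.fsync(dfd)                          # persist the directory entry removal
os.close(dfd)
```

Without that directory fsync, the unlink sits only in the in-memory directory cache and can be lost on power failure, which is exactly how the journal file could "rematerialize".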

kbaker a day ago

OK, interesting, I think I see... So you are asking about if SQLite opens and finds a not-committed rollback journal that looks valid, then it rolls it back?

I was more curious so I looked at the code here:

https://sqlite.org/src/file?name=src/pager.c&ci=trunk

and found something similar to what you are asking in this comment before `sqlite3PagerCommitPhaseTwo`:

    ** When this function is called, the database file has been completely
    ** updated to reflect the changes made by the current transaction and
    ** synced to disk. The journal file still exists in the file-system
    ** though, and if a failure occurs at this point it will eventually
    ** be used as a hot-journal and the current transaction rolled back.
So, it does this:

    ** This function finalizes the journal file, either by deleting,
    ** truncating or partially zeroing it, so that it cannot be used
    ** for hot-journal rollback. Once this is done the transaction is
    ** irrevocably committed.
Assuming fsync works on both the main database and the hot journal, then I don't see a way that it is not durable? Because, it has to write and sync the full hot journal, then write to the main database, then zero out the hot journal, sync that, and only then does it atomically return from the commit? (assuming FULL and DELETE)

agwa 20 hours ago

> OK, interesting, I think I see... So you are asking about if SQLite opens and finds a not-committed rollback journal that looks valid, then it rolls it back?

Right.

> Assuming fsync works on both the main database and the hot journal, then I don't see a way that it is not durable? Because, it has to write and sync the full hot journal, then write to the main database, then zero out the hot journal, sync that, and only then does it atomically return from the commit? (assuming FULL and DELETE)

DELETE mode doesn't truncate or zero the journal; it merely deletes it from the directory. You need to switch to TRUNCATE or PERSIST for that behavior: https://sqlite.org/pragma.html#pragma_journal_mode

I confirmed all of this by attaching gdb to SQLite and setting a breakpoint on unlink. At the time of unlink the journal is still "hot" and usable for rollback: https://news.ycombinator.com/item?id=45069533

kbaker 10 hours ago

Yeah, I see the comments down below in pager.c which explain it a bit better. I guess I thought its behavior was more like PERSIST by default.

    **   journalMode==DELETE
    **     The journal file is closed and deleted using sqlite3OsDelete().
    **
    **     If the pager is running in exclusive mode, this method of finalizing
    **     the journal file is never used. Instead, if the journalMode is
    **     DELETE and the pager is in exclusive mode, the method described under
    **     journalMode==PERSIST is used instead.
So I guess this is one of the tradeoffs SQLite makes between extreme durability (with PERSIST, TRUNCATE, or using EXTRA) and speed.

I know we have used SQLite quite a bit without getting into this exact scenario - or at least the transaction that would have been rolled back wasn't important enough in our case. When the device powered down milliseconds later and triggered this case, we never noticed (or never needed that transaction); we only needed the DB to remain consistent.

And if DELETE is the default and this doesn't come up often enough in practice to justify changing to a safer default (like PERSIST), I guess it is better to have the speed gains from skipping that extra write + fsync.

But now I guess you know enough to submit better documentation upstream for a sliding scale of durability. I definitely agree that the docs could be better about how to get ultimate durability, and how to tune the knobs for `journal_mode` and `synchronous`.

NewsaHackO 2 days ago

>if the rollback journal is present when sqlite opens a database, it applies the rollback journal to undo the changes that the transaction made. See https://www.sqlite.org/atomiccommit.html#1_deleting_the_roll...

I don't follow. How would fsyncing the rollback journal affect the durability of the actual database? Do you actually think that the database would reapply an already committed journal whose ID in the header already indicates that the transaction was committed, when the database is already consistent? I really think you should re-review the definition of durability of a database, especially before saying the creator of SQLite is incorrect about its implementation.

int_19h 2 days ago

It's specifically about fsyncing journal deletion. The problem isn't that it would reapply it if it was already used to rollback. Rather, the problem is that if you commit, and that commit has succeeded (and so your app believes that it has written the data and might e.g. perform some other actions on it), the deletion of the now-unneeded journal might not be flushed to disk in event of power loss or similar. So when you start the app again, SQLite sees said rollback journal, and - since it would be considered "hot" - applies it, effectively reverting the transaction that was supposedly already committed.

FWIW I don't think it's wrong per se. The article links to a HN comment in which Richard Hipp explains why this is the default behavior, and it does make sense: https://news.ycombinator.com/item?id=45014296. At the same time, clearly, the definition of "durable" here could use some clarification.

agwa a day ago

Yes, that's exactly right.

Note that the comment by Richard Hipp is justifying why WAL mode is not durable by default. It's a completely reasonable explanation, and would be for DELETE mode too, yet his comment claims that DELETE mode is durable by default, which I can't reconcile with the docs.

NewsaHackO a day ago

>So when you start the app again, SQLite sees said rollback journal, and - since it would be considered "hot" - applies it, effectively reverting the transaction that was supposedly already committed.

Guys. The journal would not be a hot journal though, as hot-journal selection only applies if the database is in an inconsistent state. Otherwise, the database knows from the ID of the journal not to reapply an already-applied rollback journal. The process you are talking about ONLY happens when the database is in a corrupted state, and it has to try to find a file to help recover the database.

agwa a day ago

OK, I just tested it:

In terminal 1, I created a database and added a table to it:

  $ sqlite3 testdb
  sqlite> create table test (col int);
In terminal 2, I attached gdb to sqlite3 and set a breakpoint on unlink:

  $ gdb sqlite3 `pidof sqlite3`
  (gdb) b unlink
  (gdb) c
Back in terminal 1, I inserted data into the table:

  sqlite> insert into test values(123);
In terminal 3, I saved a copy of testdb-journal:

  $ cp testdb-journal testdb-journal.save
Then in terminal 2, I resumed executing sqlite3:

  (gdb) c
In terminal 1, the INSERT completed without error.

Back in terminal 3, I sent SIGKILL to sqlite3, simulating a power failure:

  $ killall -9 sqlite3
I then restored testdb-journal, simulating what could happen after a power failure when the parent directory is not fsynced:

  $ mv testdb-journal.save testdb-journal
I then opened testdb again and ran `SELECT * FROM test` and it returned zero rows.

This proves int_19h and I are right - if the journal file comes back, SQLite will apply it and roll back a committed transaction.

I then confirmed with strace that, as the documentation says, the directory is only fsynced after unlink when synchronous=EXTRA. It doesn't happen with synchronous=FULL. So you need synchronous=EXTRA to get durability in DELETE mode.
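For anyone who wants the durable configuration this implies, a minimal sketch with Python's stdlib sqlite3 module (assumes SQLite >= 3.16, which added the EXTRA level):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "durable.db")
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=DELETE")
# EXTRA adds a directory fsync after the journal file is unlinked, so the
# deletion (i.e. the commit) survives a power failure.
conn.execute("PRAGMA synchronous=EXTRA")

level = conn.execute("PRAGMA synchronous").fetchone()[0]
print(level)  # 3 = EXTRA
conn.close()
```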

agwa a day ago

The docs list 5 conditions that all must be satisfied for the journal to be considered hot: https://www.sqlite.org/atomiccommit.html#_hot_rollback_journ...

I believe they would all be satisfied.

I don't see any mention of checking IDs. Not saying you're wrong - I think the docs could very well be wrong - but could you provide a citation for that behavior?

NewsaHackO a day ago

Please read the entire document instead of just picking out sections; you will then be able to see where your misconceptions are occurring. You have attempted to make this same point three times, so I will say it for a third time: that section is about CORRUPTED databases, not databases that are consistent after fsync.

agwa a day ago

I read the whole document. It doesn't mention IDs anywhere.

If you're not going to provide citations for your claims, yet criticize me for "picking out sections" when I provide citations, then continuing this conversation won't be productive.

aktiur a day ago

The text in this document also directly contradicts what you're saying. Put another way: the presence of a hot journal is how SQLite determines the database might be corrupted.

https://sqlite.org/lockingv3.html#hot_journals

int_19h a day ago

> Otherwise, the database knows from the ID of the journal not to reapply an already applied rollback journal.

But it's not "already applied", that's the whole point. The transaction was committed, not rolled back, so the changes in transaction were persisted to disk and the journal was just thrown away. If it magically reappears again, how is SQLite supposed to know that it needs to be discarded again rather than applied to revert the change?

agwa a day ago

I'm pretty sure I understand what durability means; the definition is not hard - https://en.wikipedia.org/wiki/Durability_(database_systems)

It's possible I've misunderstood how DELETE mode works. But here's the thing - I shouldn't have to understand how DELETE mode works to know what SQLite setting I need to use to get durability. Unfortunately, the SQLite docs don't clearly say what guarantees each setting provides - instead they talk about what SQLite does when you choose the setting, leaving the reader to try to figure out if those actions provide durability. And the docs really make it seem like you need synchronous=EXTRA in DELETE mode to get durability, for the reasons explained above.

This is a docs problem; I'm not saying SQLite is buggy.

NewsaHackO a day ago

This may just be an expectations difference then. I would fully expect a developer to read the docs and know how a setting works to know what guarantees it has.

avinassh a day ago

> I don't think WAL can ever really have a durable transaction; it is essentially based on leaving a transaction open until it gets "check-pointed" or actually committed to the database file

why not? if you use synchronous=FULL, then WAL does provide durable transactions, no?

layer8 a day ago

Durability also requires the file system implementation and the disk to do the right thing on fsync, which, if I recall past discussions correctly, isn’t a given.

ericbarrett a day ago

There are some older fsync() bugs (as famously explored by Postgres developers: https://wiki.postgresql.org/wiki/Fsync_Errors ) but I'm not aware of any modern mainstream kernel where this is broken. If I'm wrong, please tell me!

An application that really wants confidence in a write—to the extent that the underlying device and drivers allow—should use O_DIRECT. Or maybe there is a modern equivalent with io_uring. But that is not easy engineering :)

vlovich123 a day ago

O_DIRECT in no way absolves you from needing to call fsync, because fsync ALSO sends a signal to the storage device to flush its buffer if it has anything pending, which is important for durability.

What OP is referring to is that some drives ignore that signal for performance reasons and there’s nothing SW can do to solve that part.

io_uring in no way changes the rules here.
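A minimal sketch of why the two are orthogonal (Linux assumption: os.O_DIRECT exists and the filesystem supports it; some, like tmpfs, reject it, hence the fallback). O_DIRECT only bypasses the kernel page cache; fsync() is still what asks the device to flush its own volatile cache.

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "direct.bin")
flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_DIRECT", 0)
try:
    fd = os.open(path, flags, 0o600)
except OSError:
    # Some filesystems (e.g. tmpfs) reject O_DIRECT at open time.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)

# O_DIRECT requires aligned buffers/offsets/lengths; anonymous mmap memory
# is page-aligned, which satisfies the usual 512B/4K alignment rules.
buf = mmap.mmap(-1, 4096)
buf.write(b"x" * 4096)
os.write(fd, buf)

# Bypassing the page cache says nothing about the drive's volatile write
# cache; fsync() (or an equivalent flush/FUA) is still needed for that.
os.fsync(fd)
os.close(fd)
```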

jitl a day ago

macOS is a popular OS that has a fast and loose relationship to that syscall without F_FULLFSYNC

codys a day ago

Nothing prevents using O_DIRECT as an open-flag for a fd used in other io_uring operations.

But I'm not sure I'd necessarily think of O_DIRECT as a way of improving "confidence in a write". It's a way to get a specific behavior.

geertj a day ago

Technically I think you need O_SYNC. O_DIRECT does pretty much the same but its intention is different.

eatonphil a day ago

Take a look at Alex Miller's diagrams for what function calls are actually doing on various systems.

https://transactional.blog/how-to-learn/disk-io

diekhans a day ago

It seems like a bug report on what is not clear in the documentation would be highly useful.

tiffanyh 2 days ago

SQLite is an incredible piece of software, and its commitment to backward compatibility is deeply admirable. But that same promise has also become a limitation.

v3.0 was first released in 2004—over 20 years ago—and the industry has changed dramatically since then.

I can’t help but wish for a “v4.0” release: one that deliberately breaks backward compatibility and outdated defaults, in order to offer a cleaner, more modern foundation.

Note: I'm not asking for new functionality per se. But just a version of SQLite that defaulted to how it should be used, deployed in 2025.

umpalumpaaa 2 days ago

There was an attempt/experiment to develop SQLite 4: https://sqlite.org/src4/doc/trunk/www/index.wiki

sgbeal a day ago

The focus of that experiment was to find out if LSM-based storage[^1] would prove to be faster. It turned out that LSM, for SQLite's workloads, did not provide enough benefit to justify the upheaval.

[^1]: https://en.wikipedia.org/wiki/Log-structured_merge-tree

nikisweeting a day ago

https://turso.tech I think attracts a lot of the people trying to add new features / improve SQLite's rough edges

raggi a day ago

fsync is rarely truly durable in the sense the article describes. it does help with loss ordering in a lot of cases, and flushes often do some work, but true durability - the idea that after fsync there will not be any rollback or tail loss unless there's catastrophic failure - sorry, nope. everyone in the chain ends up being the hotspot in the hot path, then they break it because y'all are addicted to spamming syncs, then they move it to a new api, then slowly the syncs come back on the new api, then everyone moves again, layer by layer.

somewhat common examples in recent years are nvme vendors who implement nvme flush as just pushing the write cache down, but won't always also finalize and flush in-flight or scheduled ftl operations, due to the insane worst-case latency costs involved. wasn't apple also caught doing the same in recent years, in part because their ftl shared memory, bus and privilege with the higher exception levels?

there are also rumors that enterprise drives are better here, but not honoring flush is even a saleable product in those environments: https://www.atpinc.com/technology/ssd-flush-cache-technology and some other vendors just have arbitrary firmware patches to compete (that is: you could buy "enterprise grade" hardware second hand and be entirely unaware of the actual command behavior).

topspin 2 days ago

[flagged]

3eb7988a1663 2 days ago

It is the weekend. Most people use this site to avoid doing work.

Edit: whoops, it is Friday! Gave myself a long weekend and was just assuming by default that it's Saturday.

lyjackal 2 days ago

it's Friday, and I'm avoiding doing work

3eb7988a1663 2 days ago

Ha! Whelp, it is my weekend.

meindnoch 2 days ago

[flagged]

anotherhue 2 days ago

I'm laughing but mostly crying.

Reliability is a dirty word, because it almost always comes at the cost of 'growth'.

d1l 2 days ago

This is disingenuous and probably was written this way for HN cred and clicks. SQLite's test suite simulates just about every kind of failure you can imagine - this document is worth reading if you have any doubts: https://www.sqlite.org/atomiccommit.html

eatonphil a day ago

> SQLite's test suite simulates just about every kind of failure you can imagine

The page you link even mentions scenarios they know about that do happen and that they still assume won't happen. So even sqlite doesn't make anywhere near as strong a claim as you make.

> SQLite assumes that the operating system will buffer writes and that a write request will return before data has actually been stored in the mass storage device. SQLite further assumes that write operations will be reordered by the operating system. For this reason, SQLite does a "flush" or "fsync" operation at key points. SQLite assumes that the flush or fsync will not return until all pending write operations for the file that is being flushed have completed. We are told that the flush and fsync primitives are broken on some versions of Windows and Linux. This is unfortunate. It opens SQLite up to the possibility of database corruption following a power loss in the middle of a commit. However, there is nothing that SQLite can do to test for or remedy the situation. SQLite assumes that the operating system that it is running on works as advertised. If that is not quite the case, well then hopefully you will not lose power too often.

chasil a day ago

There was a time when Oracle databases used raw disk partitions to minimize the influence of the OS on what happens between memory and storage. It was more for multiple instances looking at the same SCSI device (Oracle Parallel Server).

I don't think that is often done now.

liuliu a day ago

> So even sqlite doesn't make anywhere near as strong a claim as you make.

And? If you write to a disk and later that disk goes missing, you don't have durability. SQLite cannot automatically commit your writes to a satellite for durability against a species-ending event on Earth, and hence its "durability" has limits, exactly as spelled out by them.

eatonphil 21 hours ago

You're arguing a strawman, and I pointed at a specific example. Sticking with my specific example: they could probe for this behavior or this OS version and crash immediately, telling the user to update their OS. Instead it seems they acknowledge this issue exists and hope it doesn't happen. Which, hey, everybody does - but that's not the claim OP was making.

grebc 3 hours ago

It’s not really a library's job to cover all the bases you’re suggesting. They outline the failure scenarios fairly well, and users are expected to take note.

agwa 2 days ago

That document addresses atomicity, not durability, and is thus non-responsive to my concerns.

bawolff 2 days ago

I can't help but feel that the difference to other DBs is that they just don't have these knobs or tell you at all.

agwa 2 days ago

PostgreSQL has the knobs and I find the documentation about them very clear: https://www.postgresql.org/docs/current/wal-async-commit.htm...

cenamus 2 days ago

So this article asks exactly the same thing as the reply to Dr. Hipp's comment, just in 1000 words instead of 10? Whether the docs are out of sync?

topspin 2 days ago

> Whether the docs are out of sync?

Were this a one off, you would have a point. It isn't, however. My experience over many years has been that you can't ever be certain about what is actually going on, based on the documentation alone, and that you wind up in Reddit and Stack Overflow and a plethora of blog posts attempting to figure it out. With LLMs, we have only more sources of contradictory and chronically obsolescent input.

There is an actual problem here. However I can see that, based on the contributions from the SQLite downmod mafia, this talk isn't welcome, so I'm off to some other thing. Have a nice weekend, I suppose.

mtmail 2 days ago

> the SQLite downmod mafia

Oh, come on. There's no open or secret attempt at censoring talk about sqlite on HN. (The story is #11 on the frontpage the minute the comment was made.)
