> Pillai et al., OSDI’14 looked at a bunch of software that writes to files, including things we'd hope write to files safely, like databases and version control systems: Leveldb, LMDB, GDBM, HSQLDB, Sqlite, PostgreSQL, Git, Mercurial, HDFS, Zookeeper. They then wrote a static analysis tool that can find incorrect usage of the file API, things like incorrectly assuming that operations that aren't atomic are actually atomic, incorrectly assuming that operations that can be re-ordered will execute in program order, etc.
> When they did this, they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug. This isn't a knock on the developers of this software or the software -- the programmers who work on things like Leveldb, LMDB, etc., know more about filesystems than the vast majority of programmers, and the software has more rigorous tests than most software. But they still can't use files safely every time! A natural follow-up to this is the question: why is the file API so hard to use that even experts make mistakes?
> why is the file API so hard to use that even experts make mistakes?
I think the short answer is that the APIs are bad. The POSIX fs APIs and associated semantics are so deeply entrenched in the software ecosystem (both at the OS level, and at the application level) that it's hard to move away from them.
I take a different view on this. IMO the tricks that existing file systems play to get more performance (specifically around ordering and atomicity) make it extra hard for developers to reason about. Obviously, you can't do anything about fsync dropping error codes, but some of these failure modes just aren't possible over file systems like NFS due to protocol semantics.
Not only that, but the POSIX file API also assumes that NFS is a thing, and NFS breaks half the important guarantees of a file system. I don’t know if it’s a baby-and-bathwater situation, but NFS just seems like a whole bunch of problems. It’s like having eval in a programming language.
Lack of inotify support is one that has annoyed me in the past. It not only breaks some desktop software, but it also should be possible for NFS to support (after all, the server sees the changes and could notify clients).
Thanks for this, it's helpful. Totally hear you about O_APPEND and read() returning -EACCES. The other ones, I agree, should be fixed in later versions of the Linux kernel/NFS client.
Just ran into this one recently trying to replace Docker with Podman for a CI/CD runner. Before anyone protests: we have very strong, abnormal requirements on my project that prevent most saner architectures. It wasn’t the root cause, but the failure behavior was weird due to the behavior you just described.
Not really; there have been lots of APIs that have improved on the POSIX model.
The kind of model I prefer is something based on atomicity. Most applications can get by with file-level atomicity--make whole file read/writes atomic with a copy-on-write model, and you can eliminate whole classes of filesystem bugs pretty quickly. (Note that something like writeFileAtomic is already a common primitive in many high-level filesystem APIs, and it's something that's already easily buildable with regular POSIX APIs). For cases like logging, you can extend the model slightly with atomic appends, where the only kind of write allowed is to atomically append a chunk of data to the file (so readers can only possibly either see no new data or the entire chunk of data at once).
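For the whole-file case that's just the classic write-to-a-temp-file-and-rename dance; a minimal sketch (needs <fcntl.h> and <unistd.h>; error handling and short-write loops omitted; the path names and buf/len are made up):

// build the new version in a temp file on the same filesystem
int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
write(fd, buf, len);   // real code must loop on short writes and check errors
fsync(fd);             // make the new contents durable before publishing them
close(fd);
// atomically publish: readers see the old file or the new one, never a mix
rename("config.tmp", "config");
// fsync the containing directory if the rename itself must survive a crash
int dfd = open(".", O_RDONLY | O_DIRECTORY);
fsync(dfd);
close(dfd);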
I'm less knowledgeable about the way DBs interact with the filesystem, but there the solution is probably ditching the concept of the file stream entirely and just treating files as a sparse map of offsets to blocks, which can be atomically updated. (My understanding is that DBs basically do this already, except that "atomically updated" is difficult with the current APIs).
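(Roughly: a database can already overwrite an individual page in place with pwrite(2), as in the fragment below, with fd/page_buf/PAGE_SIZE being illustrative names. What it can't do is commit a *set* of such page writes atomically, which is why it layers a journal or WAL on top.)

// overwrite one fixed-size page at its offset; fine for a single page,
// but there is no primitive to make a group of these updates atomic
pwrite(fd, page_buf, PAGE_SIZE, (off_t)page_no * PAGE_SIZE);
fsync(fd);   // durability, still no atomicity across pages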
> Most applications can get by with file-level atomicity--make whole file read/writes atomic with a copy-on-write model, and you can eliminate whole classes of filesystem bugs pretty quickly.
// O_SYNC_ON_CLOSE is hypothetical -- it's the flag being wished for here
int fd = open(".config", O_RDWR | O_CREAT | O_SYNC_ON_CLOSE, 0666);
// effects of calls to write(2)/etc. are invisible through any other file description
// until close(2) has been called on every descriptor referring to this file description.
write(fd, buf, len);
close(fd);
So now you can watch for e.g. either IN_MODIFY or IN_CLOSE_WRITE (and you don't need to balance it with IN_OPEN), it doesn't matter, you'll never see partial updates... would be nice!
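Until a flag like that exists, the closest approximation I know of is the temp-file-plus-rename trick on the writer side and watching for the rename on the reader side (the watched directory here is made up):

#include <sys/inotify.h>
int ifd = inotify_init1(IN_CLOEXEC);
// IN_MOVED_TO: a temp file was rename(2)d over the target (atomic replace);
// IN_CLOSE_WRITE: a writer finished and closed the file
inotify_add_watch(ifd, "/etc/myapp", IN_CLOSE_WRITE | IN_MOVED_TO);
char evbuf[4096];
ssize_t n = read(ifd, evbuf, sizeof evbuf);  // blocks until an event arrives
// walk the struct inotify_event records in evbuf[0..n)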
It's not hard to design a less bug-prone API that would enable you to do everything the POSIX file API permits and admits equally-high-performance implementations. But making that new API a replacement for the POSIX API would require rewriting essentially all of the software that somebody cares about to use your new, better API instead of the POSIX API. This is probably only feasible in practice for small embedded systems with a fairly small universe of software.
You could do a phased transition, where both the legacy posix api and the new api are available. This has already happened with a lot of the old C standard library. Old, unsafe functions like strcpy were gradually replaced by safer alternatives like strncpy.
Database developers don’t want the complexity or poor performance of posix. It’s wild to me that we still don’t have any alternative to fsync in Linux that can act as a barrier without also flushing caches at the same time.
There are two serious factual errors in your comment:
- This has not already happened with a lot of the old C standard library. The only function that has ever been removed from the C standard library, to my knowledge, is gets(). In particular, strcpy() has not been removed. Current popular compilers still support gets() with the right options, so it hasn't been removed from the actual library, just the standard.
- strncpy() is not a suitable replacement for strcpy(), certainly not a safer one. It can produce strings missing the terminating null, and it can be slower by orders of magnitude. This has been true since it was introduced in the 01970s. Nearly every call to strncpy() is a bug, and in many cases an exploitable security hole. You are propagating dangerous misinformation. (This is a sign of how difficult it is to make these transitions.)
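To make the missing-terminator hazard concrete:

char dst[4];
strncpy(dst, "abcdef", sizeof dst); // dst now holds 'a','b','c','d' and no terminating NUL
printf("%s\n", dst);                // reads past the end of dst: undefined behavior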
You also seem to imply that Linux cannot add system calls that are not specified in POSIX, but of course it can and does; openat() and the other 12 related functions, epoll_*(), io_uring_*(), futex_*(), kexec_load(), add_key(), and many others are Linux-specific. The reason barrier() hasn't been added is evidently that the kernel developers haven't been convinced it's worthwhile in the 15+ years since it was proposed, not that POSIX ties their hands.
The nearest equivalents in C for the kind of "staged transition" you are proposing might be things like the 16-bit near/far/huge qualifiers and the Win16 and pre-X MacOS programming models. In each of these cases, a large body of pre-existing software was essentially abandoned and replaced by newly written software.
Yeah, I understand that those methods are still available. But their use is heavily discouraged in new software and a lot of validators & sanitisers will warn if your programs use them. Software itself has largely slowly moved to using the newer, safer methods even though the old methods were never taken away.
I don’t understand the reticence of kernel developers to implement a barrier syscall. I know they could do it. And as this article points out, it would dramatically improve database performance for databases which make use of it. Why hasn’t it happened?
Another commenter says NVMe doesn’t support it natively but I bet hardware vendors would add hardware support if Linux supported it and adding barrier support to their hardware would measurably improve the performance of their devices.
Sure, adding that functionality to NVMe would be easy; there are sufficient provisions for adding such support. For example, a global flag whose support is advertised by the device and which can then be turned on by the host, causing the very same normal flush opcode to also guarantee a pipelined write barrier (while retaining the flush-write-back-caches-before-reporting-completion semantics of that submitted IO operation).
The reason it hasn't been supported yet, by the way, is that they explicitly wanted to allow fully parallel processing of commands in a queue, at least for submissions that exist concurrently in the command queue. In practice I don't see why this would have to be enforced to such an extent, as the only reason for out-of-order processing I can think of is that the auxiliary data of a command is physically located in host memory and the DMA reads from the NVMe controller across PCIe to host memory happen to complete out of order for host DRAM controller/pattern reasons.
Thus it might be something you'd not want to turn on without using a controller memory buffer (where you can mmap some of the DRAM on the NVMe device into host memory, write your full-detail commands directly to it across PCIe, and keep the NVMe controller from having to first send a read request across PCIe in response to you ringing its doorbell: instead it can directly read from its local DRAM when you ring the doorbell).
It sounds like you aren't very familiar with C; for example, C doesn't have any methods at all, neither old methods nor newer methods. There's no such thing as a method in C. This would explain your mistakes in, for example, thinking that strncpy() was safer, or that strcpy() had been removed from the C standard.
Unless you mean memcpy(), there is in fact no safer alternative function in the C standard for strcpy(); software has not largely moved to not using strcpy() (plenty of new C code uses it); and most validators and sanitizers do not emit warnings for strcpy(). There is a more extensive explanation of this at https://software.codidact.com/posts/281518. GCC has warnings for some uses of strcpy(), but only those that can be statically guaranteed to be incorrect: https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html
Newer, safer alternatives to strcpy() include strlcpy() and strscpy() (see https://lwn.net/Articles/659818/), neither of which is in Standard C yet. Presumably OpenBSD has some sort of validator that recommends replacing strcpy() with strlcpy(), which is licensed such that you can bundle it with your program. Visual C++ will invite you to replace your strcpy() calls with the nonstandard Microsoft extension strcpy_s(), thus making your code nonportable and, as it happens, also buggy. An incompatible version of strcpy_s() has been added as an optional annex to the C11 standard. https://nullprogram.com/blog/2021/07/30/ gives extensive details, summarized as "there are no correct or practical implementations". The Linux kernel's checkpatch.pl will invite you to replace calls to strcpy() with calls to the nonstandard Linux/BSD extension strscpy(), but it's a kernel-specific linter.
So there are not literally zero validators and sanitizers that will warn on all uses of strcpy() in C, but most of them don't.
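For reference, the strlcpy() contract those linters push you toward is small enough to sketch (a stand-in with the same semantics, not OpenBSD's code):

#include <string.h>

// copy at most size-1 bytes, always NUL-terminate (when size > 0), and
// return strlen(src) so the caller can detect truncation (result >= size)
size_t my_strlcpy(char *dst, const char *src, size_t size) {
    size_t srclen = strlen(src);
    if (size > 0) {
        size_t n = srclen < size - 1 ? srclen : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;
}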
— ⁂ —
I don't know enough about the barrier()/osync() proposal to know why it hasn't been adopted, and obviously neither do you, since you can't know anything significant about Linux kernel internals if you think that C has methods or that strncpy() is a safer alternative to strcpy().
But I can speculate! I think we can exclude the following possibilities:
- That the paper, which I haven't read much of, just went unnoticed and nobody thought of the barrier() idea again. Luu points out that it's a sort of obvious idea for kernel developers; Chidambaram et al. ("Optimistic Crash Consistency") weren't even the first ones to propose it (and it wasn't even the main topic of their paper); and their paper has been cited in hundreds of other papers, largely in systems software research on SSDs: https://scholar.google.com/scholar?cites=1238063331053768604...
- That it's a good idea in theory, but implementing even a research prototype is too much work. Chidambaram et al.'s code is available at https://github.com/utsaslab/optfs, and it is of course GPLv2, so that work is already done for you. You can download a VM image from https://research.cs.wisc.edu/adsl/Software/optfs/ for testing.
- That authors of databases don't care about performance. The authors of SQLite, which is what Chidambaram et al. used in their paper, dedicate a lot of effort to continuously improving its performance: https://www.sqlite.org/cpu.html and it's also a major consideration for MariaDB and PostgreSQL.
- That there's an existing production-ready implementation that Linus is just rejecting because he's stubborn. If that were true, you'd see an active community around the OptFS patch, Red Hat applying it to their kernels (as they do with so many other non-Linus-accepted patches), etc.
- That it relies on asynchronous barrier support in the hardware interface, as the other commenter suggested. It doesn't.
So what does that leave?
Maybe the paper was wrong, which seems unlikely, or applicable only to niche cases. You should be able to build and run their benchmarks.
Maybe it was right at the time on spinning rust ("a Hitachi DeskStar 7K1000.B 1 TB drive") but wrong on SSDs, whose "seek time" is two to three orders of magnitude faster.
In particular, maybe it uses too much CPU.
Maybe it was right then and is still right but the interface has other drawbacks, for example being more bug-prone, which also seems unlikely, or undesirably constrains the architecture of other aspects of the kernel, such as the filesystem, in order to work well enough. (You could implement osync() as a filesystem-wide fsync() as a fallback, so this would just reduce the benefits, not increase the costs.)
Maybe it's obviously the right thing to do but nobody cares enough about it to step up and take responsibility for bringing the new system call up to Linus's standards and committing to maintain it over time.
If it was really a big win for database performance, you'd think one of the developers of MariaDB, PostgreSQL, or SQLite would have offered, or maybe one of the financial sponsors of the paper, which included Facebook and EMC. Luu doesn't say Twitter used the OptFS patch when he was on the Linux kernel team there; perhaps they used it secretly, but more likely they didn't find its advantages compelling enough to use.
Out of all these unlikely cases, my best guess is either "applicable only to niche cases", "wrong on SSDs", or "undesirably constrains filesystem implementation".
As a note on tone, some people may find it offputting when you speak so authoritatively about things you don't know anything about.
> C doesn't have any methods at all, neither old methods nor newer methods. There's no such thing as a method in C. This would explain your mistakes in, for example, thinking that strncpy() was safer, or that strcpy() had been removed from the C standard.
This is an overly pedantic, ungenerous interpretation of what I wrote.
First, fine - you can argue that C has functions, not methods. But eh.
Second, for all practical purposes, C on Linux does have a standard library. It’s just - as you mentioned - not quite the same on every platform. We wouldn’t be talking about strcpy if C had no standard library equivalent.
Third, thank you for the suggestion that there are even better examples than strcpy -> strncpy that I could have used to make my point more strongly. I should have chosen sprintf, gets or scanf.
I’ve been out of the game of writing C professionally for 15 years or so. I know a whole lot more about C than most. But memories fade with time. Thanks for the corrections. Likewise, no need to get snarky with them.
The kernel could implement a non-flushing barrier, even if the underlying device doesn't. You could even do it without any barrier support at all from the underlying device, as long as it reliably tells you when each request has completed; you just don't send it any requests from after the barrier until all the requests before the barrier have completed.
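The bookkeeping is simple enough to sketch (everything here is hypothetical pseudo-C, not real block-layer code; push/pop/device_submit are stand-ins):

struct ordered_queue {
    unsigned inflight;       // submitted to the device, not yet completed
    int barrier_pending;     // a barrier was issued and hasn't drained yet
    struct req_list held;    // writes queued up behind the barrier
};

void submit(struct ordered_queue *q, struct request *r) {
    if (q->barrier_pending) { push(&q->held, r); return; }  // hold back post-barrier writes
    q->inflight++;
    device_submit(r);                                       // no flush command involved
}

void barrier(struct ordered_queue *q) {
    if (q->inflight > 0) q->barrier_pending = 1;            // nothing to wait for otherwise
}

void on_completion(struct ordered_queue *q) {               // device completion callback
    if (--q->inflight == 0 && q->barrier_pending) {
        q->barrier_pending = 0;
        struct request *r;
        while ((r = pop(&q->held)) != NULL) submit(q, r);   // release the held writes
    }
}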
That would not work as you describe it. The device will return completion upon the writes reaching its cache. You need a flush to ensure that the data reaches stable storage.
You could probably abuse Force Unit Access to make it work by marking all IOs as Force Unit Access, but a number of buggy devices do not implement FUA properly, which defeats the purpose of using it. That would be why Microsoft disabled the NTFS feature that uses FUA on commodity hardware:
What you seem to want is FreeBSD’s UFS2 soft updates, which uses force unit access to avoid the need for flushes for metadata updates. It has the downside that it is unreliable on hardware that does not implement FUA properly. Also, UFS2 soft updates does not actually do anything to protect data when fsync(2) is called, if this mailing list email is accurate:
> Synchronous writes (or BIO_FLUSH) are needed to handle O_SYNC/fsync(2) properly, which UFS currently doesn't care about.
That said, avoiding flushes for an fsync(2) would require doing FUA on all IOs. Presumably, this is not done because it would make all requests take longer all the time, raising queue depths and causing things to have to wait for queue limits more often, killing performance. Raising the OS queue depth to compensate would not work since SATA has a maximum queue depth of 32, although it might work for NVMe where the maximum queue depth is 65536, if keeping track of an increased number of inflight IOs does not cause additional issues at the storage devices (such as IOs that never complete as long as the device is kept busy, because the device keeps reordering them to the end of the queue).
Using FUA only on metadata as is done in UFS2 soft updates improves performance by eliminating the need for journalling in all cases but the case of space usage, which still needs journalling (or fsck after power loss if you choose to forgo it).
Writes in the POSIX API can be atomic depending on the underlying filesystem. For example, small writes on ZFS through the POSIX API are atomic since they either happen in their entirety or they do not (during power failure), although if the writes are big enough (spanning many records), they are split into separate transactions and partial writes are then possible:
> make whole file read/writes atomic with a copy-on-write model,
I have many files that are several GB. Are you sure this is a good idea? What if my application only requires best effort?
> eliminate whole classes of filesystem bugs pretty quickly.
Block level deduplication is notoriously difficult.
> where the only kind of write allowed is to atomically append a chunk of data to the file
Which sounds good until you think about the complications involved in a block-oriented storage medium. You're stuck with read-modify-write (RMW) whether you think you're strictly appending or not.
It doesn’t have to be one or the other. Developers could decide by passing flags to open.
But even then, doing atomic writes of multi-gigabyte files doesn’t sound that hard to implement efficiently. Just write the data to disk first and update the metadata atomically at the end. Or whenever you choose to as a programmer.
The downside is that, when overwriting, you’ll need enough free space to store both the old and new versions of your data. But I think that’s usually a good trade off.
It would allow all sorts of useful programs to be written easily - like an atomic mode for apt, where packages either get installed or not installed. But they can’t be half installed.
That is not what an atomic write() function does and we are talking about APT, not databases.
If you want atomic updates with APT, you could look into doing prestaged updates on ZFS. It should be possible to retrofit it into APT. Have it update a clone of the filesystem and create a new boot environment after it is done. The boot environment either is created or is not created. Then reboot into the updated OS and you can promote the clone and delete the old boot environment afterward. OpenSolaris had this capability over a decade ago.
> Databases implemented atomic transactions in the 70s.
And they have deadlocks as a result, which there is no good easy solution to (generally we work around by having only one program access a given database at a time, and even that is not 100% reliable).
Eh. Deadlocks can be avoided if you don’t use SQL’s exact semantics. For example, FoundationDB uses MVCC such that if two conflicting write transactions are committed at the same time, one transaction succeeds and the other is told to retry.
It works great in practice, even with a lot of concurrent clients. (iCloud is all built on FoundationDB.)
Hold-and-wait locking is what causes deadlocks. I agree with you - that would be a bad way to implement filesystem transactions. But we have a lot of other options.
This is kind of an interesting thought; it more closely mirrors how Docker uses OverlayFS to track changes to the entire file system. No need for new file APIs.
> Developers could decide by passing flags to open.
Provided the underlying VFS has implemented them. They may not. Hence the point in the article that some developers only choose to support 'ext4' and nothing else.
> you’ll need enough free space to store both the old and new versions of your data.
The sacrifice is increased write wear on solid state devices.
> It would allow all sorts of useful programs to be written easily
Sure. As long as you don't need multiple processes to access the same file simultaneously. I think the article misses this point, too, in that every FS on a multi-user system is effectively a "distributed system." It's not distributed for _redundancy_, but that doesn't eliminate the attendant challenges.
They say ecryptfs is only supported when it is backed by ext4, which is a bit strange. I wonder if that is documented just so they can close support cases when ecryptfs is used on top of a filesystem that is missing extended attribute support, and whether their code actually checks what is below ecryptfs at all. Usually the application above would not know what is below ecryptfs, so they would need to go out of their way to check it in order to enforce that. I do not use Dropbox, so someone else would have to test whether they actually enforce it, if curious enough.
Yes, a feature like this would need cooperation with the filesystem. But that’s just an implementation problem. That’s like saying we can’t add flexbox to browsers because all the browsers would need to add it. So?
As for wear on SSDs, I don’t think it would increase wear. You’re writing the same number of sectors to the drive. A 2 GB write would still write 2 GB (plus negligible metadata overhead). Why would the drive wear out faster under this scheme?
And I think it would work way better with multiple processes than the existing system. Right now the semantics when multiple processes edit the same file at once are somewhat undefined. With this approach, files would have database like semantics where any reader would either see the state before a write or the state after. It’s much cleaner - since it would become impossible for skewed reads or writes to corrupt a shared file.
Would you argue against the existence of database transactions? Of course not. Nobody does. They’re a great idea, and they’re way easier to reason about and use correctly compared to the POSIX filesystem api. I’m saying we should have the same integrity guarantees on the filesystem. I think if we had those guarantees already, you’d agree too.
Some of the problems transcend POSIX. Someone I know maintains a non-relational db on IBM mainframes. When diving into a data issue, he was gob-smacked to find out that sync'd writes did not necessarily make it to the disk. They were cached in the drive memory and (I think) the disk controller memory. If all failed, data was lost.
This is precisely why well-designed enterprise-grade storage systems disable the drive cache and rely upon some variant of striping to achieve good I/O performance.
Well I think that's the actual problem. POSIX gives you an abstract interface but it essentially does not enforce any particular semantics on those interfaces.
> why is the file API so hard to use that even experts make mistakes?
Sounds like Worse Is Better™: operating systems that tried to present safer abstractions were at a disadvantage compared to operating systems that shipped whatever was easiest to implement.
(I'm not an expert in the history, just observing the surface similarity and hoping someone with more knowledge can substantiate it.)
Jeremy Allison tracked down why POSIX standardized this behavior[0].
The reason is historical and reflects a flaw in the POSIX standards process, in my opinion, one that hopefully won't be repeated in the future. I finally tracked down why this insane behavior was standardized by the POSIX committee by talking to long-time BSD hacker and POSIX standards committee member Kirk McKusick (he of the BSD daemon artwork). As he recalls, AT&T brought the current behavior to the standards committee as a proposal for byte-range locking, as this was how their current code implementation worked. The committee asked other ISVs if this was how locking should be done. The ISVs who cared about byte range locking were the large database vendors such as Oracle, Sybase and Informix (at the time). All of these companies did their own byte range locking within their own applications, none of them depended on or needed the underlying operating system to provide locking services for them. So their unanimous answer was "we don't care". In the absence of any strong negative feedback on a proposal, the committee added it "as-is", and took as the desired behavior the specifics of the first implementation, the brain-dead one from AT&T.
The most egregious part of it for me is that if I open and close a file I might be canceling some other library's lock that I'm completely unaware of.
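i.e. with classic fcntl() record locks, this sequence silently drops the lock (file name made up):

struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };  // l_start = l_len = 0: whole file
int a = open("data.db", O_RDWR);
fcntl(a, F_SETLK, &fl);             // lock acquired through fd 'a'
int b = open("data.db", O_RDONLY);  // some library opens the same file...
close(b);                           // ...and this close releases the lock held via 'a'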
I resisted using them in my SQLite VFS, until I partially relented for WAL locks.
I wish more platforms embraced OFD locks. macOS has them, but hidden. illumos fakes them with BSD locks (which is worse, actually). The BSDs don't add them. So it's just Linux, and Windows with sane locking. In some ways Windows is actually better (supports timeouts).
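For anyone who hasn't used them: on Linux the OFD variants take the same struct flock, just different fcntl() commands (and l_pid must be 0):

#define _GNU_SOURCE            // glibc needs this for F_OFD_SETLK
#include <fcntl.h>

struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET, .l_pid = 0 };
int fd = open("data.db", O_RDWR);
if (fcntl(fd, F_OFD_SETLK, &fl) == -1) {
    // contended; F_OFD_SETLKW would block (no timeout, as noted above)
}
// the lock is owned by this open file description, so an unrelated
// open()/close() of data.db elsewhere in the process can't drop it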
> Sounds like Worse Is Better™: operating systems that tried to present safer abstractions were at a disadvantage compared to operating systems that shipped whatever was easiest to implement.
What about the Windows API? Windows is a pretty successful OS with a less leaky FS abstraction. I know it's a totally different deal than POSIX (files can't be devices etc), the FS function calls require a seemingly absurd number of arguments, but it does seem safer and clearer what's going to happen.
By the way, LMDB's main developer Howard Chu responded to the paper. He said,
> They report on a single "vulnerability" in LMDB, in which LMDB depends on the atomicity of a single sector 106-byte write for its transaction commit semantics. Their claim is that not all storage devices may guarantee the atomicity of such a write. While I myself filed an ITS on this very topic a year ago, http://www.openldap.org/its/index.cgi/Incoming?id=7668 the reality is that all storage devices made in the past 20+ years actually do guarantee atomicity of single-sector writes. You would have to rewind back to 30 years at least, to find a HDD where this is not true.
So this is a case where the programmers of LMDB thought about the "incorrect" use and decided that it was a calculated risk to take because the incorrectness does not manifest on any recent hardware.
This is analogous to the case where someone complains some C code has undefined behavior, and the developer responds by saying they have manually checked the generated assembler to make sure the assembler is correct at the ISA level even though the C code is wrong at the abstract C machine level, and they commit to checking this in the future.
Furthermore both the LMDB issue and the Postgres issue are noted in the paper to be previously known. The paper author states that Postgres documents this issue. The paper mentions pg_control so I'm guessing it's referring to this known issue here: https://wiki.postgresql.org/wiki/Full_page_writes
> We rely on 512 byte blocks (historical sector size of spinning disks) to be power-loss atomic, when we overwrite the "control file" at checkpoints.
This assumption was wrong for Intel Optane memory. Power loss could cut the data stream anywhere in the middle. (Note: the DIMM nonvolatile memory version)
Consumer Optane drives were not "power loss protected", which is very different from not honoring a requested synchronous write.
The crash-consistency problem is very different from the durability-of-real-synchronous-writes problem. There are some storage devices that will lie about sync writes, sometimes hoping that a backup battery will allow them to complete those writes.
System crashes are inevitable; use things like write-ahead logs depending on need, etc. No storage API will get rid of all system crashes, and yes, even Apple games the system by disabling real sync writes, so that will always be a battle.
You're missing the point. GP was mentioning the common assumption that all storage devices from the last 30 years are sector-atomic under power-loss conditions: either the sector is fully written or not written at all. Optane was a rare counterexample, where a sector can end up partially written, thus not sector-atomic.
There are known cases where power loss during a write can corrupt previously written data (data at rest). This is not some rare occurrence. This is why enterprise flash storage devices have power loss protection.
I wish someone would sell an SSD that was at most a firmware update away between regular NVMe drive and ZNS NVMe drive.
The latter just doesn't leave much room for the firmware to be clever and just swallow data.
Maybe also add a pSLC formatting mode for a namespace so one can be explicit about that capability...
It just has to be a drive that's useable as a generic gaming SSD so people can just buy it and have casual fun with it, like they did with Nvidia GTX GPUs and CUDA.
Unfortunately manufacturers almost always prefer price gouging on features that "CuStOmErS aRe NoT GoInG tO nEeD". Is there even a ZNS device available for someone who isn't a hyperscale datacenter operator nowadays?
Either you ask a manufacturer like WD, or you go to ebay AFAIK.
That said, ZNS is actually something specifically about being able to extract more value out of the same hardware (as the firmware no longer causes write amplification behind your back), which in turn means that the value of such a ZNS-capable drive ought to be strictly higher than that of the traditional-only version with the same hardware.
And given that enterprise SSDs seem to only really get value from an OEM's holographic sticker on them (compare almost-new-grade used prices for drives with the sticker vs. the plain SSD/HDD original model number missing the premium sticker), plus the common write-back-emergency capacitors that let a physical write-back cache in the drive ("safely") claim write-through semantics to the host, it should IMO be in the interest of the manufacturers to push ZNS:
ZNS makes, for ZNS-appropriate applications, the exact same hardware perform better despite requiring less fancy firmware.
Also, especially, there's much less need for write-back cache as the drive doesn't sort individual random writes into something less prone to write amplification: the host software is responsible for sorting data together for minimizing write amplification (usually, arranging for data that will likely be deleted together to be physically in the same erasure block).
Also, I'm not sure how exactly "bad" bins of flash behave, but I'd not be surprised if ZNS's support for zones having less usable space than LBA/address range occupied (which can btw. change upon recycling/erasing the zone!) would allow rather poor quality flash to still be effectively utilized, as even rather unpredictable degradation can be handled this way.
Basically, because Copy-on-Write storage systems (like Btrfs, or many modern database backends, specifically LSM-tree ones) inherently need some slack/empty space, it's rather easy to cope with that space decreasing as a result of write operations, regardless of whether the application/user data has actually grown from the writes: you just buy and add another drive/cluster node when you run out of space, and until then you can use 100% of the SSD's flash capacity, instead of wasting capacity up front just so the drive's usable capacity never has to decrease over the warranty period.
Give me, say, a Samsung 990 Pro 2 TB for 250 EUR but with firmware for ZNS-reformatting, instead of the 200 EUR MSRP/173 EUR Amazon.de price for the normal version.
Oh, and please let me use a decent portion of that 2 GB LPDDR4 as controller memory buffer at least if I'm in a ZNS-only formatting situation. It's after all not needed for keeping large block translation tables around, as ZNS only needs to track where physically a logical zone is currently located (wear leveling), and which individual blocks are marked dead in that physical zone (easy linear mapping between the non-contiguous usable physical blocks and the contiguous usable logical blocks). Beyond that, I guess technically it needs to keep track of open/closed zones and write pointers and filled/valid lengths.
Furthermore, I don't even need them to warranty the device lifespan in ZNS, only that it isn't bricked from activating ZNS mode. It would be nice to get as many drive-writes warranty as the non-ZNS version gets, though.
ZNS (Zoned Namespace) technology seems to offer significant benefits by reducing write amplification and improving hardware efficiency. It makes sense that manufacturers would push for ZNS adoption, as it can enhance performance without needing complex firmware. The potential for using lower-quality flash effectively is also intriguing. However, the market dynamics, like the value added by OEM stickers and the need for write-back capacitors, complicate things. Overall, ZNS appears to be a promising advancement for specific applications.
Really? A 512-byte sector could get partially written? Did anyone actually observe this, or was it just a case of Intel CYA saying they didn't guarantee anything?
Yes, really. "Crash-consistent data structures were proposed by enforcing cacheline-level failure-atomicity" see references in: https://doi.org/10.1145/3492321.3519556
> the developer responds by saying they have manually checked the generated assembler to make sure the assembler is correct at the ISA level even though the C code is wrong at the abstract C machine level, and they commit to checking this in the future.
Yeah, that sounds about right for quite a lot of C programmers, except for the "they commit to checking this in the future" part. I've gotten responses like "well, don't upgrade your compiler; I'm gonna put 'Clang >= 9.0 is unsupported' in the README as a fix".
And yet all of these systems basically work for day-to-day operations, and fail only under obscure error conditions.
It is totally acceptable for applications to say "I do not support X conditions". Swap out the file halfway through a read? Sorry, don't support that. Remove power to the storage device in the middle of a sync operation? Sorry, don't support that.
For vital applications, for example databases, this is a known problem and risks of the API are accounted for. Other applications don't have nearly that level of risk associated with them. My music tagging app doesn't need to be resistant to the SSD being struck by lightning.
It is perfectly acceptable to design APIs for 95% of use cases and leave extremely difficult leaks to be solved by the small number of practitioners that really need to solve those leaks.