
> why is the file API so hard to use that even experts make mistakes?

I think the short answer is that the APIs are bad. The POSIX fs APIs and associated semantics are so deeply entrenched in the software ecosystem (both at the OS level, and at the application level) that it's hard to move away from them.



I take a different view on this. IMO the tricks that existing file systems play to get more performance (specifically around ordering and atomicity) make their behavior extra hard for developers to reason about. Obviously, you can't do anything about fsync dropping error codes, but some of these failure modes just aren't possible over file systems like NFS due to protocol semantics.


Not only that: the POSIX file API assumes that NFS is a thing, yet NFS breaks half the important guarantees of a file system. I don’t know if it’s a baby-and-bathwater situation, but NFS just seems like a whole bunch of problems. It’s like having eval in a programming language.


The whole software ecosystem is built on bubblegum, tape, and prayers.


What aspects of NFS do you think break half of the important guarantees of a file system?


Well, at least O_APPEND, O_EXCL, O_SYNC, and flock() aren't guaranteed to work (although they can with recent versions as I understand it).

UID mapping causing read() to return -EACCES after open() succeeds breaks a lot of userland code.


Lack of inotify support is one that has annoyed me in the past. Not only does it break some desktop software, it's also something NFS should be able to support (after all, the server sees the changes and could notify clients).


Thanks for this, it's helpful. I'd definitely heard about O_APPEND and read() returning -EACCES. The other ones, I agree, should be fixed in later versions of the Linux kernel/NFS client.


Just ran into this one recently while trying to replace Docker with Podman for a CI/CD runner. Before anyone protests: we have very strong, abnormal requirements on my project that prevent most saner architectures. It wasn't the root cause, but the failure behavior was weird because of the behavior you just described.


POSIX is also so old and essential that it's hard to imagine an alternative.


Not really; there have been lots of APIs that have improved on the POSIX model.

The kind of model I prefer is something based on atomicity. Most applications can get by with file-level atomicity: make whole-file reads/writes atomic with a copy-on-write model, and you can eliminate whole classes of filesystem bugs pretty quickly. (Note that something like writeFileAtomic is already a common primitive in many high-level filesystem APIs, and it's easily buildable with regular POSIX APIs.) For cases like logging, you can extend the model slightly with atomic appends, where the only kind of write allowed is to atomically append a chunk of data to the file (so readers can only ever see either no new data or the entire chunk at once).
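For concreteness, here is roughly what that writeFileAtomic-style primitive looks like when built from plain POSIX calls (a minimal sketch: the function name, the temp-file naming, and the assumption that the target lives in the current directory are mine, and short-write handling is abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Sketch of an atomic whole-file replace: write a temp file, flush it,
     * then rename() it over the target.  Readers see either the old contents
     * or the new contents, never a partial write. */
    static int write_file_atomic(const char *path, const void *buf, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.tmp.%d", path, (int)getpid());

        int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd < 0)
            return -1;

        int ok = (write(fd, buf, len) == (ssize_t)len) &&  /* short writes elided */
                 (fsync(fd) == 0);                         /* data durable before rename */
        ok = (close(fd) == 0) && ok;

        if (!ok || rename(tmp, path) != 0) {               /* atomic replacement */
            unlink(tmp);
            return -1;
        }

        /* For durability of the rename itself, fsync() the containing directory
         * (assumed here to be the current directory). */
        int dirfd = open(".", O_RDONLY);
        if (dirfd >= 0) {
            fsync(dirfd);
            close(dirfd);
        }
        return 0;
    }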

I'm less knowledgeable about the way DBs interact with the filesystem, but there the solution is probably ditching the concept of the file stream entirely and just treating files as a sparse map of offsets to blocks, which can be atomically updated. (My understanding is that DBs basically do this already, except that "atomically updated" is difficult with the current APIs).
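To make the block-map view concrete: with today's API a DB-style engine already addresses files by offset with pwrite(), it just doesn't get an atomicity guarantee for the block. A toy sketch (the block size and names are illustrative):

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096              /* illustrative block size */

    /* Overwrite one logical block at index blkno.  pwrite() gives the
     * "map of offsets to blocks" access pattern; what POSIX does not
     * promise is that a concurrent reader sees this block either entirely
     * old or entirely new. */
    static int write_block(int fd, long blkno, const char block[BLOCK_SIZE])
    {
        off_t off = (off_t)blkno * BLOCK_SIZE;

        if (pwrite(fd, block, BLOCK_SIZE, off) != BLOCK_SIZE)
            return -1;
        return fsync(fd);                /* force the block to stable storage */
    }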


> Most applications can get by with file-level atomicity--make whole file read/writes atomic with a copy-on-write model, and you can eliminate whole classes of filesystem bugs pretty quickly.

    int fd = open(".config", O_RDWR | O_CREAT | O_SYNC_ON_CLOSE, 0666);

    // effects of calls to write(2)/etc. are invisible through any other file description
    // until the close(2) is called on all descriptors to this file description.

    close(fd);
So now you can watch for, e.g., either IN_MODIFY or IN_CLOSE_WRITE (and you don't need to balance it with IN_OPEN); either way, you'll never see partial updates... would be nice!
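(The O_SYNC_ON_CLOSE flag above is wishful thinking, but the watching half already exists today. A minimal inotify sketch that reacts to writers closing files in the current directory:)

    #include <stdio.h>
    #include <sys/inotify.h>
    #include <unistd.h>

    /* Watch the current directory and report IN_CLOSE_WRITE events.  With a
     * hypothetical O_SYNC_ON_CLOSE, this event would be the moment the whole
     * update becomes visible at once. */
    int main(void)
    {
        char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
        int fd = inotify_init();
        if (fd < 0 || inotify_add_watch(fd, ".", IN_CLOSE_WRITE) < 0)
            return 1;

        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n <= 0)
                break;
            for (char *p = buf; p < buf + n; ) {
                struct inotify_event *ev = (struct inotify_event *)p;
                if (ev->len)
                    printf("closed after write: %s\n", ev->name);
                p += sizeof *ev + ev->len;
            }
        }
        return 0;
    }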


Surely this can’t always be true?

What happens when a lot of data is written and exceeds the dirty threshold?


It gets written on the disk but into different inodes, I imagine.


It's not hard to design a less bug-prone API that would enable you to do everything the POSIX file API permits and admits equally-high-performance implementations. But making that new API a replacement for the POSIX API would require rewriting essentially all of the software that somebody cares about to use your new, better API instead of the POSIX API. This is probably only feasible in practice for small embedded systems with a fairly small universe of software.


You could do a phased transition, where both the legacy posix api and the new api are available. This has already happened with a lot of the old C standard library. Old, unsafe functions like strcpy were gradually replaced by safer alternatives like strncpy.

Database developers don’t want the complexity or poor performance of posix. It’s wild to me that we still don’t have any alternative to fsync in Linux that can act as a barrier without also flushing caches at the same time.


There are two serious factual errors in your comment:

- This has not already happened with a lot of the old C standard library. The only function that has ever been removed from the C standard library, to my knowledge, is gets(). In particular, strcpy() has not been removed. Current popular compilers still support gets() with the right options, so it hasn't been removed from the actual library, just the standard.

- strncpy() is not a suitable replacement for strcpy(), certainly not a safer one. It can produce strings missing the terminating null (a short illustration follows below), and it can be slower by orders of magnitude. This has been true since it was introduced in the 1970s. Nearly every call to strncpy() is a bug, and in many cases an exploitable security hole. You are propagating dangerous misinformation. (This is a sign of how difficult it is to make these transitions.)
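To make the missing-terminator hazard concrete, a minimal illustration (the buffer size is arbitrary):

    #include <stdio.h>
    #include <string.h>

    /* strncpy() pads with NULs when src is shorter than the buffer, but when
     * src fills (or would overflow) the buffer, dst is left WITHOUT a
     * terminating NUL. */
    int main(void)
    {
        char dst[8];
        strncpy(dst, "12345678", sizeof dst);   /* 8 chars copied, no room for '\0' */

        /* dst is not a C string now; printf("%s", dst) would read past the
         * buffer, so the print below is explicitly bounded. */
        printf("%.*s <- not NUL-terminated\n", (int)sizeof dst, dst);
        return 0;
    }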

You also seem to imply that Linux cannot add system calls that are not specified in POSIX, but of course it can and does; openat() and the other 12 related functions, epoll_*(), io_uring_*(), futex_*(), kexec_load(), add_key(), and many others are Linux-specific. The reason barrier() hasn't been added is evidently that the kernel developers haven't been convinced it's worthwhile in the 15+ years since it was proposed, not that POSIX ties their hands.

The nearest equivalents in C for the kind of "staged transition" you are proposing might be things like the 16-bit near/far/huge qualifiers and the Win16 and pre-X MacOS programming models. In each of these cases, a large body of pre-existing software was essentially abandoned and replaced by newly written software.


Yeah, I understand that those methods are still available. But their use is heavily discouraged in new software and a lot of validators & sanitisers will warn if your programs use them. Software itself has largely slowly moved to using the newer, safer methods even though the old methods were never taken away.

I don’t understand the reticence of kernel developers to implement a barrier syscall. I know they could do it. And as this article points out, it would dramatically improve database performance for databases which make use of it. Why hasn’t it happened?

Another commenter says NVMe doesn't support it natively, but I bet hardware vendors would add hardware support if Linux supported it, since adding barrier support to their hardware would measurably improve the performance of their devices.


Sure, adding that functionality to NVMe would be easy; there are sufficient provisions for adding such support. For example, a global flag whose support is advertised and which can then be turned on by the host to make the very same normal flush opcode also guarantee a pipelined write barrier (while retaining the flush-write-back-caches-before-reporting-completion-of-this-submitted-IO-operation semantics).

The reason it hasn't been supported yet, by the way, is that they explicitly wanted to allow fully parallel processing of commands in a queue, at least for submissions that exist concurrently in the command queue. In practice I don't see why this would have to be enforced to such an extent, as the only reason for out-of-order processing I can think of is that the auxiliary data of a command is physically located in host memory and the DMA reads across PCIe from the NVMe controller to host memory happen to complete out of order for host DRAM controller/pattern reasons. Thus it might be something you'd not want to turn on without using a controller memory buffer (where you mmap some of the DRAM on the NVMe device into host memory, write your full-detail commands directly to it across PCIe, and keep the NVMe controller from having to first send a read request across PCIe in response to you ringing its doorbell: instead it can read directly from its local DRAM when you ring the doorbell).



That flushes, just like the fsync() system call mentioned in Luu's post.


It actually writes dirty data and then flushes.

That said, IO barriers in storage are typically synonymous with flushes. For example, the ext4 nobarrier mount option disables flushes.


It sounds like you aren't very familiar with C; for example, C doesn't have any methods at all, neither old methods nor newer methods. There's no such thing as a method in C. This would explain your mistakes in, for example, thinking that strncpy() was safer, or that strcpy() had been removed from the C standard.

Unless you mean memcpy(), there is in fact no safer alternative function in the C standard for strcpy(); software has not largely moved to not using strcpy() (plenty of new C code uses it); and most validators and sanitizers do not emit warnings for strcpy(). There is a more extensive explanation of this at https://software.codidact.com/posts/281518. GCC has warnings for some uses of strcpy(), but only those that can be statically guaranteed to be incorrect: https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html

Newer, safer alternatives to strcpy() include strlcpy() and strscpy() (see https://lwn.net/Articles/659818/), neither of which is in Standard C yet. Presumably OpenBSD has some sort of validator that recommends replacing strcpy() with strlcpy(), which is licensed such that you can bundle it with your program. Visual C++ will invite you to replace your strcpy() calls with the nonstandard Microsoft extension strcpy_s(), thus making your code nonportable and, as it happens, also buggy. An incompatible version of strcpy_s() has been added as an optional annex to the C11 standard. https://nullprogram.com/blog/2021/07/30/ gives extensive details, summarized as "there are no correct or practical implementations". The Linux kernel's checkpatch.pl will invite you to replace calls to strcpy() with calls to the nonstandard Linux/BSD extension strscpy(), but it's a kernel-specific linter.

So there are not literally zero validators and sanitizers that will warn on all uses of strcpy() in C, but most of them don't.

— ⁂ —

I don't know enough about the barrier()/osync() proposal to know why it hasn't been adopted, and obviously neither do you, since you can't know anything significant about Linux kernel internals if you think that C has methods or that strncpy() is a safer alternative to strcpy().

But I can speculate! I think we can exclude the following possibilities:

- That the paper, which I haven't read much of, just went unnoticed and nobody thought of the barrier() idea again. Luu points out that it's a sort of obvious idea for kernel developers; Chidambaram et al. ("Optimistic Crash Consistency") weren't even the first ones to propose it (and it wasn't even the main topic of their paper); and their paper has been cited in hundreds of other papers, largely in systems software research on SSDs: https://scholar.google.com/scholar?cites=1238063331053768604...

- That it's a good idea in theory, but implementing even a research prototype is too much work. Chidambaram et al.'s code is available at https://github.com/utsaslab/optfs, and it is of course GPLv2, so that work is already done for you. You can download a VM image from https://research.cs.wisc.edu/adsl/Software/optfs/ for testing.

- That authors of databases don't care about performance. The authors of SQLite, which is what Chidambaram et al. used in their paper, dedicate a lot of effort to continuously improving its performance: https://www.sqlite.org/cpu.html and it's also a major consideration for MariaDB and PostgreSQL.

- That there's an existing production-ready implementation that Linus is just rejecting because he's stubborn. If that were true, you'd see an active community around the OptFS patch, Red Hat applying it to their kernels (as they do with so many other non-Linus-accepted patches), etc.

- That it relies on asynchronous barrier support in the hardware interface, as the other commenter suggested. It doesn't.

So what does that leave?

Maybe the paper was wrong, which seems unlikely, or applicable only to niche cases. You should be able to build and run their benchmarks.

Maybe it was right at the time on spinning rust ("a Hitachi DeskStar 7K1000.B 1 TB drive") but wrong on SSDs, whose "seek time" is two to three orders of magnitude faster.

In particular, maybe it uses too much CPU.

Maybe it was right then and is still right but the interface has other drawbacks, for example being more bug-prone, which also seems unlikely, or undesirably constrains the architecture of other aspects of the kernel, such as the filesystem, in order to work well enough. (You could implement osync() as a filesystem-wide fsync() as a fallback, so this would just reduce the benefits, not increase the costs.)

Maybe it's obviously the right thing to do but nobody cares enough about it to step up and take responsibility for bringing the new system call up to Linus's standards and committing to maintain it over time.

If it was really a big win for database performance, you'd think one of the developers of MariaDB, PostgreSQL, or SQLite would have offered, or maybe one of the financial sponsors of the paper, which included Facebook and EMC. Luu doesn't say Twitter used the OptFS patch when he was on the Linux kernel team there; perhaps they used it secretly, but more likely they didn't find its advantages compelling enough to use.

Out of all these unlikely cases, my best guess is either "applicable only to niche cases", "wrong on SSDs", or "undesirably constrains filesystem implementation".

As a note on tone, some people may find it offputting when you speak so authoritatively about things you don't know anything about.


> C doesn't have any methods at all, neither old methods nor newer methods. There's no such thing as a method in C. This would explain your mistakes in, for example, thinking that strncpy() was safer, or that strcpy() had been removed from the C standard.

This is an overly pedantic, ungenerous interpretation of what I wrote.

First, fine - you can argue that C has functions, not methods. But eh.

Second, for all practical purposes, C on Linux does have a standard library. It’s just - as you mentioned - not quite the same on every platform. We wouldn’t be talking about strcpy if C had no standard library equivalent.

Third, thank you for the suggestion that there are even better examples than strcpy -> strncpy that I could have used to make my point more strongly. I should have chosen sprintf, gets or scanf.

I’ve been out of the game of writing C professionally for 15 years or so. I know a whole lot more about C than most. But memories fade with time. Thanks for the corrections. Likewise, no need to get snarky with them.


NVMe has no barrier that doesn't flush the pipeline/ringbuffer of IO requests submitted to it :(


The kernel could implement a non-flushing barrier, even if the underlying device doesn't. You could even do it without any barrier support at all from the underlying device, as long as it reliably tells you when each request has completed; you just don't send it any requests from after the barrier until all the requests before the barrier have completed.
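Here's a toy user-space sketch of that scheduling idea (the real thing would live in the block layer; the file offsets and names are arbitrary). Note that "completed" here only means the request finished, possibly just into the device cache, not that the data is durable:

    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    /* Ordering barrier emulated in user space: queue writes A and B, then
     * refuse to submit C until A and B have completed.  No cache flush is
     * issued anywhere. */
    static int write_async(struct aiocb *cb, int fd, const char *buf, off_t off)
    {
        memset(cb, 0, sizeof *cb);
        cb->aio_fildes = fd;
        cb->aio_buf    = (volatile void *)buf;
        cb->aio_nbytes = strlen(buf);
        cb->aio_offset = off;
        return aio_write(cb);
    }

    int demo(int fd)
    {
        struct aiocb a, b, c;
        const struct aiocb *pre[] = { &a, &b };

        write_async(&a, fd, "pre-barrier write A\n", 0);
        write_async(&b, fd, "pre-barrier write B\n", 64);

        /* "barrier": wait until every pre-barrier request has completed */
        while (aio_error(&a) == EINPROGRESS || aio_error(&b) == EINPROGRESS)
            aio_suspend(pre, 2, NULL);

        /* only now may the post-barrier write be issued */
        write_async(&c, fd, "post-barrier write C\n", 128);
        while (aio_error(&c) == EINPROGRESS)
            aio_suspend((const struct aiocb *[]){ &c }, 1, NULL);

        return 0;
    }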


That would not work as you describe it. The device will return completion upon the writes reaching its cache. You need a flush to ensure that the data reaches stable storage.

You could probably abuse Force Unit Access to make it work by marking all IOs as Force Unit Access, but a number of buggy devices do not implement FUA properly, which defeats the purpose of using it. That would be why Microsoft disabled the NTFS feature that uses FUA on commodity hardware:

https://learn.microsoft.com/en-us/windows/win32/fileio/deplo...

What you seem to want is FreeBSD’s UFS2 soft updates, which use Force Unit Access to avoid the need for flushes for metadata updates. It has the downside that it is unreliable on hardware that does not implement FUA properly. Also, UFS2 soft updates do not actually do anything to protect data when fsync(2) is called, if this mailing list email is accurate:

https://lists.freebsd.org/pipermail/freebsd-fs/2011-November...

As pjd said:

> Synchronous writes (or BIO_FLUSH) are needed to handle O_SYNC/fsync(2) properly, which UFS currently doesn't care about.

That said, avoiding flushes for a fsync(2) would require doing FUA on all IOs. Presumably, this is not done because it would make all requests take longer all the time, raising queue depths and causing things to have to wait for queue limits more often, killing performance. Raising the OS queue depth to compensate would not work since SATA has a maximum queue depth of 32, although it might work for NVMe where the maximum queue depth is 65536, if keeping track of an increased number of inflight IOs does not cause additional issues at the storage devices (such as IOs that never complete as long as the device is kept busy because the device will keep reordering them to the end of the queue).

Using FUA only on metadata as is done in UFS2 soft updates improves performance by eliminating the need for journalling in all cases but the case of space usage, which still needs journalling (or fsck after power loss if you choose to forgo it).


Writes in the POSIX API can be atomic depending on the underlying filesystem. For example, small writes on ZFS through the POSIX API are atomic since they either happen in their entirety or they do not (during power failure), although if the writes are big enough (spanning many records), they are split into separate transactions and partial writes are then possible:

https://github.com/openzfs/zfs/blob/34205715e1544d343f9a6414...

Writes on ZFS cease to be atomic around approximately 32MB in size if I read the code correctly.


> make whole file read/writes atomic with a copy-on-write model,

I have many files that are several GB. Are you sure this is a good idea? What if my application only requires best effort?

> eliminate whole classes of filesystem bugs pretty quickly.

Block level deduplication is notoriously difficult.

> where the only kind of write allowed is to atomically append a chunk of data to the file

Which sounds good until you think about the complications involved in block oriented storage medium. You're stuck with RMW whether you think you're strictly appending or not.


It doesn’t have to be one or the other. Developers could decide by passing flags to open.

But even then, doing atomic writes of multi-gigabyte files doesn’t sound that hard to implement efficiently. Just write the data to disk first and update the metadata atomically at the end. Or whenever you choose to as a programmer.

The downside is that, when overwriting, you’ll need enough free space to store both the old and new versions of your data. But I think that’s usually a good trade off.

It would allow all sorts of useful programs to be written easily - like an atomic mode for apt, where packages either get installed or not installed. But they can’t be half installed.


Packages consist of multiple files. An atomic file write would not allow packages to be either installed or not installed by APT.


Atomicity could encompass a whole bunch of writes at once.

Databases implemented atomic transactions in the 70s. Let’s stop pretending like this is an unsolvable CS problem. It’s not.


That is not what an atomic write() function does and we are talking about APT, not databases.

If you want atomic updates with APT, you could look into doing prestaged updates on ZFS. It should be possible to retrofit it into APT. Have it update a clone of the filesystem and create a new boot environment after it is done. The boot environment either is created or is not created. Then reboot into the updated OS and you can promote the clone and delete the old boot environment afterward. OpenSolaris had this capability over a decade ago.


> Databases implemented atomic transactions in the 70s.

And they have deadlocks as a result, for which there is no good, easy solution (generally we work around it by having only one program access a given database at a time, and even that is not 100% reliable).


Eh. Deadlocks can be avoided if you don’t use SQL’s exact semantics. For example, FoundationDB uses MVCC such that if two conflicting write transactions are committed at the same time, one transaction succeeds and the other is told to retry.

It works great in practice, even with a lot of concurrent clients. (iCloud is all built on FoundationDB.)

Holding locks while waiting for more locks is what causes deadlocks. I agree with you - that would be a bad way to implement filesystem transactions. But we have a lot of other options.


This is kind of an interesting thought; it mirrors how Docker uses OverlayFS to track changes to the entire file system. No need for new file APIs.


It can also use ZFS to do this.


> Developers could decide by passing flags to open.

Provided the underlying VFS has implemented them. They may not. Hence the point in the article that some developers only choose to support 'ext4' and nothing else.

> you’ll need enough free space to store both the old and new versions of your data.

The sacrifice is increased write wear on solid state devices.

> It would allow all sorts of useful programs to be written easily

Sure. As long as you don't need multiple processes to access the same file simultaneously. I think the article misses this point, too: every FS on a multi-user system is effectively a "distributed system." It's not distributed for _redundancy_, but it doesn't eliminate the attendant challenges.


Dropbox reversed its stance on this. It added support for ZFS, XFS, ecryptfs and btrfs:

https://help.dropbox.com/installs/system-requirements

They say ecryptfs is only supported when it is backed by ext4, which is a bit strange. I wonder if that is documented just to be able to close support cases when ecryptfs is used on top of a filesystem that is missing extended attribute support and their actual code does not actually check what is below ecryptfs. Usually the application above would not know what is below ecryptfs, so they would need to go out of their way to check this in order to enforce that. I do not use Dropbox, so someone else would need to test to see if they actually enforce that if curious enough.


Yes, a feature like this would need cooperation with the filesystem. But that’s just an implementation problem. That’s like saying we can’t add flexbox to browsers because all the browsers would need to add it. So?

As for wear on SSDs, I don’t think it would increase wear. You’re writing the same number of sectors on the drive. A 2gb write would still write 2gb (+ negligible metadata overhead). Why would the drive wear out faster in this scheme?

And I think it would work way better with multiple processes than the existing system. Right now the semantics when multiple processes edit the same file at once are somewhat undefined. With this approach, files would have database like semantics where any reader would either see the state before a write or the state after. It’s much cleaner - since it would become impossible for skewed reads or writes to corrupt a shared file.

Would you argue against the existence of database transactions? Of course not. Nobody does. They’re a great idea, and they’re way easier to reason about and use correctly compared to the POSIX filesystem api. I’m saying we should have the same integrity guarantees on the filesystem. I think if we had those guarantees already, you’d agree too.


Some of the problems transcend POSIX. Someone I know maintains a non-relational db on IBM mainframes. When diving into a data issue, he was gob-smacked to find out that sync'd writes did not necessarily make it to the disk. They were cached in the drive memory and (I think) the disk controller memory. If all failed, data was lost.


This is precisely why well-designed enterprise-grade storage systems disable the drive cache and rely upon some variant of striping to achieve good I/O performance.


Just wait till he has to deal with RAID controllers.


I use Plan 9 regularly and while its Unix heritage is there, it most certainly isn't Unix and completely does away with POSIX.


> POSIX fs APIs and associated semantics

Well I think that's the actual problem. POSIX gives you an abstract interface but it essentially does not enforce any particular semantics on those interfaces.



