> Pillai et al., OSDI’14 looked at a bunch of software that writes to files, including things we'd hope write to files safely, like databases and version control systems: Leveldb, LMDB, GDBM, HSQLDB, Sqlite, PostgreSQL, Git, Mercurial, HDFS, Zookeeper. They then wrote a static analysis tool that can find incorrect usage of the file API, things like incorrectly assuming that operations that aren't atomic are actually atomic, incorrectly assuming that operations that can be re-ordered will execute in program order, etc.
> When they did this, they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug. This isn't a knock on the developers of this software or the software -- the programmers who work on things like Leveldb, LMDB, etc., know more about filesystems than the vast majority of programmers, and the software has more rigorous tests than most software. But they still can't use files safely every time! A natural follow-up to this is the question: why is the file API so hard to use that even experts make mistakes?
> why is the file API so hard to use that even experts make mistakes?
I think the short answer is that the APIs are bad. The POSIX fs APIs and associated semantics are so deeply entrenched in the software ecosystem (both at the OS level, and at the application level) that it's hard to move away from them.
I take a different view on this. IMO the tricks that existing file systems play to get more performance (specifically around ordering and atomicity) make it extra hard for developers to reason about. Obviously, you can't do anything about fsync dropping error codes, but some of these failure modes just aren't possible over file systems like NFS due to protocol semantics.
Not only that, but the POSIX file API also assumes that NFS is a thing, and NFS breaks half the important guarantees of a file system. I don’t know if it’s a baby-and-bathwater situation, but NFS just seems like a whole bunch of problems. It’s like having eval in a programming language.
Lack of inotify support is one that has annoyed me in the past. It not only breaks some desktop software, but it also should be possible for NFS to support (after all, the server sees the changes and could notify clients).
Thanks for this, it's helpful. Totally hear you about O_APPEND and read() returning -EACCES. The other ones, I agree, should be fixed in later versions of the Linux kernel/NFS client.
Just ran into this one recently trying to replace Docker with Podman for a CI/CD runner. Before anyone protests: we have very strong, abnormal requirements on my project that prevent most saner architectures. It wasn’t the root cause, but the failure behavior was weird due to the behavior you just described.
Not really; there have been lots of APIs that have improved on the POSIX model.
The kind of model I prefer is something based on atomicity. Most applications can get by with file-level atomicity--make whole file read/writes atomic with a copy-on-write model, and you can eliminate whole classes of filesystem bugs pretty quickly. (Note that something like writeFileAtomic is already a common primitive in many high-level filesystem APIs, and it's something that's already easily buildable with regular POSIX APIs). For cases like logging, you can extend the model slightly with atomic appends, where the only kind of write allowed is to atomically append a chunk of data to the file (so readers can only possibly either see no new data or the entire chunk of data at once).
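For the whole-file case that's just the classic write-to-a-temp-file-and-rename dance; a minimal sketch (needs <fcntl.h> and <unistd.h>; error handling and short-write loops omitted; the path names and buf/len are made up):

// build the new version in a temp file on the same filesystem
int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
write(fd, buf, len);   // real code must loop on short writes and check errors
fsync(fd);             // make the new contents durable before publishing them
close(fd);
// atomically publish: readers see the old file or the new one, never a mix
rename("config.tmp", "config");
// fsync the containing directory if the rename itself must survive a crash
int dfd = open(".", O_RDONLY | O_DIRECTORY);
fsync(dfd);
close(dfd);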
I'm less knowledgeable about the way DBs interact with the filesystem, but there the solution is probably ditching the concept of the file stream entirely and just treating files as a sparse map of offsets to blocks, which can be atomically updated. (My understanding is that DBs basically do this already, except that "atomically updated" is difficult with the current APIs).
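(Roughly: a database can already overwrite an individual page in place with pwrite(2), as in the fragment below, with fd/page_buf/PAGE_SIZE being illustrative names. What it can't do is commit a *set* of such page writes atomically, which is why it layers a journal or WAL on top.)

// overwrite one fixed-size page at its offset; fine for a single page,
// but there is no primitive to make a group of these updates atomic
pwrite(fd, page_buf, PAGE_SIZE, (off_t)page_no * PAGE_SIZE);
fsync(fd);   // durability, still no atomicity across pages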
> Most applications can get by with file-level atomicity--make whole file read/writes atomic with a copy-on-write model, and you can eliminate whole classes of filesystem bugs pretty quickly.
// O_SYNC_ON_CLOSE is hypothetical -- it's the flag being wished for here
int fd = open(".config", O_RDWR | O_CREAT | O_SYNC_ON_CLOSE, 0666);
// effects of calls to write(2)/etc. are invisible through any other file description
// until close(2) has been called on every descriptor referring to this file description.
write(fd, buf, len);
close(fd);
So now you can watch for e.g. either IN_MODIFY or IN_CLOSE_WRITE (and you don't need to balance it with IN_OPEN), it doesn't matter, you'll never see partial updates... would be nice!
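Until a flag like that exists, the closest approximation I know of is the temp-file-plus-rename trick on the writer side and watching for the rename on the reader side (the watched directory here is made up):

#include <sys/inotify.h>
int ifd = inotify_init1(IN_CLOEXEC);
// IN_MOVED_TO: a temp file was rename(2)d over the target (atomic replace);
// IN_CLOSE_WRITE: a writer finished and closed the file
inotify_add_watch(ifd, "/etc/myapp", IN_CLOSE_WRITE | IN_MOVED_TO);
char evbuf[4096];
ssize_t n = read(ifd, evbuf, sizeof evbuf);  // blocks until an event arrives
// walk the struct inotify_event records in evbuf[0..n)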
It's not hard to design a less bug-prone API that would enable you to do everything the POSIX file API permits and admits equally-high-performance implementations. But making that new API a replacement for the POSIX API would require rewriting essentially all of the software that somebody cares about to use your new, better API instead of the POSIX API. This is probably only feasible in practice for small embedded systems with a fairly small universe of software.
You could do a phased transition, where both the legacy posix api and the new api are available. This has already happened with a lot of the old C standard library. Old, unsafe functions like strcpy were gradually replaced by safer alternatives like strncpy.
Database developers don’t want the complexity or poor performance of posix. It’s wild to me that we still don’t have any alternative to fsync in Linux that can act as a barrier without also flushing caches at the same time.
There are two serious factual errors in your comment:
- This has not already happened with a lot of the old C standard library. The only function that has ever been removed from the C standard library, to my knowledge, is gets(). In particular, strcpy() has not been removed. Current popular compilers still support gets() with the right options, so it hasn't been removed from the actual library, just the standard.
- strncpy() is not a suitable replacement for strcpy(), certainly not a safer one. It can produce strings missing the terminating null, and it can be slower by orders of magnitude. This has been true since it was introduced in the 01970s. Nearly every call to strncpy() is a bug, and in many cases an exploitable security hole. You are propagating dangerous misinformation. (This is a sign of how difficult it is to make these transitions.)
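To make the missing-terminator hazard concrete:

char dst[4];
strncpy(dst, "abcdef", sizeof dst); // dst now holds 'a','b','c','d' and no terminating NUL
printf("%s\n", dst);                // reads past the end of dst: undefined behavior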
You also seem to imply that Linux cannot add system calls that are not specified in POSIX, but of course it can and does; openat() and the other 12 related functions, epoll_*(), io_uring_*(), futex_*(), kexec_load(), add_key(), and many others are Linux-specific. The reason barrier() hasn't been added is evidently that the kernel developers haven't been convinced it's worthwhile in the 15+ years since it was proposed, not that POSIX ties their hands.
The nearest equivalents in C for the kind of "staged transition" you are proposing might be things like the 16-bit near/far/huge qualifiers and the Win16 and pre-X MacOS programming models. In each of these cases, a large body of pre-existing software was essentially abandoned and replaced by newly written software.
Yeah, I understand that those methods are still available. But their use is heavily discouraged in new software and a lot of validators & sanitisers will warn if your programs use them. Software itself has largely slowly moved to using the newer, safer methods even though the old methods were never taken away.
I don’t understand the reticence of kernel developers to implement a barrier syscall. I know they could do it. And as this article points out, it would dramatically improve database performance for databases which make use of it. Why hasn’t it happened?
Another commenter says NVMe doesn’t support it natively but I bet hardware vendors would add hardware support if Linux supported it and adding barrier support to their hardware would measurably improve the performance of their devices.
Sure, adding that functionality to NVMe would be easy; there are sufficient provisions for adding such support. For example, a global flag whose support is advertised by the device and which can then be turned on by the host, causing the very same normal flush opcode to also guarantee a pipelined write barrier (while retaining the flush-write-back-caches-before-reporting-completion semantics of that submitted IO operation).
The reason it hasn't been supported yet, by the way, is that they explicitly wanted to allow fully parallel processing of commands in a queue, at least for submissions that exist concurrently in the command queue. In practice I don't see why this would have to be enforced to such an extent, as the only reason for out-of-order processing I can think of is that the auxiliary data of a command is physically located in host memory and the DMA reads from the NVMe controller across PCIe to host memory happen to complete out of order for host DRAM controller/pattern reasons.
Thus it might be something you'd not want to turn on without using a controller memory buffer (where you can mmap some of the DRAM on the NVMe device into host memory, write your full-detail commands directly to it across PCIe, and keep the NVMe controller from having to first send a read request across PCIe in response to you ringing its doorbell: instead it can directly read from its local DRAM when you ring the doorbell).
It sounds like you aren't very familiar with C; for example, C doesn't have any methods at all, neither old methods nor newer methods. There's no such thing as a method in C. This would explain your mistakes in, for example, thinking that strncpy() was safer, or that strcpy() had been removed from the C standard.
Unless you mean memcpy(), there is in fact no safer alternative function in the C standard for strcpy(); software has not largely moved to not using strcpy() (plenty of new C code uses it); and most validators and sanitizers do not emit warnings for strcpy(). There is a more extensive explanation of this at https://software.codidact.com/posts/281518. GCC has warnings for some uses of strcpy(), but only those that can be statically guaranteed to be incorrect: https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html
Newer, safer alternatives to strcpy() include strlcpy() and strscpy() (see https://lwn.net/Articles/659818/), neither of which is in Standard C yet. Presumably OpenBSD has some sort of validator that recommends replacing strcpy() with strlcpy(), which is licensed such that you can bundle it with your program. Visual C++ will invite you to replace your strcpy() calls with the nonstandard Microsoft extension strcpy_s(), thus making your code nonportable and, as it happens, also buggy. An incompatible version of strcpy_s() has been added as an optional annex to the C11 standard. https://nullprogram.com/blog/2021/07/30/ gives extensive details, summarized as "there are no correct or practical implementations". The Linux kernel's checkpatch.pl will invite you to replace calls to strcpy() with calls to the nonstandard Linux/BSD extension strscpy(), but it's a kernel-specific linter.
So there are not literally zero validators and sanitizers that will warn on all uses of strcpy() in C, but most of them don't.
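For reference, the strlcpy() contract those linters push you toward is small enough to sketch (a stand-in with the same semantics, not OpenBSD's code):

#include <string.h>

// copy at most size-1 bytes, always NUL-terminate (when size > 0), and
// return strlen(src) so the caller can detect truncation (result >= size)
size_t my_strlcpy(char *dst, const char *src, size_t size) {
    size_t srclen = strlen(src);
    if (size > 0) {
        size_t n = srclen < size - 1 ? srclen : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;
}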
— ⁂ —
I don't know enough about the barrier()/osync() proposal to know why it hasn't been adopted, and obviously neither do you, since you can't know anything significant about Linux kernel internals if you think that C has methods or that strncpy() is a safer alternative to strcpy().
But I can speculate! I think we can exclude the following possibilities:
- That the paper, which I haven't read much of, just went unnoticed and nobody thought of the barrier() idea again. Luu points out that it's a sort of obvious idea for kernel developers; Chidambaram et al. ("Optimistic Crash Consistency") weren't even the first ones to propose it (and it wasn't even the main topic of their paper); and their paper has been cited in hundreds of other papers, largely in systems software research on SSDs: https://scholar.google.com/scholar?cites=1238063331053768604...
- That it's a good idea in theory, but implementing even a research prototype is too much work. Chidambaram et al.'s code is available at https://github.com/utsaslab/optfs, and it is of course GPLv2, so that work is already done for you. You can download a VM image from https://research.cs.wisc.edu/adsl/Software/optfs/ for testing.
- That authors of databases don't care about performance. The authors of SQLite, which is what Chidambaram et al. used in their paper, dedicate a lot of effort to continuously improving its performance: https://www.sqlite.org/cpu.html and it's also a major consideration for MariaDB and PostgreSQL.
- That there's an existing production-ready implementation that Linus is just rejecting because he's stubborn. If that were true, you'd see an active community around the OptFS patch, Red Hat applying it to their kernels (as they do with so many other non-Linus-accepted patches), etc.
- That it relies on asynchronous barrier support in the hardware interface, as the other commenter suggested. It doesn't.
So what does that leave?
Maybe the paper was wrong, which seems unlikely, or applicable only to niche cases. You should be able to build and run their benchmarks.
Maybe it was right at the time on spinning rust ("a Hitachi DeskStar 7K1000.B 1 TB drive") but wrong on SSDs, whose "seek time" is two to three orders of magnitude faster.
In particular, maybe it uses too much CPU.
Maybe it was right then and is still right but the interface has other drawbacks, for example being more bug-prone, which also seems unlikely, or undesirably constrains the architecture of other aspects of the kernel, such as the filesystem, in order to work well enough. (You could implement osync() as a filesystem-wide fsync() as a fallback, so this would just reduce the benefits, not increase the costs.)
Maybe it's obviously the right thing to do but nobody cares enough about it to step up and take responsibility for bringing the new system call up to Linus's standards and committing to maintain it over time.
If it was really a big win for database performance, you'd think one of the developers of MariaDB, PostgreSQL, or SQLite would have offered, or maybe one of the financial sponsors of the paper, which included Facebook and EMC. Luu doesn't say Twitter used the OptFS patch when he was on the Linux kernel team there; perhaps they used it secretly, but more likely they didn't find its advantages compelling enough to use.
Out of all these unlikely cases, my best guess is either "applicable only to niche cases", "wrong on SSDs", or "undesirably constrains filesystem implementation".
As a note on tone, some people may find it offputting when you speak so authoritatively about things you don't know anything about.
> C doesn't have any methods at all, neither old methods nor newer methods. There's no such thing as a method in C. This would explain your mistakes in, for example, thinking that strncpy() was safer, or that strcpy() had been removed from the C standard.
This is an overly pedantic, ungenerous interpretation of what I wrote.
First, fine - you can argue that C has functions, not methods. But eh.
Second, for all practical purposes, C on Linux does have a standard library. It’s just - as you mentioned - not quite the same on every platform. We wouldn’t be talking about strcpy if C had no standard library equivalent.
Third, thank you for the suggestion that there are even better examples than strcpy -> strncpy that I could have used to make my point more strongly. I should have chosen sprintf, gets or scanf.
I’ve been out of the game of writing C professionally for 15 years or so. I know a whole lot more about C than most. But memories fade with time. Thanks for the corrections. Likewise, no need to get snarky with them.
The kernel could implement a non-flushing barrier, even if the underlying device doesn't. You could even do it without any barrier support at all from the underlying device, as long as it reliably tells you when each request has completed; you just don't send it any requests from after the barrier until all the requests before the barrier have completed.
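The bookkeeping is simple enough to sketch (everything here is hypothetical pseudo-C, not real block-layer code; push/pop/device_submit are stand-ins):

struct ordered_queue {
    unsigned inflight;       // submitted to the device, not yet completed
    int barrier_pending;     // a barrier was issued and hasn't drained yet
    struct req_list held;    // writes queued up behind the barrier
};

void submit(struct ordered_queue *q, struct request *r) {
    if (q->barrier_pending) { push(&q->held, r); return; }  // hold back post-barrier writes
    q->inflight++;
    device_submit(r);                                       // no flush command involved
}

void barrier(struct ordered_queue *q) {
    if (q->inflight > 0) q->barrier_pending = 1;            // nothing to wait for otherwise
}

void on_completion(struct ordered_queue *q) {               // device completion callback
    if (--q->inflight == 0 && q->barrier_pending) {
        q->barrier_pending = 0;
        struct request *r;
        while ((r = pop(&q->held)) != NULL) submit(q, r);   // release the held writes
    }
}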
That would not work as you describe it. The device will return completion upon the writes reaching its cache. You need a flush to ensure that the data reaches stable storage.
You could probably abuse Force Unit Access to make it work by marking all IOs as Force Unit Access, but a number of buggy devices do not implement FUA properly, which defeats the purpose of using it. That would be why Microsoft disabled the NTFS feature that uses FUA on commodity hardware:
What you seem to want is FreeBSD’s UFS2 soft updates, which uses force unit access to avoid the need for flushes for metadata updates. It has the downside that it is unreliable on hardware that does not implement FUA properly. Also, UFS2 soft updates does not actually do anything to protect data when fsync(2) is called, if this mailing list email is accurate:
> Synchronous writes (or BIO_FLUSH) are needed to handle O_SYNC/fsync(2) properly, which UFS currently doesn't care about.
That said, avoiding flushes for an fsync(2) would require doing FUA on all IOs. Presumably, this is not done because it would make all requests take longer all the time, raising queue depths and causing things to have to wait for queue limits more often, killing performance. Raising the OS queue depth to compensate would not work since SATA has a maximum queue depth of 32, although it might work for NVMe where the maximum queue depth is 65536, if keeping track of an increased number of inflight IOs does not cause additional issues at the storage devices (such as IOs that never complete as long as the device is kept busy, because the device keeps reordering them to the end of the queue).
Using FUA only on metadata as is done in UFS2 soft updates improves performance by eliminating the need for journalling in all cases but the case of space usage, which still needs journalling (or fsck after power loss if you choose to forgo it).
Writes in the POSIX API can be atomic depending on the underlying filesystem. For example, small writes on ZFS through the POSIX API are atomic since they either happen in their entirety or they do not (during power failure), although if the writes are big enough (spanning many records), they are split into separate transactions and partial writes are then possible:
> make whole file read/writes atomic with a copy-on-write model,
I have many files that are several GB. Are you sure this is a good idea? What if my application only requires best effort?
> eliminate whole classes of filesystem bugs pretty quickly.
Block level deduplication is notoriously difficult.
> where the only kind of write allowed is to atomically append a chunk of data to the file
Which sounds good until you think about the complications involved in a block-oriented storage medium. You're stuck with read-modify-write (RMW) whether you think you're strictly appending or not.
It doesn’t have to be one or the other. Developers could decide by passing flags to open.
But even then, doing atomic writes of multi-gigabyte files doesn’t sound that hard to implement efficiently. Just write the data to disk first and update the metadata atomically at the end. Or whenever you choose to as a programmer.
The downside is that, when overwriting, you’ll need enough free space to store both the old and new versions of your data. But I think that’s usually a good trade off.
It would allow all sorts of useful programs to be written easily - like an atomic mode for apt, where packages either get installed or not installed. But they can’t be half installed.
That is not what an atomic write() function does and we are talking about APT, not databases.
If you want atomic updates with APT, you could look into doing prestaged updates on ZFS. It should be possible to retrofit it into APT. Have it update a clone of the filesystem and create a new boot environment after it is done. The boot environment either is created or is not created. Then reboot into the updated OS and you can promote the clone and delete the old boot environment afterward. OpenSolaris had this capability over a decade ago.
> Databases implemented atomic transactions in the 70s.
And they have deadlocks as a result, which there is no good easy solution to (generally we work around by having only one program access a given database at a time, and even that is not 100% reliable).
Eh. Deadlocks can be avoided if you don’t use SQL’s exact semantics. For example, FoundationDB uses MVCC such that if two conflicting write transactions are committed at the same time, one transaction succeeds and the other is told to retry.
It works great in practice, even with a lot of concurrent clients. (iCloud is all built on FoundationDB.)
Hold-and-wait locking is what causes deadlocks. I agree with you - that would be a bad way to implement filesystem transactions. But we have a lot of other options.
This is kind of an interesting thought; it more closely mirrors how Docker uses OverlayFS to track changes to the entire file system. No need for new file APIs.
> Developers could decide by passing flags to open.
Provided the underlying VFS has implemented them. They may not. Hence the point in the article that some developers only choose to support 'ext4' and nothing else.
> you’ll need enough free space to store both the old and new versions of your data.
The sacrifice is increased write wear on solid state devices.
> It would allow all sorts of useful programs to be written easily
Sure. As long as you don't need multiple processes to access the same file simultaneously. I think the article misses this point, too, in that every FS on a multi-user system is effectively a "distributed system." It's not distributed for _redundancy_, but that doesn't eliminate the attendant challenges.
They say ecryptfs is only supported when it is backed by ext4, which is a bit strange. I wonder if that is documented just so they can close support cases when ecryptfs is used on top of a filesystem that is missing extended attribute support, and whether their code actually checks what is below ecryptfs at all. Usually the application above would not know what is below ecryptfs, so they would need to go out of their way to check it in order to enforce that. I do not use Dropbox, so someone else would have to test whether they actually enforce it, if curious enough.
Yes, a feature like this would need cooperation with the filesystem. But that’s just an implementation problem. That’s like saying we can’t add flexbox to browsers because all the browsers would need to add it. So?
As for wear on SSDs, I don’t think it would increase wear. You’re writing the same number of sectors to the drive. A 2 GB write would still write 2 GB (plus negligible metadata overhead). Why would the drive wear out faster under this scheme?
And I think it would work way better with multiple processes than the existing system. Right now the semantics when multiple processes edit the same file at once are somewhat undefined. With this approach, files would have database like semantics where any reader would either see the state before a write or the state after. It’s much cleaner - since it would become impossible for skewed reads or writes to corrupt a shared file.
Would you argue against the existence of database transactions? Of course not. Nobody does. They’re a great idea, and they’re way easier to reason about and use correctly compared to the POSIX filesystem api. I’m saying we should have the same integrity guarantees on the filesystem. I think if we had those guarantees already, you’d agree too.
Some of the problems transcend POSIX. Someone I know maintains a non-relational db on IBM mainframes. When diving into a data issue, he was gob-smacked to find out that sync'd writes did not necessarily make it to the disk. They were cached in the drive memory and (I think) the disk controller memory. If all failed, data was lost.
This is precisely why well-designed enterprise-grade storage systems disable the drive cache and rely upon some variant of striping to achieve good I/O performance.
Well I think that's the actual problem. POSIX gives you an abstract interface but it essentially does not enforce any particular semantics on those interfaces.
> why is the file API so hard to use that even experts make mistakes?
Sounds like Worse Is Better™: operating systems that tried to present safer abstractions were at a disadvantage compared to operating systems that shipped whatever was easiest to implement.
(I'm not an expert in the history, just observing the surface similarity and hoping someone with more knowledge can substantiate it.)
Jeremy Allison tracked down why POSIX standardized this behavior[0].
The reason is historical and reflects a flaw in the POSIX standards process, in my opinion, one that hopefully won't be repeated in the future. I finally tracked down why this insane behavior was standardized by the POSIX committee by talking to long-time BSD hacker and POSIX standards committee member Kirk McKusick (he of the BSD daemon artwork). As he recalls, AT&T brought the current behavior to the standards committee as a proposal for byte-range locking, as this was how their current code implementation worked. The committee asked other ISVs if this was how locking should be done. The ISVs who cared about byte range locking were the large database vendors such as Oracle, Sybase and Informix (at the time). All of these companies did their own byte range locking within their own applications, none of them depended on or needed the underlying operating system to provide locking services for them. So their unanimous answer was "we don't care". In the absence of any strong negative feedback on a proposal, the committee added it "as-is", and took as the desired behavior the specifics of the first implementation, the brain-dead one from AT&T.
The most egregious part of it for me is that if I open and close a file I might be canceling some other library's lock that I'm completely unaware of.
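i.e. with classic fcntl() record locks, this sequence silently drops the lock (file name made up):

struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };  // l_start = l_len = 0: whole file
int a = open("data.db", O_RDWR);
fcntl(a, F_SETLK, &fl);             // lock acquired through fd 'a'
int b = open("data.db", O_RDONLY);  // some library opens the same file...
close(b);                           // ...and this close releases the lock held via 'a'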
I resisted using them in my SQLite VFS, until I partially relented for WAL locks.
I wish more platforms embraced OFD locks. macOS has them, but hidden. illumos fakes them with BSD locks (which is worse, actually). The BSDs don't add them. So it's just Linux, and Windows with sane locking. In some ways Windows is actually better (supports timeouts).
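For anyone who hasn't used them: on Linux the OFD variants take the same struct flock, just different fcntl() commands (and l_pid must be 0):

#define _GNU_SOURCE            // glibc needs this for F_OFD_SETLK
#include <fcntl.h>

struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET, .l_pid = 0 };
int fd = open("data.db", O_RDWR);
if (fcntl(fd, F_OFD_SETLK, &fl) == -1) {
    // contended; F_OFD_SETLKW would block (no timeout, as noted above)
}
// the lock is owned by this open file description, so an unrelated
// open()/close() of data.db elsewhere in the process can't drop it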
> Sounds like Worse Is Better™: operating systems that tried to present safer abstractions were at a disadvantage compared to operating systems that shipped whatever was easiest to implement.
What about the Windows API? Windows is a pretty successful OS with a less leaky FS abstraction. I know it's a totally different deal than POSIX (files can't be devices etc), the FS function calls require a seemingly absurd number of arguments, but it does seem safer and clearer what's going to happen.
By the way, LMDB's main developer Howard Chu responded to the paper. He said,
> They report on a single "vulnerability" in LMDB, in which LMDB depends on the atomicity of a single sector 106-byte write for its transaction commit semantics. Their claim is that not all storage devices may guarantee the atomicity of such a write. While I myself filed an ITS on this very topic a year ago, http://www.openldap.org/its/index.cgi/Incoming?id=7668 the reality is that all storage devices made in the past 20+ years actually do guarantee atomicity of single-sector writes. You would have to rewind back to 30 years at least, to find a HDD where this is not true.
So this is a case where the programmers of LMDB thought about the "incorrect" use and decided that it was a calculated risk to take because the incorrectness does not manifest on any recent hardware.
This is analogous to the case where someone complains some C code has undefined behavior, and the developer responds by saying they have manually checked the generated assembler to make sure the assembler is correct at the ISA level even though the C code is wrong at the abstract C machine level, and they commit to checking this in the future.
Furthermore both the LMDB issue and the Postgres issue are noted in the paper to be previously known. The paper author states that Postgres documents this issue. The paper mentions pg_control so I'm guessing it's referring to this known issue here: https://wiki.postgresql.org/wiki/Full_page_writes
> We rely on 512 byte blocks (historical sector size of spinning disks) to be power-loss atomic, when we overwrite the "control file" at checkpoints.
This assumption was wrong for Intel Optane memory. Power loss could cut the data stream anywhere in the middle. (Note: the DIMM nonvolatile memory version)
Consumer Optane drives were not "power loss protected", which is very different from not honoring a requested synchronous write.
The crash-consistency problem is very different from the durability-of-real-synchronous-writes problem. There are some storage devices that will lie about sync writes, sometimes hoping that a backup battery will allow them to complete those writes.
System crashes are inevitable; use things like write-ahead logs depending on need, etc. No storage API will get rid of all system crashes, and yes, even Apple games the system by disabling real sync writes, so that will always be a battle.
You're missing the point. GP was mentioning the common assumption that all storage devices from the last 30 years are sector-atomic under power-loss conditions: either the sector is fully written or not written at all. Optane was a rare counterexample, where a sector can end up partially written, thus not sector-atomic.
There are known cases where power loss during a write can corrupt previously written data (data at rest). This is not some rare occurrence. This is why enterprise flash storage devices have power loss protection.
I wish someone would sell an SSD that was at most a firmware update away between regular NVMe drive and ZNS NVMe drive.
The latter just doesn't leave much room for the firmware to be clever and just swallow data.
Maybe also add a pSLC formatting mode for a namespace so one can be explicit about that capability...
It just has to be a drive that's useable as a generic gaming SSD so people can just buy it and have casual fun with it, like they did with Nvidia GTX GPUs and CUDA.
Unfortunately manufacturers almost always prefer price gouging on features that "CuStOmErS aRe NoT GoInG tO nEeD". Is there even a ZNS device available for someone who isn't a hyperscale datacenter operator nowadays?
Either you ask a manufacturer like WD, or you go to ebay AFAIK.
That said, ZNS is actually something specifically about being able to extract more value out of the same hardware (as the firmware no longer causes write amplification behind your back), which in turn means that the value of such a ZNS-capable drive ought to be strictly higher than that of the traditional-only version with the same hardware.
And given that enterprise SSDs seem to only really get value from an OEM's holographic sticker on them (compare almost-new-grade used prices for drives with the sticker vs. the plain SSD/HDD original model number missing the premium sticker), plus the common write-back-emergency capacitors that let a physical write-back cache in the drive ("safely") claim write-through semantics to the host, it should IMO be in the interest of the manufacturers to push ZNS:
ZNS makes, for ZNS-appropriate applications, the exact same hardware perform better despite requiring less fancy firmware.
Also, especially, there's much less need for write-back cache as the drive doesn't sort individual random writes into something less prone to write amplification: the host software is responsible for sorting data together for minimizing write amplification (usually, arranging for data that will likely be deleted together to be physically in the same erasure block).
Also, I'm not sure how exactly "bad" bins of flash behave, but I'd not be surprised if ZNS's support for zones having less usable space than LBA/address range occupied (which can btw. change upon recycling/erasing the zone!) would allow rather poor quality flash to still be effectively utilized, as even rather unpredictable degradation can be handled this way.
Basically, because Copy-on-Write storage systems (like Btrfs, or many modern database backends, specifically LSM-tree ones) inherently need some slack/empty space, it's rather easy to cope with that space decreasing as a result of write operations, regardless of whether the application/user data has actually grown from the writes: you just buy and add another drive/cluster node when you run out of space, and until then you can use 100% of the SSD's flash capacity, instead of wasting capacity up front just so the drive's usable capacity never has to decrease over the warranty period.
Give me, say, a Samsung 990 Pro 2 TB for 250 EUR but with firmware for ZNS-reformatting, instead of the 200 EUR MSRP/173 EUR Amazon.de price for the normal version.
Oh, and please let me use a decent portion of that 2 GB LPDDR4 as controller memory buffer at least if I'm in a ZNS-only formatting situation. It's after all not needed for keeping large block translation tables around, as ZNS only needs to track where physically a logical zone is currently located (wear leveling), and which individual blocks are marked dead in that physical zone (easy linear mapping between the non-contiguous usable physical blocks and the contiguous usable logical blocks). Beyond that, I guess technically it needs to keep track of open/closed zones and write pointers and filled/valid lengths.
Furthermore, I don't even need them to warranty the device lifespan in ZNS, only that it isn't bricked from activating ZNS mode. It would be nice to get as many drive-writes warranty as the non-ZNS version gets, though.
ZNS (Zoned Namespace) technology seems to offer significant benefits by reducing write amplification and improving hardware efficiency. It makes sense that manufacturers would push for ZNS adoption, as it can enhance performance without needing complex firmware. The potential for using lower-quality flash effectively is also intriguing. However, the market dynamics, like the value added by OEM stickers and the need for write-back capacitors, complicate things. Overall, ZNS appears to be a promising advancement for specific applications.
Really? A 512-byte sector could get partially written? Did anyone actually observe this, or was it just a case of Intel CYA saying they didn't guarantee anything?
Yes, really. "Crash-consistent data structures were proposed by enforcing cacheline-level failure-atomicity" see references in: https://doi.org/10.1145/3492321.3519556
> the developer responds by saying they have manually checked the generated assembler to make sure the assembler is correct at the ISA level even though the C code is wrong at the abstract C machine level, and they commit to checking this in the future.
Yeah, that sounds about right for quite a lot of C programmers, except for the "they commit to checking this in the future" part. I've gotten responses like "well, don't upgrade your compiler; I'm gonna put 'Clang >= 9.0 is unsupported' in the README as a fix".
And yet all of these systems basically work for day-to-day operations, and fail only under obscure error conditions.
It is totally acceptable for applications to say "I do not support X conditions". Swap out the file halfway through a read? Sorry, don't support that. Remove power to the storage device in the middle of a sync operation? Sorry, don't support that.
For vital applications, for example databases, this is a known problem and risks of the API are accounted for. Other applications don't have nearly that level of risk associated with them. My music tagging app doesn't need to be resistant to the SSD being struck by lightning.
It is perfectly acceptable to design APIs for 95% of use cases and leave extremely difficult leaks to be solved by the small number of practitioners that really need to solve those leaks.