Why does the author use an explicit fence in the first and second example instead of just cmpxchg with acquire-release semantics (std::memory_order_acq_rel in C++) (and they already load with acquire semantics in the reader example[1])? Acquire semantics prevent later program-order loads and stores from being reordered before the seqlock acquisition. Edit: I guess Java does not have an Acq-Rel compare_exchange method, for some reason, but you could use the stronger seq-cst form. (I think it is somewhat moot on x86, where they all boil down to the same instruction.)
Also as a meta consideration, I wonder in what situations you would use a seqlock. I would worry about cache line contention on the seqlock for this mechanism, compared to something like hazard pointers. And frequent writers can arbitrarily block readers -- I guess if that is an issue, you're very far from a good application of this mechanism.
I don’t work with Java atomics, but I really don’t like the OP’s implementation.
The loadLoadFence just looks pointless and inefficient (and wrong?). Acquire should be sufficient, and reasonable architectures (Aarch64 and x86 at least) have cheap acquires. But most (?) memory models don’t have any ordering between a relaxed load and a subsequent acquire, so read is simply incorrect. On x86, the right answer is to do all the loads, including the data, as acquires. On aarch64, for very large data, it’s plausibly faster to use relaxed loads and a fence, or maybe the memory model is stronger than I think it is. But the code as written appears unsafe in general.
The write function is IMO terrible. Don’t roll your own crappy spinlock! Either just declare as a precondition that there is only one writer or use an actual mutex in the library. Modern mutexes are much, much more efficient than the nasty algorithm in the OP, and saving a single word of memory by combining the sequence and the mutex is simply not worthwhile.
Indeed the load/load fence must be before the "final long currentVersion" load—which in turn can be a relaxed load—not after the first load.
On x86 this produces optimal code because the load/load fence does not need to be compiled into a processor instruction (LFENCE and SFENCE exist but are only used in very special cases; they aren't needed for normal memory and normal load instructions).
You answered it yourself, I would think it’s because Java doesn’t have that. Using the stronger form emits stronger instructions that have a bigger performance impact. No point paying for that if you don’t need it. If you’re using this kind of thing, it’s specifically because performance really matters and you want to avoid that synchronized compare and exchange that a mutex uses. Otherwise why wouldn’t you just use a mutex or reader writer mutex.
> Using the stronger form emits stronger instructions that have a bigger performance impact.
I mean, I don’t think that's true in this case. Both forms generate the same instruction on x86 (lock cmpxchg for both). Godbolt: https://godbolt.org/z/vcMK9j6WP (And it appears to be identical on ARM64 as well.)
> If you’re using this kind of thing, it’s specifically because performance really matters and you want to avoid that synchronized compare and exchange that a mutex uses.
A full fence is not any less expensive than a seq-cst cmpxchg.
Do you have any idea why Java is missing acq-rel cmpxchg? I know they are late to the reasonable memory model party but it seems like an odd choice not to just import the C++ memory model wholesale. It's proved to be a decent model in practice!
I think we’re talking past each other. What I mean is that if you only need a load fence, that’s a lot cheaper (free on x64 if memory serves, just a compiler barrier) than lock cmpxchg. A mutex uses lock cmpxchg, so you’re arguably worse off than just using a mutex (or a reader writer mutex, which has similar performance characteristics for seldom write workloads.)
About C++ using the same instruction for both, that’s up to their implementation. There’s no requirement that the more relaxed atomics actually be more relaxed. Only that the stronger atomics don’t permit more relaxed behavior.
There may also be other requirements in the spec, that I’m not aware of, that require using the strong compare exchange.
It’s been years since I’ve done any substantial lock free programming, I may be remembering incorrectly.
A properly efficient sequence lock (no compare exchange or full acquire fence, relaxed loads for reads--which are less expensive on ARM than acquire loads, for good architectural reasons) isn't actually possible to write in C++ and I believe this was in fact a motivation for why Java atomics are the way they are. See https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf. I suspect this is why Java added load-load.
It has always been possible to write it in C++ since C++11, since acquire fences are cheap enough, but not in Java until loadFence() was introduced in JDK 8.
The paper you cite proposed compiling an atomic add-zero to MFENCE+MOV, which is 5 times slower than the rest of the seqlock access, just because fences were not in Java at the time and are "a very difficult-to-use and
C++11/C11-specific construct". That part of the paper has never made sense to me, and certainly does not matter now that Java has fences. Just wrap the seqlock into a primitive that hides the fence and call it a day.
Compilers are usually very reluctant to add "surprising" optimizations to atomics, even if they're fully justified by the standard, because (1) there are a lot of compiler bugs and/or holes in the standard that are only exposed after combining multiple optimizations involving atomics, (2) most programs explicitly use atomics only rarely, making it not the most productive thing to optimize, and (3) the programs and libraries that do use them are often reliant on them for synchronization, "liveness," or performance behavior that's not mandated by the standard. Common pathological examples include checks that assume the compiler won't skip assigning values if it can prove there's some obscure possible execution where the intermediate value can never be read, assuming that checking the result of a relaxed load in an if statement will guarantee that accesses run in the if will be sequenced after the read, etc.
> If you’re using this kind of thing ... why wouldn’t you just use a mutex or reader writer mutex.
>> you want the readers to be able to read multiple values atomically.
I read this in TFA, and interpreted it to mean you could read multiple values atomically, which would be an answer to your question - mutexes only protect one thing and you can't combine them atomically.
But looking at the code, it seems that this does no better? It only allows reading of multiple values insofar as it protects one hard-coded struct which happens to have multiple fields.
Interesting, but I'm still of the opinion that it's better to just avoid k-CAS in the first place. This is usually possible with sufficient control over your memory allocation/deallocation patterns.
(Also, I really hate spinlocks, since the scheduler can totally ruin them - with a good atomic algorithm, you can guarantee that if you have to loop, it is only because somebody is is actively making progress, with no possibility of them being unscheduled).
If, by control of allocation you mean doing something like RCU or its garbage-collected equivalent, sure, if it’s reasonable in your environment, go for it. But seqlocks and their variants produce no garbage and can be extremely efficient in appropriate contexts.
(The Linux vDSO, for example, cannot usefully participate in GC or RCU. seqlocks it is.)
“Almost” means that you can have n copies of the data, and the probability of waiting scales like 2^-n. Alternatively, and perhaps better, the degree of relative slowness of the reader needed to cause a wait goes like 2^n, so for any reasonable n, reads will never wait in practice. And it’s genuinely lock-free: the writer can hang at any point and read will continue to complete.
Somewhat sadly, this algorithm doesn’t seem to get much traction. I’ve considered implementing it for Linux clock reads to reduce tail latency. It’s really quite simple, and the common case works just like a seqlock.
I need to dig up the terms and conditions of this journal and post a non-paywalled version somewhere.
(Lock-free means that each thread is guaranteed to make progress even if any combination of other threads stop at arbitrary points. Normal seqlocks are not lock-free. Wait-free means that each thread completes each operation in a bounded about of running time, regardless of what the other threads do. In other words, livelocks are impossible. One can straightforwardly use a two copies of a data structure, write to them alternately, and protect them with the obvious sequence count, and get a lock-free, but not wait-free, data structure. The linked algorithm extends this to more than 2 copies, written with exponentially decreasing frequency.)
In Linux kernel context, you could plausibly prevent write stalls (other than NMIs) by masking interrupts for the write section. I guess there are some seriously slow NMIs though and we would prefer something that worked well despite those.
This doesn’t work in a VM or if there’s an SMI or if there is any other glitch.
There’s a more fundamental and harder-to-solve issue, though: time (CLOCK_MONOTONIC or whatever) is a function of the underlying clock source (TSC, for example), and that function is defined piecewise (and must be if it changes in ways that are not known in advance). And, fundamentally, if the thread evaluating the function after some change gets ahead of the thread making the change in the first place, it’s awkward to get the right answer without waiting.
The best I’ve come up with is for the thread running the show to record that, for TSC values above t, a new function is used, and to do this well in advance of t. I don’t know a way to do an atomic commit of a change if and only iff TSC < some threshold.
EPVS is used in Microsoft FASTER, a KV storage engine that uses a hybrid log as its primary data structure: https://dl.acm.org/doi/10.1145/3183713.3196898, which is OSS and used by Microsoft (where I work), with implementations in C++ and C#: https://microsoft.github.io/FASTER/