
> And for this new hash table, the time required for worst-case queries and insertions is proportional to (log x)² — far faster than x.

> The team’s results may not lead to any immediate applications

I don't understand why it wouldn't lead to immediate applications. Is this a situation where analysis of real-world use cases allows you to tune your hash implementation better than what a purely mathematical approach would get you?



I haven't read the paper, but sometimes asymptotic improvements do not translate to real world improvements due to a large multiplicative factor in the complexity that gets factored out in the O() analysis. So the dataset required to see speed-up is impractically large.


In this case "x" is 1/d where d is the unused fraction of space.

So if you leave 0.1% of your hashtable unused, your x is 1000: quite problematic. However, if you leave 12.5% unused, your x is 8: quite reasonable, and not something logarithmic behavior would necessarily speed up, for reasonable constants.
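To put rough numbers on that, here's a quick comparison of the bare bounds (a sketch only; both sides carry hidden constants):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* x = 1/d, where d is the unused fraction of the table */
        double unused[] = { 0.125, 0.01, 0.001 };
        for (int i = 0; i < 3; i++) {
            double x = 1.0 / unused[i];
            /* old worst-case bound ~ x, new one ~ (log2 x)^2 */
            printf("unused=%5.3f  x=%6.0f  (log2 x)^2=%6.1f\n",
                   unused[i], x, pow(log2(x), 2.0));
        }
        return 0;
    }

At 12.5% unused this prints x=8 against (log2 x)^2=9, so the "faster" bound isn't even smaller there; at 0.1% unused it's ~99 against 1000, a 10x gap before constants.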


This is pretty much exactly the case for this algorithm. It is a very slow hash table due to the lack of locality. It seems to only have benefits at very large size and very high load factor.

At that scale, it may be practically better to use a B-tree or a radix tree over a 128-bit hash of the key.


I'm not up to date on the state of the art but I've implemented hash tables a few times and we would expand the hash table whenever it was 75% full, which means x is never greater than 4. Improving the runtime from O(x) to O((log x)^2) doesn't matter when x is so small.
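I.e. the classic growth policy, as a minimal sketch (rehash() is a hypothetical helper):

    /* Grow when the load factor would exceed 3/4. The unused
       fraction d then stays >= 1/4, so x = 1/d never exceeds 4. */
    if ((table->count + 1) * 4 > table->capacity * 3)
        rehash(table, table->capacity * 2);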

I imagine there are some niche memory-constrained applications where you'd let x get larger but I never ran into them personally.


Robin Hood hashing works well with high load factors (~95%) because it balances out lookups with a fairer distribution across buckets. That makes it ideal when you don't want to waste memory on bloated tables.
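For anyone who hasn't seen it, the core of Robin Hood insertion is "take from the rich": whenever the incumbent of a slot is closer to its home bucket than the entry being inserted, they swap. A minimal sketch, assuming a power-of-two table that never completely fills:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint32_t key, val; int used; } slot_t;

    /* Home bucket; any decent hash works (Fibonacci hash shown). */
    static size_t home_of(uint32_t key, size_t mask) {
        return (key * 2654435769u) & mask;
    }

    static void rh_insert(slot_t *slots, size_t mask,
                          uint32_t key, uint32_t val) {
        size_t i = home_of(key, mask);
        size_t dist = 0;                     /* our probe distance */
        for (;;) {
            if (!slots[i].used) {
                slots[i] = (slot_t){ key, val, 1 };
                return;
            }
            size_t theirs = (i - home_of(slots[i].key, mask)) & mask;
            if (theirs < dist) {             /* incumbent is richer: swap */
                slot_t evicted = slots[i];
                slots[i] = (slot_t){ key, val, 1 };
                key = evicted.key;
                val = evicted.val;
                dist = theirs;
            }
            i = (i + 1) & mask;
            dist++;
        }
    }

This is what reduces probe-length variance: no entry ends up much farther from home than any other.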


I think the 0.95 figure for Robin Hood hash tables is rather optimistic. Robin Hood helps in a few ways: it allows for early termination of failed lookups, and it reduces probe-length variance (thereby making the performance of any one hash table operation more predictable). However, it does not reduce the average number of probes needed to perform a successful lookup. The hash-table benchmarks I published last year (https://jacksonallan.github.io/c_cpp_hash_tables_benchmark/#...), which test various hash tables in the 0.44-0.875 load-factor range, show that the performance of Robin Hood tables ("ankerl" and "tsl" in the article) deteriorates rapidly as the load factor climbs high, just like conventional linear- and quadratic-probing tables. This is in contrast to the SIMD and hybrid open-addressing/separate-chaining tables, whose performance is much more stable across the whole load-factor range.


In memory-constrained environments you just make the underlying array a sparse vector; then it can be mostly empty without using much memory at all.


How does this work? Naively I'd have expected that to implement a sparse vector with efficient random access, you'd use a hash table, but I assume this must depend on the sparse vector not being a hash table or otherwise there wouldn't be an improvement. Are there other ways of implementing the efficient random access, or do you sacrifice the performance of random access?


An adaptive radix trie can be, and is, used for sparse arrays: O(log k) where k is the key size, constant for bounded integers.

Alternatively, depending on your data and the sparseness, bit vectors can indicate filled and empty slots, and popcnt can be used to find the actual slot for some index.
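A minimal sketch of that bitmap-plus-popcount trick for a single 64-slot group (roughly how sparse hash tables pack their storage; __builtin_popcountll is a GCC/Clang builtin):

    #include <stdint.h>

    /* One 64-slot group: an occupancy bitmap plus a densely packed
       array holding only the filled slots, in bit order. */
    typedef struct {
        uint64_t  bitmap;   /* bit i set => logical slot i is filled */
        uint32_t *packed;   /* one entry per set bit */
    } sparse_group_t;

    /* Logical index -> position in packed[], or -1 if empty.
       Empty slots cost one bit each instead of a whole entry. */
    static int sparse_lookup(const sparse_group_t *g, unsigned idx) {
        uint64_t bit = 1ULL << idx;
        if (!(g->bitmap & bit))
            return -1;
        /* rank: count the filled slots strictly below idx */
        return __builtin_popcountll(g->bitmap & (bit - 1));
    }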


Search for "sparse hash"; there are many examples.


For this sort of high-load-factor application, it's not so much "memory constrained" in the sense of embedded hardware as it is "memory constrained" in the sense of "my petabyte database table needs 1 TB just to store the index"...


I'm pretty sure nobody uses uniform probing hash tables in the wild. Every time I have wanted to have very high load factors (>90%), cuckoo hashing has been good enough, and below 70-80%, linear probing is blazing fast and absolutely good enough.
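For context, the cuckoo insert loop is short: each key has two candidate slots, and when both are taken you evict one occupant to its alternate slot, cascading until something lands or you give up and rehash. A sketch with made-up hash constants:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_KICKS 32          /* give up and rehash after this */

    typedef struct { uint32_t key; bool used; } cslot_t;

    static size_t h1(uint32_t k, size_t mask) { return (k * 2654435769u) & mask; }
    static size_t h2(uint32_t k, size_t mask) { return (k * 2246822519u) & mask; }

    /* Returns false when the kick chain is too long: grow/rehash then. */
    static bool cuckoo_insert(cslot_t *t, size_t mask, uint32_t key) {
        size_t i = h1(key, mask);
        for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
            if (!t[i].used) {
                t[i] = (cslot_t){ key, true };
                return true;
            }
            uint32_t evicted = t[i].key;     /* kick the occupant out... */
            t[i].key = key;
            key = evicted;
            size_t a = h1(key, mask);        /* ...to its alternate slot */
            i = (i == a) ? h2(key, mask) : a;
        }
        return false;
    }

Lookups probe at most two slots (h1 and h2), which is why the worst case stays O(1) even at high load.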


Linear probing also has better locality of reference.


Last time I needed high occupancy it was for a cache. So I stirred my 32-bit keys with a phi-mul-rshift and, on collision, randomly (xorshift) displaced the picked slot's old value by up to log2(size)/2, with linear probing of up to log2(size).
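(The phi-mul-rshift presumably being Fibonacci hashing; a minimal sketch of that step:)

    #include <stdint.h>

    /* Fibonacci hashing: 2654435769 = round(2^32 / phi). The multiply
       "stirs" the key; the top bits index a table of 2^shift slots. */
    static uint32_t fib_hash(uint32_t key, unsigned shift) {
        return (key * 2654435769u) >> (32 - shift);
    }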


In practice the worst-case operations are avoided by reserving a little more space for the hash table. And the new results come at the cost of slower “good case” insertions.


Complexity analysis and actual systems programming have been diverging for a while. I don't see anything in the paper that will inform practice.


How so?


Even when something isn't a galactic algorithm, "how well it maps to being done efficiently on real hardware with realistic data sizes" is more and more often the important bit, over "slightly better bounds at infinity". The latter is probing the edge of mathematics while the former is probing the edge of practical construction, so they rarely seem to align much anymore at these depths.


> Even when something isn't a galactic algorithm, "how well it maps to being done efficiently on real hardware with realistic data sizes" is more and more often the important bit, over "slightly better bounds at infinity"

You're just repeating the same claim. Why is that more and more often the important bit? Why is it more important now? Why are these factors not captured in complexity analysis?

For instance, some complexity analysis assumes random access memory of arbitrary size, but memory above a certain size is better modelled with logarithmic access time. But this too can be captured in complexity analysis, so it's not really evidence of any divergence.

And then you have cache-oblivious data structures that scale uniformly across all cache sizes, which is a product of complexity analysis.

So I'm asking what exactly is meant, with a justification for why you think this matters more now than it did before.


Ignoring the galactic-algorithm bit (it's in the quote, but I think you get why it would cause a divide here): just because you could try to do something in a field does not mean that's what is actually being done in that field. This is what I mean by probing the edges of mathematics vs. practical construction; using the same toolset to probe isn't enough to unite the two approaches on its own.

Look at this paper as an example. Does its worst case Big O analysis attempt to model memory hierarchies for constructable systems or does it further diverge from any hardware considerations in favor of asymptotic behaviors towards infinities for a generic construction?


This is a good summary, but getting down to exactly what I meant: there aren't any hash table authors or maintainers in the business who have been stopped from trying something because of Yao's conjecture. So this result is not going to unleash a wave of practical innovation.


> Why you think this matters more now than it did before.

More of the low-hanging fruit has been picked over time. The motivation for most of the original algorithms for a lot of computer science problems was practical. Once all (or most) of the optimal solutions for practical purposes have been found, you are necessarily (or almost necessarily) left with only theoretical solutions.


I think it's because computers are more complicated than they used to be. They have more features that need to be accounted for. This in turn happened because CPU frequencies stopped increasing, so we had to find other ways to make things go faster: SIMD, threads, speculative execution, caches.


If all of the latencies to the main memory and L1/L2/L3 caches have been only increasing throughout the years, is picking the right data memory layout more or less important than before?


>If all of the latencies to the main memory and L1/L2/L3 caches have been only increasing throughout the years

That's actually not the case at all: both L1 and L2 (and to a lesser extent L3) have improved dramatically in the past years. L1 and L2 effectively run at a clock very similar to the CPU's; e.g. L1 nowadays could be 2 CPU cycles. As for L3, it does depend on the architecture and where the cache coherency occurs, yet generally the latency has not increased.

The main memory (DRAM) latency has been more or less similar, though.

>is picking the right data memory layout more or less important than before?

It has always been important; it's not about the latencies per se but the locality, which has been the case for more than two decades. The general rules are: 1st, if your data set is small (N is tiny), it's effectively O(1), as N is a constant; 2nd, pick the big-O that works best (usually O(1) or O(log N)); 3rd, pick the data structure with the best locality.


> That's actually not the case

I don't know where you get the data, but the latencies are surely not decreasing, and 2 cycles for L1 is totally wrong. I should perhaps have been more specific: by latency I meant latency in terms of CPU cycles, not the actual number in ns, which obviously depends on the CPU frequency.

Examining the past few generations of both Intel and AMD microarchitectures, this is what I concluded: L1 is between 4 and 6 cycles. L2 on Intel is between 12 and 16 cycles, with AMD roughly in that range but 1-2 cycles faster. L3 on AMD is ~40 cycles, whereas L3 on Intel goes anywhere from 60 up to 120 cycles.

It is an engineering feat to increase the capacity of a cache without sacrificing at least a little bit of latency. The latency increase is sometimes less visible due to other effects, but generally speaking this is what happens.

> The main memory (DRAM) latency has been more or less similar, though.

~70 ns of DRAM latency looks different to a CPU core running at 3 GHz than to one running at 5 GHz: the former is ~210 cycles while the latter is ~350 cycles. So the higher the core clock, the higher the latency in terms of CPU cycles stalled.

> It has always been important

Yes, I am aware of that; I am only trying to make a case for the parent comment, since it was asking for clarification of why it would be more important today than it had been (at all) in the past.

> it's not about the latencies per se but the locality

I think it's vice versa because the actual issue we are trying to mitigate is the high DRAM latency. As a consequence it just happens that we use data locality to minimize those effects.


The answer is obvious, but I don't really consider that evidence of systems programming and complexity analysis diverging. When approaching any given problem, you would still start with a basic complexity analysis to get in the right order of magnitude, and then select among cache-friendly approaches that fall in that range. You almost never go the other way around.


Can the reality of hardware be modeled with complexity analysis? Yes.

Is it, in practice? I haven't seen it.


A contributing factor is how complex/smart the current hardware is. It can have cache line size, different forms of hardware/software prefetching, different ports for different ops, memory latency, simd extensions. These leave many opportunities for algorithms to be optimized over. There is also the issue of real life scenarios not matching with asymptotic ones (due to e.g., size of input), which when coupled with previous factors, leads to even more potential optimization schemes. Obligatory reference: https://agner.org/optimize/


A couple things immediately come to mind

1) complexity analysis ignores coefficients which can make a huge difference, especially since computers usually have bounds

2) real life may influence the likelihood of best/worst case. I think data tends to be somewhat sorted in practice so algorithms with best case on sorted data perform better


Complexity analysis typically assumes an ideal von Neumann machine. Systems programming embraces the discontinuities in registers, L1, L2, L3, ILP, branch prediction, and so on. (Maybe there's a good pun to be had that complexity analysis stays as far away from computer complexity as possible.) The simple model is essential, as it's the only way to create a body of proofs that will last. What's increasingly hard, though, is that there are oodles of papers of the sort demonstrating that one algorithm or data structure is better than another, yada yada, and there's usually some benchmarking in each, and that kind of empirical demonstration is now far less useful unless the paper's authors are good systems programmers and have given up on improvements.


Exactly. It seems like the von Neumann machine assumption is outdated, especially with modern supercomputers. Memory access itself isn't O(1); it's more like O(N^(1/3)), because space is a serious consideration, and that's how the distance from a focal point to all other points scales within a 3D volume.


Yet memory access can be limited by latency and bandwidth. Predictable access patterns (e.g. a linear scan) tend to enjoy prefetch, which hides the latency that random access would incur.


Most real-world hash table implementations are not "theoretical" but depend on "real" parameters like L2 cache size, assembly instruction sizes, etc.


Real-world hash tables can either be resized to avoid worst-case degradation or be provisioned with a good amount of slack for when the worst expected case occurs. (E.g. a hash table used as a cache will apply pressure to evict old entries before it gets too full.)

Bucket hashing obviously has no problem finding a free spot. I didn't say "chained" on purpose, because chains can be resizable vectors, so we load at most one pointer when going from table to bucket.


A bucket can evolve into a red/black tree if there are too many entries, giving O(log K) lookups (K being the number of collisions). Still, bucket-based tables tend to be outclassed (in both latency and memory footprint) by the dumbest linear probing.


Some hash tables promote updated entries to the front of the chain for faster access next time. It occurs to me that a good old splay tree would do that.

The thing is, you're not supposed to have buckets so large that you need a fancy data structure. If the table is not resizeable, though, that might not be avoidable.


It's more like real-world hash tables are tailored to the specific pattern of querying and inserting data.

If you know empirically that 80% of your requests go to a tiny subset of hot keys, you make a specific hash set just for those hot keys with constant access time, while keeping the rest of the table cold (sketch below).

Your L2 cache is an example of such an optimization: high-bandwidth, low-latency memory on the CPU die that is orders of magnitude faster than a random DRAM access. A tiered memory model.
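As a sketch, the hot/cold split mentioned above might look like this (tiered_t, small_find, and big_find are all hypothetical names):

    #include <stdint.h>

    /* Two-tier lookup: a tiny cache-resident table for the hot keys,
       with the large cold table only touched on a miss. */
    static uint32_t *tiered_lookup(tiered_t *t, uint32_t key) {
        uint32_t *v = small_find(&t->hot, key);  /* hot: fits in cache */
        if (v)
            return v;                            /* ~80% end here */
        return big_find(&t->cold, key);          /* cold: the big table */
    }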


Isn’t the problem that the scaling behavior only dominates with infinite n?

If the constant factor is large enough (it doesn't go into the scaling rule), something that scales as (log x)² could still be more expensive than something that scales linearly with x for all x smaller than, say, 2^100.


It's extremely unlikely that the constants would be nearly that large, but yes.


There are practically important algorithms like this. There are three linear-time algorithms for constructing a suffix array, but all of them have rather large constant factors. Packrat parsing is linear in the size of the input string, but can easily execute more than 100 instructions per byte. And there are numerous theoretically asymptotically superior algorithms whose constant factors are too high for them to be useful at all, the so-called "galactic algorithms", https://en.m.wikipedia.org/wiki/Galactic_algorithm.

There is nothing in the article suggesting that this is such a case, however.


It improves the worst-case cost for a nearly full hash map; it raises the cost in other cases.


Also, correct me if I'm wrong, but isn't there a slight added memory complexity from these tiny pointers?


From what I understood, they are just "reserved" areas, e.g. if you have 200 slots, the first 100 are the first "area", then the next 50, then 25, etc.


Could it be used to optimize battery charging speed? It sounds like there's some parallel, but I was interested in an informed view.


Perhaps the technique requires a lot of additional metadata, so that you could fit a 50% full "normal" hash table in less memory than it takes to store a 99% full hash table using this new approach. Thus the normal hash table can always outperform the new hash table in practice despite worse big O performance because it doesn't hit the pathological worst case except in situations where the new hash table would have run out of memory.



