
> And for this new hash table, the time required for worst-case queries and insertions is proportional to (log x)² — far faster than x.

> The team’s results may not lead to any immediate applications

I don't understand why it wouldn't lead to immediate applications. Is this a situation where analysis of real-world use cases allows you to tune your hash implementation better than what a purely mathematical approach would get you?



I haven't read the paper, but sometimes asymptotic improvements do not translate to real world improvements due to a large multiplicative factor in the complexity that gets factored out in the O() analysis. So the dataset required to see speed-up is impractically large.


In this case "x" is 1/d where d is the unused fraction of space.

So if you leave 0.1% of your hashtable unused, your x is 1000: quite problematic. However, if you leave 12.5% unused, your x is 8: quite reasonable, and not something logarithmic behavior would necessarily speed up, for reasonable constants.
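To put rough numbers on that, here's a quick comparison of the bare bounds (a sketch only; both sides carry hidden constants):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* x = 1/d, where d is the unused fraction of the table */
        double unused[] = { 0.125, 0.01, 0.001 };
        for (int i = 0; i < 3; i++) {
            double x = 1.0 / unused[i];
            /* old worst-case bound ~ x, new one ~ (log2 x)^2 */
            printf("unused=%5.3f  x=%6.0f  (log2 x)^2=%6.1f\n",
                   unused[i], x, pow(log2(x), 2.0));
        }
        return 0;
    }

At 12.5% unused this prints x=8 against (log2 x)^2=9, so the "faster" bound isn't even smaller there; at 0.1% unused it's ~99 against 1000, a 10x gap before constants.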


This is pretty much exactly the case for this algorithm. It is a very slow hash table due to the lack of locality. It seems to only have benefits at very large size and very high load factor.

At that scale, it may be practically better to use a B-tree or a radix tree over a 128-bit hash of the key.


I'm not up to date on the state of the art but I've implemented hash tables a few times and we would expand the hash table whenever it was 75% full, which means x is never greater than 4. Improving the runtime from O(x) to O((log x)^2) doesn't matter when x is so small.
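I.e. the classic growth policy, as a minimal sketch (rehash() is a hypothetical helper):

    /* Grow when the load factor would exceed 3/4. The unused
       fraction d then stays >= 1/4, so x = 1/d never exceeds 4. */
    if ((table->count + 1) * 4 > table->capacity * 3)
        rehash(table, table->capacity * 2);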

I imagine there are some niche memory-constrained applications where you'd let x get larger but I never ran into them personally.


Robin Hood hashing works well with high load factors (~95%) because it balances out lookups with a fairer distribution across buckets. That makes it ideal when you don't want to waste memory on bloated tables.
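For anyone who hasn't seen it, the core of Robin Hood insertion is "take from the rich": whenever the incumbent of a slot is closer to its home bucket than the entry being inserted, they swap. A minimal sketch, assuming a power-of-two table that never completely fills:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint32_t key, val; int used; } slot_t;

    /* Home bucket; any decent hash works (Fibonacci hash shown). */
    static size_t home_of(uint32_t key, size_t mask) {
        return (key * 2654435769u) & mask;
    }

    static void rh_insert(slot_t *slots, size_t mask,
                          uint32_t key, uint32_t val) {
        size_t i = home_of(key, mask);
        size_t dist = 0;                     /* our probe distance */
        for (;;) {
            if (!slots[i].used) {
                slots[i] = (slot_t){ key, val, 1 };
                return;
            }
            size_t theirs = (i - home_of(slots[i].key, mask)) & mask;
            if (theirs < dist) {             /* incumbent is richer: swap */
                slot_t evicted = slots[i];
                slots[i] = (slot_t){ key, val, 1 };
                key = evicted.key;
                val = evicted.val;
                dist = theirs;
            }
            i = (i + 1) & mask;
            dist++;
        }
    }

This is what reduces probe-length variance: no entry ends up much farther from home than any other.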


I think the 0.95 figure for Robin Hood hash tables is rather optimistic. Robin Hood helps in a few ways: it allows for early termination of failed lookups, and it reduces probe-length variance (thereby making the performance of any one hash table operation more predictable). However, it does not reduce the average number of probes needed to perform a successful lookup. The hash-table benchmarks I published last year (https://jacksonallan.github.io/c_cpp_hash_tables_benchmark/#...), which test various hash tables in the 0.44-0.875 load-factor range, show that the performance of Robin Hood tables ("ankerl" and "tsl" in the article) deteriorates rapidly as the load factor climbs high, just like conventional linear- and quadratic-probing tables. This is in contrast to the SIMD and hybrid open-addressing/separate-chaining tables, whose performance is much more stable across the whole load-factor range.


In memory-constrained environments you just make the underlying array a sparse vector; then it can be mostly empty without using much memory at all.


How does this work? Naively I'd have expected that to implement a sparse vector with efficient random access, you'd use a hash table, but I assume this must depend on the sparse vector not being a hash table or otherwise there wouldn't be an improvement. Are there other ways of implementing the efficient random access, or do you sacrifice the performance of random access?


An adaptive radix trie can be, and is, used for sparse arrays: O(log k) where k is the key size, constant for bounded integers.

Alternatively, depending on your data and the sparseness, bit vectors can indicate filled and empty slots, and popcnt can be used to find the actual slot for some index.
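A minimal sketch of that bitmap-plus-popcount trick for a single 64-slot group (roughly how sparse hash tables pack their storage; __builtin_popcountll is a GCC/Clang builtin):

    #include <stdint.h>

    /* One 64-slot group: an occupancy bitmap plus a densely packed
       array holding only the filled slots, in bit order. */
    typedef struct {
        uint64_t  bitmap;   /* bit i set => logical slot i is filled */
        uint32_t *packed;   /* one entry per set bit */
    } sparse_group_t;

    /* Logical index -> position in packed[], or -1 if empty.
       Empty slots cost one bit each instead of a whole entry. */
    static int sparse_lookup(const sparse_group_t *g, unsigned idx) {
        uint64_t bit = 1ULL << idx;
        if (!(g->bitmap & bit))
            return -1;
        /* rank: count the filled slots strictly below idx */
        return __builtin_popcountll(g->bitmap & (bit - 1));
    }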


Search for "sparse hash"; there are many examples.


For this sort of high-load-factor application, it's not so much "memory constrained" in the sense of embedded hardware as it is "memory constrained" in the sense of "my petabyte database table needs 1 TB just to store the index"...


I'm pretty sure nobody uses uniform probing hash tables in the wild. Every time I have wanted to have very high load factors (>90%), cuckoo hashing has been good enough, and below 70-80%, linear probing is blazing fast and absolutely good enough.
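For context, the cuckoo insert loop is short: each key has two candidate slots, and when both are taken you evict one occupant to its alternate slot, cascading until something lands or you give up and rehash. A sketch with made-up hash constants:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_KICKS 32          /* give up and rehash after this */

    typedef struct { uint32_t key; bool used; } cslot_t;

    static size_t h1(uint32_t k, size_t mask) { return (k * 2654435769u) & mask; }
    static size_t h2(uint32_t k, size_t mask) { return (k * 2246822519u) & mask; }

    /* Returns false when the kick chain is too long: grow/rehash then. */
    static bool cuckoo_insert(cslot_t *t, size_t mask, uint32_t key) {
        size_t i = h1(key, mask);
        for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
            if (!t[i].used) {
                t[i] = (cslot_t){ key, true };
                return true;
            }
            uint32_t evicted = t[i].key;     /* kick the occupant out... */
            t[i].key = key;
            key = evicted;
            size_t a = h1(key, mask);        /* ...to its alternate slot */
            i = (i == a) ? h2(key, mask) : a;
        }
        return false;
    }

Lookups probe at most two slots (h1 and h2), which is why the worst case stays O(1) even at high load.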


Linear probing also has better locality of reference.


Last time I needed high occupancy it was for a cache. So I stirred my 32-bit keys with a phi-mul-rshift and, on collision, randomly (xorshift) displaced the picked slot's old value by up to log2(size)/2, with linear probing of up to log2(size).
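(The phi-mul-rshift presumably being Fibonacci hashing; a minimal sketch of that step:)

    #include <stdint.h>

    /* Fibonacci hashing: 2654435769 = round(2^32 / phi). The multiply
       "stirs" the key; the top bits index a table of 2^shift slots. */
    static uint32_t fib_hash(uint32_t key, unsigned shift) {
        return (key * 2654435769u) >> (32 - shift);
    }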


In practice the worst-case operations are avoided by reserving a little more space for the hash table. And the new results come at the cost of slower “good case” insertions.


Complexity analysis and actual systems programming have been diverging for a while. I don't see anything in the paper that will inform practice.


How so?


Even when something isn't a galactic algorithm, "how well it maps to being done efficiently on real hardware with realistic data sizes" is more and more often the important bit, over "slightly better bounds at infinity". The latter is probing the edge of mathematics while the former is probing the edge of practical construction, so they rarely seem to align much anymore at these depths.


> Even when something isn't a galactic algorithm, "how well it maps to being done efficiently on real hardware with realistic data sizes" is more and more often the important bit, over "slightly better bounds at infinity"

You're just repeating the same claim. Why is that more and more often the important bit? Why is it more important now? Why are these factors not captured in complexity analysis?

For instance, some complexity analysis assumes random access memory of arbitrary size, but memory above a certain size is better modelled with logarithmic access time. But this too can be captured in complexity analysis, so it's not really evidence of any divergence.

And then you have cache-oblivious data structures that scale uniformly across all cache sizes, which is a product of complexity analysis.

So I'm asking what exactly is meant, with a justification for why you think this matters more now than it did before.


Ignoring the galactic-algorithm bit (it's in the quote, but I think you get why it would cause a divide here): just because you could try to do something in a field does not mean that's what is actually being done in that field. This is what I mean by probing the edges of mathematics vs. practical construction; using the same toolset to probe isn't enough to unite the two approaches on its own.

Look at this paper as an example. Does its worst case Big O analysis attempt to model memory hierarchies for constructable systems or does it further diverge from any hardware considerations in favor of asymptotic behaviors towards infinities for a generic construction?


This is a good summary, but getting down to exactly what I meant: there aren't any hash table authors or maintainers in the business who have been stopped from trying something because of Yao's conjecture. So this result is not going to unleash a wave of practical innovation.


> Why you think this matters more now than it did before.

More of the low-hanging fruit has been picked over time. The motivation for most of the original algorithms for a lot of computer science problems was practical. Once all (or most) of the optimal solutions for practical purposes have been found, you are necessarily (or almost necessarily) left with only theoretical solutions.


I think it's because computers are more complicated than they used to be. They have more features that need to be accounted for. This in turn happened because CPU frequencies stopped increasing, so we had to find other ways to make things go faster: SIMD, threads, speculative execution, caches.


If all of the latencies to the main memory and L1/L2/L3 caches have been only increasing throughout the years, is picking the right data memory layout more or less important than before?


>If all of the latencies to the main memory and L1/L2/L3 caches have been only increasing throughout the years

That's actually not the case at all: both L1 and L2 (and to a lesser extent L3) have improved dramatically in the past years. L1 and L2 effectively run at a clock very similar to the CPU's; e.g. L1 nowadays could be 2 CPU cycles. As for L3, it does depend on the architecture and where the cache coherency occurs, yet generally the latency has not increased.

The main memory (DRAM) latency has been more or less similar, though.

>is picking the right data memory layout more or less important than before?

It has always been important; it's not about the latencies per se but the locality, which has been the case for more than two decades. The general rules are: 1st, if your data set is small (N is tiny), it's effectively O(1), as N is a constant; 2nd, pick the big-O that works best (usually O(1) or O(log N)); 3rd, pick the data structure with the best locality.


> That's actually not the case

I don't know where you get the data, but the latencies are surely not decreasing, and 2 cycles for L1 is totally wrong. I should perhaps have been more specific: by latency I meant latency in terms of CPU cycles, not the actual number in ns, which obviously depends on the CPU frequency.

Examining the past few generations of both Intel and AMD microarchitectures, this is what I concluded: L1 is between 4 and 6 cycles. L2 on Intel is between 12 and 16 cycles, with AMD roughly in that range but 1-2 cycles faster. L3 on AMD is ~40 cycles, whereas L3 on Intel goes anywhere from 60 up to 120 cycles.

It is an engineering feat to increase the capacity of a cache without sacrificing at least a little bit of latency. The latency increase is sometimes less visible due to other effects, but generally speaking this is what happens.

> The main memory (DRAM) latency has been more or less similar, though.

~70 ns of DRAM latency looks different to a CPU core running at 3 GHz than to one running at 5 GHz: the former is ~210 cycles while the latter is ~350 cycles. So the higher the core clock, the higher the latency in terms of CPU cycles stalled.

> It has always been important

Yes, I am aware of that; I am only trying to make a case for the parent comment, since it was asking for clarification of why it would be more important today than it had been (at all) in the past.

> it's not about the latencies per se but the locality

I think it's vice versa because the actual issue we are trying to mitigate is the high DRAM latency. As a consequence it just happens that we use data locality to minimize those effects.


The answer is obvious, but I don't really consider that evidence of systems programming and complexity analysis diverging. When approaching any given problem, you would still start with a basic complexity analysis to get in the right order of magnitude, and then select among cache-friendly approaches that fall in that range. You almost never go the other way around.


Can the reality of hardware be modeled with complexity analysis? Yes.

Is it, in practice? I haven't seen it.


A contributing factor is how complex/smart the current hardware is. It can have cache line size, different forms of hardware/software prefetching, different ports for different ops, memory latency, simd extensions. These leave many opportunities for algorithms to be optimized over. There is also the issue of real life scenarios not matching with asymptotic ones (due to e.g., size of input), which when coupled with previous factors, leads to even more potential optimization schemes. Obligatory reference: https://agner.org/optimize/


A couple things immediately come to mind

1) complexity analysis ignores coefficients which can make a huge difference, especially since computers usually have bounds

2) real life may influence the likelihood of best/worst case. I think data tends to be somewhat sorted in practice so algorithms with best case on sorted data perform better


Complexity analysis typically assumes an ideal von Neumann machine. Systems programming embraces the discontinuities in registers, L1, L2, L3, ILP, branch prediction, and so on. (Maybe there's a good pun to be had that complexity analysis stays as far away from computer complexity as possible.) The simple model is essential, as it's the only way to create a body of proofs that will last. What's increasingly hard, though, is that there are oodles of papers of the sort demonstrating that one algorithm or data structure is better than another, yada yada, and there's usually some benchmarking in each, and that kind of empirical demonstration is now far less useful unless the paper's authors are good systems programmers and have given up on improvements.


Exactly. It seems like the von Neumann machine assumption is outdated, especially with modern supercomputers. Memory access itself isn't O(1); it's more like O(N^(1/3)), because space is a serious consideration, and that's how the distance from a focal point to all other points scales within a 3D volume.


Yet memory access can be limited by latency and bandwidth. Predictable access patterns (e.g. a linear scan) tend to enjoy prefetch, which hides the latency that random access would incur.


Most real-world hash table implementations are not "theoretical" but depend on "real" parameters like L2 cache size, assembly instruction sizes, etc.


Real-world hash tables can either be resized to avoid worst-case degradation or be provisioned with a good amount of slack for when the worst expected case occurs. (E.g. a hash table used as a cache will apply pressure to evict old entries before it gets too full.)

Bucket hashing obviously has no problem finding a free spot. I didn't say "chained" on purpose, because chains can be resizable vectors, so we load at most one pointer when going from table to bucket.


A bucket can evolve into a red/black tree if there are too many entries, giving O(log K) lookups (K being the number of collisions). Still, bucket-based tables tend to be outclassed (in both latency and memory footprint) by the dumbest linear probing.


Some hash tables promote updated entries to the front of the chain for faster access next time. It occurs to me that a good old splay tree would do that.

The thing is, you're not supposed to have buckets so large that you need a fancy data structure. If the table is not resizeable, though, that might not be avoidable.


It's more like real-world hash tables are tailored to the specific pattern of querying and inserting data.

If you know empirically that 80% of your requests go to a tiny subset of hot keys, you make a specific hash set just for those hot keys with constant access time, while keeping the rest of the table cold (sketch below).

Your L2 cache is an example of such an optimization: high-bandwidth, low-latency memory on the CPU die that is orders of magnitude faster than a random DRAM access. A tiered memory model.
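As a sketch, the hot/cold split mentioned above might look like this (tiered_t, small_find, and big_find are all hypothetical names):

    #include <stdint.h>

    /* Two-tier lookup: a tiny cache-resident table for the hot keys,
       with the large cold table only touched on a miss. */
    static uint32_t *tiered_lookup(tiered_t *t, uint32_t key) {
        uint32_t *v = small_find(&t->hot, key);  /* hot: fits in cache */
        if (v)
            return v;                            /* ~80% end here */
        return big_find(&t->cold, key);          /* cold: the big table */
    }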


Isn’t the problem that the scaling behavior only dominates with infinite n?

If the constant factor is large enough (it doesn't go into the scaling rule), something that scales as (log x)² could still be more expensive than something that scales linearly with x for all x smaller than, say, 2^100.


It's extremely unlikely that the constants would be nearly that large, but yes.


There are practically important algorithms like this. There are three linear-time algorithms for constructing a suffix array, but all of them have rather large constant factors. Packrat parsing is linear in the size of the input string, but can easily execute more than 100 instructions per byte. And there are numerous theoretically asymptotically superior algorithms whose constant factors are too high for them to be useful at all, the so-called "galactic algorithms", https://en.m.wikipedia.org/wiki/Galactic_algorithm.

There is nothing in the article suggesting that this is such a case, however.


It improves the worst-case cost for a nearly full hash map; it raises the cost in other cases.


Also, correct me if I'm wrong, but isn't there a slight added memory complexity from these tiny pointers?


From what I understood, they are just "reserved" areas, e.g. if you have 200 slots, the first 100 are the first "area", then the next 50, then 25, etc.


Could it be used to optimize battery charging speed? It sounds like there's some parallel, but I was interested in an informed view.


Perhaps the technique requires a lot of additional metadata, so that you could fit a 50% full "normal" hash table in less memory than it takes to store a 99% full hash table using this new approach. Thus the normal hash table can always outperform the new hash table in practice despite worse big O performance because it doesn't hit the pathological worst case except in situations where the new hash table would have run out of memory.



