
So this thing is based on Kiwix, which is based on the ZIM file format.

Meanwhile, Wikipedia ships Wikidata, which uses RDF dumps (and probably 8x less compressed than it should be).

https://www.wikidata.org/wiki/Wikidata:Database_download

There is room for a third option leveraging commercial columnar database research.

https://adsharma.github.io/duckdb-wikidata-compression/



And for those who are only vaguely familiar, this ZIM file format is not the same as the https://zim-wiki.org one.


I am actually only vaguely familiar, and I wondered about that every time I saw the format referenced but never bothered to check. Your comment is informative!


Yeah, I'm a long-time user/disciple of https://zim-wiki.org ; it was basically Obsidian but 15-20 years early. To do some of the things that are now trivially easy with Obsidian, I learned scripting and such, so I'm familiar with this very weird coincidence/name collision.


> and probably 8x less compressed than it should be

ZIM uses zstd, so it is pretty compressed, but the thing that takes a lot of room is actually the full-text search index built into each ZIM file.

Unfortunately the UI of kiwix-serve search doesn't take full advantage of this and the search experience kinda sucks...

Have you done anything useful with RDF? It seems like just one of those things universities spend money on that doesn't really do anything.


I'm really curious about what the world of archival formats is like. Is there consensus? Are the most-used formats actually any good, well-supported, and self-documenting?


The Library of Congress has some well-considered recommendations for archival. https://www.loc.gov/preservation/resources/rfs/TOC.html

For web content they recommend gzipped WARC. This is great for retaining the content, but isn’t easy to search or render.

I do WARC dumps then convert those to ZIM for easier access.
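
For anyone wondering what "gzipped WARC" looks like in practice, here is a rough stdlib-only sketch, not my actual pipeline. It assumes the common convention of compressing each record as its own gzip member, and the records are deliberately simplified: real WARC records also need fields like WARC-Record-ID and WARC-Date, and real tooling should be used for real archives. The helper names (make_record, write_warc_gz, iter_records) are made up for illustration.

```python
import gzip
import zlib


def make_record(uri: str, payload: bytes) -> bytes:
    """Build one simplified WARC 'resource' record (illustrative only).

    Mandatory fields such as WARC-Record-ID and WARC-Date are omitted.
    """
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # A record is terminated by two CRLFs after the payload.
    return headers + payload + b"\r\n\r\n"


def write_warc_gz(records) -> bytes:
    """Compress every record as a separate gzip member and concatenate them."""
    return b"".join(gzip.compress(record) for record in records)


def iter_records(blob: bytes):
    """Yield (target_uri, payload) pairs, decompressing one member at a time."""
    while blob:
        member = zlib.decompressobj(wbits=31)  # wbits=31 selects gzip framing
        record = member.decompress(blob) + member.flush()
        blob = member.unused_data  # bytes belonging to the following members
        head, _, rest = record.partition(b"\r\n\r\n")
        # Skip the "WARC/1.0" version line, then parse "Name: value" headers.
        fields = dict(
            line.split(": ", 1) for line in head.decode("utf-8").split("\r\n")[1:]
        )
        yield fields["WARC-Target-URI"], rest[: int(fields["Content-Length"])]
```

This also shows why it "isn't easy to search": without a separate index, finding one page means walking the records in order, which is the gap the ZIM conversion fills.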



