Hacker News new | past | comments | ask | show | jobs | submit login
Rethinking Sanakirja (pijul.org)
71 points by lelf on Feb 7, 2021 | hide | past | favorite | 16 comments



> One thing I’ve learned is that spending time building, benchmarking and profiling basic blocks may seem like it can delay the overall plan; however, in this case, exploring different designs for the basic parts made it clear which features of the basic blocks were varying, and which were necessary, yielding a more modular design in the end, which was key to the performance improvements I got.

This. I try point this out so often and am challenged with the 'premature optimisation...' mantra but in my view this is in no way a premature optimisation but an essential part of the design process.

Thanks to the author for an enlightening article.


> Thanks to the author for an enlightening article.

Thanks, I'm glad it was useful!


On a side note, "sanakirja" is the Finnish word for "dictionary". My casual glance seems to indicate that it is indeed a K-V store backed by B-trees [1] 1: https://pijul.org/2017/03/16/sanakirja/


Good point. As this is a technical post following up on discussions in the Rust community (on Reddit and IRL in France), I forgot to add that word of background (I've fixed this now, thanks!).


Funny. I thought it must be a Sanskrit word, it doesn't sound too Finnish to my untrained (albeit Hungarian) ears.


Sanakirja is compound word: sana meaning word and kirja meaning book.


The "Finno-Hugric" family is a bit controversial. That said, I speak several indo-european languages, and Farsi still sounds very exotic to me.


AFAIK there is no controversy that Finnish and Hungarian are distantly related. There is however ongoing debate whether the Samoyedic branch should be considered equally related as the Finnic and Ugric branches within the group (meaning whether Finno-Ugric and Uralic languages are synonymous or not.)

Nonetheless, in casual usage modern Finnish and Hungarian are completely different. The similarities are revealed only by an advanced study of the grammar and the comparison of the entire vocabularies. (Dis)similarity of an individual term is meaningless.


They are very different but same time very similar as native Finnish speaker it was allways pretty surreal when visiting Hungary I could ask about places like Bajcsy-Zsilinszky and people would undestand me.


Thanks! My understanding was that even these grammatical similarities (between Hungarian and Finnish) could be somewhat coincidental.


This sounds like a peripheral viewpoint whose proponents have nevertheless been active on Wikipedia!


1. When talking about BTreeMaps with zero sized values, I'd instead talk about BTreeSets

2. Where does the 500 tables thing come from?

3. I don't fully understand why the importance of copy-on-write forks is much less important now


1. Correct, my point was to emphasise that the implementation is the same. I've just updated it, thanks!

2. We often make the assumption that pages can be flushed to disk atomically, which is not very far from true on modern disks. Because the only update that really needs to be atomic is for these tables, they're all stored on one page. Since these pointers are all 64 bits in size (since 4k = 2^12, 12 bits in each of these are 0s), after a header of a few bytes, that leaves about 500 positions. The first few are probably more often atomic than the last ones, especially on very old disks.

3. It's still very important, but less so, because the Pijul repositories are much lighter on disk. Copying 1Mb is fast, but copying 100Mb is 100 times slower. I still like this feature, because (1) it costs almost nothing extra and (2) I really like the implementation.


So you're assuming native 4KiB sector support (which requires both modern disks and operating systems)? Traditional 512 byte sectors only allow for 61 tables to be updated atomically.

What happens when you exceed the limit? Does it silently become non-atomic?

Also, I find it a bit strange you're talking about pages, not (disk) sectors.


Well, I must admit that users currently need to know a little bit about how things work in order to use it, it isn't completely foolproof.

As an example, the type of channels/branches in Pijul contains things equivalent to `BTreeMap<String, (BTreeMap<Vertex, Vec<Edge>>, BTreeMap<u64, Change>)>`, and failing to update the pointers can result in corruption. Of course one can provide a safe interface on top of that (which I do in Pijul), but that depends on the types, so it isn't done yet (macros to derive safe implementations would be a really cool project btw).

Edit: to answer your question more specifically, small sectors might make it silently non-atomic, but there are workarounds (for example by using an extra level of indirection).


There is no modern hardware for sale that actually has 512-byte sectors, and hasn't been for many years now. (More than a decade, I think...)

If you don't have 4kiB sector support, then the hardware is typically emulating 512B sectors on top of it's native 4kiB sectors. If the partitions are properly aligned, this still gives the same atomicity functionality.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: