Zip is... weird. In some undesirable ways sometimes, and desirable in others. Th...

larkost · on Nov 10, 2021

It is a little more complicated than that: having the index at the end is great for writing tape, but sucks for reading tape. The nice thing about that design is that you can "stream" the data on (read one file, write it, and just make some notes on the side for your eventual index). But you can't stream things off (you have to read the index first).
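To make the "stream it on, write the index last" idea concrete, here is a minimal sketch using Python's standard `zipfile` module, which accepts an output object that cannot seek. `WriteOnlyStream` and `stream_files_to_zip` are hypothetical names made up for this illustration, and the in-memory list merely stands in for a tape.

```python
import io
import zipfile


class WriteOnlyStream:
    """Hypothetical stand-in for a tape or pipe: write() and flush() only, no seek()."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)

    def flush(self):
        pass


def stream_files_to_zip(out, files):
    """Write (name, payload) pairs one at a time; the index is emitted on close."""
    with zipfile.ZipFile(out, mode="w") as archive:
        for name, payload in files:
            archive.writestr(name, payload)


tape = WriteOnlyStream()
stream_files_to_zip(tape, [("a.txt", b"first file"), ("b.txt", b"second file")])
written = b"".join(tape.chunks)

# Reading back through zipfile needs random access, because it starts from the end.
with zipfile.ZipFile(io.BytesIO(written)) as archive:
    names = archive.namelist()
```

Because the writer can never go back, the per-file sizes cannot be patched into the headers that precede each file's data, which is part of why readers normally rely on the index at the end.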

Tar is of course all-around great for tape, since every file is essentially put on by itself (with just a little bit of header about it). Again this is great for streaming things both on and off tape. But you can't do any sort of jumping around. You have to start at the beginning and then go from file to file. This gets even worse if you try to compress it (e.g. .tar.gz), as you then have to decompress every byte to get to the next file.
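The forward-only nature of a compressed tar can be sketched with Python's `tarfile` stream mode (`"r|gz"`), which reads without seeking. `ForwardOnlyReader`, `build_tar_gz` and `stream_members` are hypothetical helpers for this illustration, not anything from the thread.

```python
import io
import tarfile


class ForwardOnlyReader:
    """Hypothetical stand-in for a tape drive: read() only, no seek() or tell()."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)


def build_tar_gz(files):
    """Build a small .tar.gz in memory from (name, payload) pairs."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:gz") as archive:
        for name, payload in files:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return out.getvalue()


def stream_members(stream):
    """Yield (name, payload) in archive order; there is no way to skip ahead."""
    # "r|gz" is tarfile's stream mode: it only reads forward and never seeks.
    with tarfile.open(fileobj=stream, mode="r|gz") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member).read()


blob = build_tar_gz([("a.txt", b"alpha"), ("b.txt", b"bravo")])
members = list(stream_members(ForwardOnlyReader(blob)))
```

Getting to `b.txt` here means decompressing everything that belongs to `a.txt` first, which is the cost described above.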

tarasglek · on Nov 11, 2021

couple of points:

a) The zip spec does not require the file index to be at the end, only the pointer to the file index.

b) You can do streaming decompression of zip as each zip entry has a header + filename inline with data.

c) I came up with this optimization (placing the index at the front and making file IO sequential, to make use of OS readahead) for omni.ja files in Firefox. It's still a standard zip file, but the index lives at the front. Most zip tools can open that file unmodified (though they sometimes complain).
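Point (b) can be sketched in a few lines: each entry starts with a fixed 30-byte local header followed by the filename, so a reader can walk entries front to back without touching the index. This is a minimal illustration, not the Firefox code; `walk_local_headers` is a hypothetical name, and it assumes a conventional layout (entries first) where sizes are present in the local headers.

```python
import io
import struct
import zipfile

LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # fixed 30-byte local file header
LOCAL_SIGNATURE = b"PK\x03\x04"


def walk_local_headers(stream):
    """Yield (filename, compressed_size) by reading forward only, never seeking."""
    while True:
        raw = stream.read(LOCAL_HEADER.size)
        if len(raw) < LOCAL_HEADER.size or raw[:4] != LOCAL_SIGNATURE:
            return  # reached the index, or something that is not an entry
        (_sig, _version, flags, _method, _time, _date, _crc,
         compressed_size, _size, name_len, extra_len) = LOCAL_HEADER.unpack(raw)
        if flags & 0x08:
            # Sizes were written after the data (streamed archive); not handled here.
            raise ValueError("sizes are in a trailing data descriptor")
        # Assumption: names are ASCII/UTF-8; real archives may use CP437.
        name = stream.read(name_len).decode("utf-8", "replace")
        stream.read(extra_len)        # skip the extra field
        stream.read(compressed_size)  # a real reader would decompress this here
        yield name, compressed_size


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("a.txt", b"alpha")
    archive.writestr("b.txt", b"bravo!")
buf.seek(0)
entries = list(walk_local_headers(buf))
```

The walk stops as soon as it meets a record that is not a local header, which in this archive is the start of the index.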

lazide · on Nov 12, 2021

Huh, you learn something new every day, thanks! Also, for anyone else reading this: I found this elsewhere and it is applicable [https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT]

That is actually applicable to a project I’m working on now. Any chance your implementation is in Python and open source somewhere?

tarasglek · on Nov 15, 2021

Yes, it's in Python :)

https://github.com/humphd/mozilla-central-old/blob/9d4d9f265...

Curious, how is it applicable?

lazide · on Nov 15, 2021

Offline high-speed data ingestion of multi-thousand-file, multi-hundred-GB data sets, followed by rapid transfer to permanent online storage (and replication fan-out, etc).

Seems convenient to allow optimization for high-speed sequential reads and random reads/writes at different parts of the life cycle, along with indexing, CRCs, signatures, etc.

One big issue with zip storing the index at the end, of course, is that a truncated file basically loses most of its context and is generally unrecoverable even in part, which this could also help with from a durability perspective.

Storing it at the beginning (without an end pointer) opens up the possibility that you have a valid-looking archive you’re touching that is truncated and missing a lot of data, and you won’t know until you look past the end (or validate total bytes or whatever, which doesn’t work well when streaming).

Storing the index at the beginning, pointer and file sig at the end, and all the other format extensions does solve for all this. Which is convenient.
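The truncation concern is easy to reproduce with a conventional index-at-the-end archive. The sketch below uses only Python's standard `zipfile`; `build_zip` and `can_open` are hypothetical helper names, and the halved byte string merely simulates a cut-off transfer.

```python
import io
import zipfile


def build_zip(files):
    """Build a conventional zip (index at the end) in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        for name, payload in files:
            archive.writestr(name, payload)
    return buf.getvalue()


def can_open(data):
    """True if zipfile finds the end record and can load the index."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            archive.namelist()
        return True
    except zipfile.BadZipFile:
        return False


blob = build_zip([("a.bin", b"A" * 4096), ("b.bin", b"B" * 4096)])
truncated = blob[: len(blob) // 2]  # simulate a transfer that stopped halfway

intact_ok = can_open(blob)
truncated_ok = can_open(truncated)
# The first entry's bytes are all still there, but the index that describes them is gone.
first_entry_survives = truncated.startswith(b"PK\x03\x04") and b"A" * 4096 in truncated
```

A standard reader refuses the truncated file outright even though the first entry is physically complete. With the index up front and the pointer plus signature at the end, a missing tail becomes the truncation signal while the index itself stays readable.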

tarasglek · on Nov 18, 2021

Neat, let me know if you have any further questions. Would love to make this a more common thing that happens to zip files. As a result of me doing this in Firefox, zip utilities (e.g. 7zip) started complaining a lot less about this creative interpretation of the standard :)

petre · on Nov 10, 2021

Also limited to 4 GiB unless it's zip64, which is limited to 16 EiB and is not supported by all zip implementations.
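Those two limits fall straight out of the field widths: classic zip stores sizes and offsets in 4-byte fields, and zip64 widens them to 8 bytes. A quick sketch of the arithmetic, with `needs_zip64` as a hypothetical helper name:

```python
ZIP32_MAX = 2**32 - 1  # largest value a 4-byte size or offset field can hold
ZIP64_MAX = 2**64 - 1  # largest value an 8-byte zip64 field can hold


def needs_zip64(size_in_bytes):
    """True if a size or offset does not fit a classic 4-byte field.

    0xFFFFFFFF itself is reserved as the "look in the zip64 extra field" marker,
    so a value exactly equal to it already needs zip64.
    """
    return size_in_bytes >= ZIP32_MAX


classic_limit_gib = ZIP32_MAX / 2**30  # just under 4 GiB
zip64_limit_eib = ZIP64_MAX / 2**60    # just under 16 EiB
```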

selfhoster11 · on Nov 10, 2021

Don't tapes support some form of seeking?

Isthatablackgsd · on Nov 10, 2021

What lazide is saying is that some important information is stored at the end (as in the very end of a book or movie, or the end of a line). So imagine there is a 1 TB .zip archive on the tape: the tape device has to travel all the way through that 1 TB file to get to that last bit of vital data the user wants to see. Normally the vital bit is at the start of the file (as in the front of the line or the beginning of the book), so the user has the information ready before transferring anything. But in lazide's case, the tape device has to keep reading through the entire 1 TB zip to get that last vital bit of information, which makes it slow. It cannot "skip the line" and has to go through the entire line to get there.

lazide · on Nov 10, 2021

They do - it’s very slow, and entirely linear. It also puts wear on the tape, so if you do it a lot you’ll break the tape in a not-super-long-time.

And since you wouldn’t know where the end is by reading the beginning, you’ll have to keep reading until you hit the right marker, and then seek backwards (most tape drives generally can’t read backwards, so you need to jump back and re-read).
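On a seekable file, the same "find the marker at the end, then jump back" step is cheap, which makes a nice contrast with tape. Here is a minimal sketch, assuming a plain (non-zip64) archive; `locate_index` is a hypothetical name, and the record layout follows the 22-byte end-of-central-directory record in the zip spec.

```python
import io
import os
import struct
import zipfile

EOCD = struct.Struct("<4sHHHHIIH")  # fixed 22-byte end-of-central-directory record
EOCD_SIGNATURE = b"PK\x05\x06"
MAX_COMMENT = 0xFFFF                # the trailing comment can be up to 65535 bytes


def locate_index(f):
    """Return (index_offset, index_size, entry_count) by reading only the file's tail."""
    file_size = f.seek(0, os.SEEK_END)
    tail_len = min(file_size, EOCD.size + MAX_COMMENT)
    f.seek(file_size - tail_len)
    tail = f.read(tail_len)
    pos = tail.rfind(EOCD_SIGNATURE)  # scan backwards for the marker
    if pos < 0 or pos + EOCD.size > len(tail):
        raise ValueError("no end-of-central-directory record found")
    (_sig, _disk, _cd_disk, _n_here, entry_count,
     index_size, index_offset, _comment_len) = EOCD.unpack_from(tail, pos)
    return index_offset, index_size, entry_count


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("a.txt", b"alpha")
    archive.writestr("b.txt", b"bravo")
data = buf.getvalue()
index_offset, index_size, entry_count = locate_index(io.BytesIO(data))
```

The reader touches only the last few bytes and then makes one jump to `index_offset`. On tape, both of those moves mean physically winding through everything in between.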