You could at least just compress the data with LZF or any of the really fast tex...

holygoat · on Sept 23, 2016

15 minutes of coding, perhaps.

Then an hour of writing good tests.

Then lots of manual and automated testing on four or five platforms, and fixing the weird issues you get on Windows XP SP2, or only on 64-bit Linux running in a VM, or whatever.

Then making sure you don't regress startup performance (which you probably will unless you have a really, really slow disk).

Then implementing rock solid migration so you can land this in Nightly through to Beta.

Then a rollback/backout plan, because you might find a bug, and you don't want users to lose their session between Beta 2 and Beta 3.

Large-scale software engineering is mostly not about writing code.

acqq · on Sept 23, 2016

> regress startup performance

No, for example, LZ4 is unbelievably fast:

https://github.com/Cyan4973/lz4

almost 2 GB per second in decompression!

I've just tried compressing some backupXX.session file (the biggest I've managed to find, just around 2 MB) and it compressed to 70% of the original, probably not enough to implement the compression -- and I suspect the reason is that the file contains too much base64 encode image files which can't be much compressed?

So the answer to having sane session files can be first to stop storing the favicons (and other images(?)) there? I still believe somebody should analyze the big files carefully to get the most relevant facts. For the start, somebody should make a tool that extracts all the pictures from the .session file (should be easy with python or Perl, for example), just that we know what's inside.

acqq · on Sept 24, 2016

So "somebody" was me at the end, I've rolled up my sleeves and extracted the damn pictures: in my 1.7 MB session file approximately 1 MB were the base64 encodings of exactly 57 jpeg pictures that were all the result of the Google image search (probably of the single search). There were additionally a few pngs, representing "facebook, twitter, gplus" images and one "question mark sign" gif too.

alphapapa · on Sept 25, 2016

Thank you for doing that. This explains much.

For a while now I have been running a cronjob to commit my profile's sessionstore-backups directory to a git repo every 5 minutes.

This is because, occasionally, when Firefox starts, it will load tabs in some kind of limbo state where the page is blank and the title is displayed in the tab--but if I refresh the page, it immediately goes to about:blank, and the original tab is completely lost.

When this happens, I can dig the tabs out of the recovery.js or previous.js files with a Python script that dumps the tabs out of the JSON, but if Firefox overwrites those files first, I can't. So the git repo lets me recover them no matter what.

What I have noticed is that the git repo grows huge very quickly. Right now my recovery.js file is 2.6 MB (which seems ridiculous for storing a list of URLs and title strings), but the git repo is 4.3 GB. If I run "git gc --aggressive" and wait a long time, it compresses down very well, to a few hundred MB.

But it's simply absurd how much data is being stored, and storing useless stuff like images in data URIs explains a lot.

acqq · on Sept 25, 2016

If I understood the intention of the programmers, they simply want to store "everything that can imitate to the server the continuation of the current session" even after the new start of the Firefox (like the restart never happened). The images were sent by Google, but probably remain in the DOM tree which is then written as the "session data" or something like that.

Like you, I also observed that exactly the people who depend on the tabs to "remain" after the restart are those who are hit by the bugs in the "restoration" and as I've said, I believe the users would more prefer to have "stable" tabs and URLs than the "fully restored sessions in the tabs" when all the tabs fully randomly (for them) disappear. Maybe saving just the tabs and URLs separately from "everything session" would be a good step to the robustness (since it would be much less data and much less chance to get corrupted) and then maybe, pretty please, an option "don't save session data" can be in the about:config too)?

Once there's decision to store just the URLs of the tabs as the separate file, the file can even be organized in a way that just the URL that is changed gets rewritten, therefore making the "complete corruption" of the file impossible and also removing the necessity for Firefox to keep N older versions of the big files (which then eventually still don't help the user like you).

alphapapa · on Sept 26, 2016

> Maybe saving just the tabs and URLs separately from "everything session" would be a good step to the robustness (since it would be much less data and much less chance to get corrupted)

Yes, that would be very, very useful. I can get by if the tab's scroll position and favicon and DOM-embedded images--and even formdata--are lost. But if the tab itself is lost, and it was a page I needed to do something with, I may never even realize it is gone...

bsdetector · on Sept 23, 2016

They already have Google's Brotli imported, they'd only need a small tweak to also include the encoder. Or use Snappy which is also in the codebase already.

Add the code that's able to load compressed session backups and leave it in for a couple versions.

Once enough versions have passed enable the code that writes compressed session backups.

It's really not that hard to do unless you want to enable it now.

Yoric · on Sept 23, 2016

Actually, we had a prototype doing this. In the end, we didn't because it broke some add-ons, but it might be time to try again.

Someone · on Sept 23, 2016

Broke some add-ons that read or, worse, write data behind the browser's back, you mean?

Also, what do these add-ons do? The only use case I can think of is figuring out whether the user has a tab open to a given site, and that's going around the browser's security model, so breaking that would be a good thing.