How about this: 29 out of 30 times, save only a diff to the previous data. 1 out of 30 times, save the complete data in compressed form.
(I'm guessing there must already be functionality to diff a bunch of JSON somewhere in the millions of lines of code).
Though I'm sure this doesn't usually make a dent in an SSD's lifetime. But there are still people running Firefox on low-end Android phones with meager flash, and Raspberry Pis with SD cards.
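For what it's worth, the scheme could look roughly like this. It's a minimal Python sketch, not Firefox code: `shallow_diff`, `apply_diff`, `encode_save` and `replay` are made-up names, zlib stands in for whatever compressor would really be used, and the diff only compares top-level keys.

```python
import json
import zlib

FULL_EVERY = 30  # one full snapshot per 30 saves, as proposed above


def shallow_diff(old: dict, new: dict) -> dict:
    """Record top-level keys that were added/changed and keys that were removed."""
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    removed = [k for k in old if k not in new]
    return {"set": changed, "del": removed}


def apply_diff(base: dict, diff: dict) -> dict:
    """Rebuild the newer state from the older one plus a diff."""
    out = dict(base)
    for k in diff["del"]:
        out.pop(k, None)
    out.update(diff["set"])
    return out


def encode_save(save_no: int, prev: dict, cur: dict) -> tuple[str, bytes]:
    """Return ("full", compressed snapshot) or ("diff", small JSON patch)."""
    if save_no % FULL_EVERY == 0:
        return "full", zlib.compress(json.dumps(cur).encode("utf-8"))
    return "diff", json.dumps(shallow_diff(prev, cur)).encode("utf-8")


def replay(records: list[tuple[str, bytes]]) -> dict:
    """Restore the latest state by replaying records in order."""
    state: dict = {}
    for kind, payload in records:
        if kind == "full":
            state = json.loads(zlib.decompress(payload))
        else:
            state = apply_diff(state, json.loads(payload))
    return state
```

Restoring means finding the last full snapshot and replaying every diff after it, so a single unreadable diff loses everything written since. A real version would need a fallback for that.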
> How about this: 29 out of 30 times, save only a diff to the previous data. 1 out of 30 times, save the complete data in compressed form.
>
> (I'm guessing there must already be functionality to diff a bunch of JSON somewhere in the millions of lines of code).
That is actually a good idea we haven't considered yet. A bit too brute force for my tastes, but relatively easy to implement. We would need to determine how much CPU is needed for a diff between two 300 MB JSON files, though (yes, some users have these).
Of course, we're back to the issue of manpower, but definitely worth trying out.
> Though I'm sure this doesn't usually make a dent in an SSD's lifetime. But there are still people running Firefox on low-end Android phones with meager flash, and Raspberry Pis with SD cards.
The implementation of Session Restore for Android is largely independent, so I'm not sure how it works these days.
Surely you'd diff the data structures in memory, not the serialized JSON; and would it really be faster to blindly write the 300 MB each time than to perform this diff?
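One way to avoid diffing at all is to have the in-memory structure remember what changed. A rough Python sketch of that idea, with a hypothetical `TabStore` class that has nothing to do with Firefox's actual internals:

```python
import json


class TabStore:
    """Hypothetical in-memory tab table that remembers which tabs changed."""

    def __init__(self) -> None:
        self.tabs: dict[int, dict] = {}
        self.dirty: set[int] = set()

    def update(self, tab_id: int, **fields) -> None:
        """Apply field changes and mark the tab dirty only if something differs."""
        tab = self.tabs.setdefault(tab_id, {})
        before = dict(tab)
        tab.update(fields)
        if tab != before:
            self.dirty.add(tab_id)

    def flush(self) -> bytes:
        """Serialize only the tabs touched since the last flush."""
        patch = {str(t): self.tabs[t] for t in sorted(self.dirty)}
        self.dirty.clear()
        return json.dumps(patch).encode("utf-8")
```

With this shape the cost of a save scales with the number of tabs that changed rather than with the total session size, which is the point being made about the 300 MB case.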
Why is there so much data in the session restore anyway? If the goal is to have the URLs of the currently opened tabs, I'd expect that just the given URLs should be enough? I think I've seen some unexpected stuff there, like base64-encoded images? Maybe there are enough users who would be satisfied with just the URLs? At least for them the "rewrite" would seldom be needed.
Or, maybe to reformulate, which wild scenarios does Firefox want to support now? I can imagine that the user's experience wouldn't match the wishes. Some people that use session restore claimed they "lost everything" from time to time, and I had to fish "just the urls" from their session store files which looked strange ("full of everything"), but automatically restored to nothing.
Session store contains open tabs, windows, history for each tab, form fields, referrers (so we can re-request the page correctly), titles (so we can restore tabs without re-fetching every page), favicons, charsets, some timestamps, extension data, some kinds of site storage, scroll positions, and a few other things.
The goal of session restore is to restore your session -- your open tabs should come back, the same pages should load, scrolled to the same place, and with the right content.
I wish they'd also restore themselves to the proper location on the proper virtual desktop. With hundreds of windows, typically organized by task, I dread having to reorganize things every time Firefox (or the computer) restarts.
Someday the pain may motivate me to try to learn enough about Firefox's internals to do it. :)
Definitely worth filing a bug in bugzilla if they are not on the right virtual desktop... it might take years to get fixed, but they do generally get round to it.
Storing a lot of static images base64-encoded in JSON, every 15 seconds, is certainly not something users should be blamed for ("some users have 300MB JSON files"); it's just poor programming.
It would be interesting if somebody actually analyzed what takes up most of the mentioned 300 MB. I see a lot of base64-encoded stuff; if those are "favicons," come on. There are so many caches in Firefox already; JSON files certainly aren't the place for these images.
700 KB of binary images in a 1.7 MB session file, which can be compressed only to 70% of its size.
I also see a lot of things like \\u0440, which spends eight characters on one Unicode character (in another file, not mine). But that file was reduced to 37% of its initial size with LZ4. It seems LZ4 is still worth doing, as long as the content remains easily accessible with external tools, e.g. lz4cli.
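To put rough numbers on the escaping overhead, here is a small Python check. The exact count depends on how many layers of escaping the file has, and zlib is only a stand-in for LZ4 here since it ships with Python; the variable names are mine.

```python
import json
import zlib

text = "\u0440" * 1000  # 1000 copies of U+0440 (Cyrillic small letter er)

escaped = json.dumps(text)                      # ASCII-only output: \u0440 per character
literal = json.dumps(text, ensure_ascii=False)  # keeps the character itself
nested = json.dumps(escaped)                    # JSON stored inside a JSON string

sizes = {
    "escaped": len(escaped.encode("utf-8")),
    "literal": len(literal.encode("utf-8")),
    "nested": len(nested.encode("utf-8")),
}

# Escape sequences are very repetitive, so a general-purpose compressor
# recovers most of the overhead.
packed = len(zlib.compress(escaped.encode("utf-8")))
```

Per character that is six bytes escaped, two bytes as literal UTF-8, and seven bytes once the escaped JSON is itself embedded in a JSON string, which is one plausible way a doubled backslash ends up in the file.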
As an Fx user, I'm glad that favicons are stored though. The icons are a much easier, quicker indicator of what the tab is, when I'm scrolling through dozens of tabs to see which one to click and load.
There's no need to store favicons in the session JSON. They're stored in the browser cache. If the cache gets cleared in the meantime, they can be redownloaded.
Agreed. If my computer crashes (once in a blue moon) I would be happy for it to just open the urls again. I don't care whatsoever about it keeping the current state of every tab. If any users do care about that, it should be optional (perhaps a tickbox in the preferences window).
My FF has crashed many times on OS X and while it does restore the URLs, it has never restored the state. I know it because sometimes I am in the middle of writing a comment and it crashes. On recovery, the comment is gone.
Then lots of manual and automated testing on four or five platforms, and fixing the weird issues you get on Windows XP SP2, or only on 64-bit Linux running in a VM, or whatever.
Then making sure you don't regress startup performance (which you probably will unless you have a really, really slow disk).
Then implementing rock solid migration so you can land this in Nightly through to Beta.
Then a rollback/backout plan, because you might find a bug, and you don't want users to lose their session between Beta 2 and Beta 3.
Large-scale software engineering is mostly not about writing code.
I've just tried compressing a backupXX.session file (the biggest I managed to find, just around 2 MB) and it compressed to 70% of the original, probably not enough to justify implementing compression. I suspect the reason is that the file contains too many base64-encoded image files, which don't compress much?
So the first step toward sane session files could be to stop storing the favicons (and other images?) there. I still believe somebody should analyze the big files carefully to get the most relevant facts. For a start, somebody should make a tool that extracts all the pictures from the .session file (should be easy with Python or Perl, for example), just so we know what's inside.
So "somebody" was me in the end. I rolled up my sleeves and extracted the damn pictures: in my 1.7 MB session file, approximately 1 MB was the base64 encodings of exactly 57 JPEG pictures, all of them results of a Google image search (probably a single search). There were additionally a few PNGs, representing "facebook, twitter, gplus" images, and one "question mark sign" GIF too.
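An extraction like that can be approximated in a few lines of Python. This is a sketch that assumes the images appear as `data:image/...;base64,` URIs in the raw file text; `image_payloads` and `bytes_by_type` are names I made up, and it is not the script actually used above.

```python
import base64
import binascii
import re
from collections import Counter

# Matches data URIs such as data:image/jpeg;base64,/9j/4AAQ...
DATA_URI = re.compile(r"data:(image/[\w.+-]+);base64,([A-Za-z0-9+/]+={0,2})")


def image_payloads(session_text: str) -> list[tuple[str, bytes]]:
    """Return (mime type, decoded bytes) for every base64 image data URI."""
    found = []
    for mime, b64 in DATA_URI.findall(session_text):
        try:
            found.append((mime, base64.b64decode(b64, validate=True)))
        except binascii.Error:
            continue  # truncated or malformed payload; skip it
    return found


def bytes_by_type(session_text: str) -> Counter:
    """Total decoded image bytes per mime type."""
    totals: Counter = Counter()
    for mime, blob in image_payloads(session_text):
        totals[mime] += len(blob)
    return totals
```

Running `bytes_by_type(open(path, encoding="utf-8").read())` on an uncompressed session file gives a per-type tally; writing each blob out to disk is a small extension.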
For a while now I have been running a cronjob to commit my profile's sessionstore-backups directory to a git repo every 5 minutes.
This is because, occasionally, when Firefox starts, it will load tabs in some kind of limbo state where the page is blank and the title is displayed in the tab--but if I refresh the page, it immediately goes to about:blank, and the original tab is completely lost.
When this happens, I can dig the tabs out of the recovery.js or previous.js files with a Python script that dumps the tabs out of the JSON, but if Firefox overwrites those files first, I can't. So the git repo lets me recover them no matter what.
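A script along those lines might look like the following sketch. It assumes the session JSON has the shape `windows` -> `tabs` -> `entries`, with each entry carrying `url` and `title` and each tab carrying a 1-based `index` into its entries; treat that layout, and the `list_tabs` name, as assumptions to check against your own file.

```python
import json


def list_tabs(session: dict) -> list[tuple[str, str]]:
    """Return (title, url) of the current history entry of every tab."""
    tabs = []
    for window in session.get("windows", []):
        for tab in window.get("tabs", []):
            entries = tab.get("entries", [])
            if not entries:
                continue
            # "index" is assumed to be 1-based; fall back to the last entry.
            idx = tab.get("index", len(entries)) - 1
            idx = min(max(idx, 0), len(entries) - 1)
            entry = entries[idx]
            tabs.append((entry.get("title", ""), entry.get("url", "")))
    return tabs


def list_tabs_from_file(path: str) -> list[tuple[str, str]]:
    """Load an uncompressed session file and list its tabs."""
    with open(path, encoding="utf-8") as f:
        return list_tabs(json.load(f))
```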
What I have noticed is that the git repo grows huge very quickly. Right now my recovery.js file is 2.6 MB (which seems ridiculous for storing a list of URLs and title strings), but the git repo is 4.3 GB. If I run "git gc --aggressive" and wait a long time, it compresses down very well, to a few hundred MB.
But it's simply absurd how much data is being stored, and storing useless stuff like images in data URIs explains a lot.
If I understood the programmers' intention, they simply want to store "everything needed to make the server believe the current session is continuing" even after Firefox is started again (as if the restart never happened). The images were sent by Google, but probably remain in the DOM tree, which is then written out as the "session data" or something like that.
Like you, I also observed that exactly the people who depend on the tabs "remaining" after a restart are the ones hit by the bugs in the "restoration". As I've said, I believe users would much rather have "stable" tabs and URLs than "fully restored sessions in the tabs" where all the tabs disappear at what looks to them like random. Maybe saving just the tabs and URLs separately from "everything session" would be a good step toward robustness (since it would be much less data and much less chance of corruption), and then maybe, pretty please, there could be a "don't save session data" option in about:config too?
Once there's a decision to store just the URLs of the tabs in a separate file, the file can even be organized so that only the URL that changed gets rewritten, making "complete corruption" of the file impossible and also removing the need for Firefox to keep N older versions of the big files (which in the end still don't help users like you).
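To illustrate what "only the changed URL gets rewritten" could mean, here is a Python sketch using one fixed-size slot per tab with a per-slot checksum. The slot size, the record layout and the names `write_slot` and `read_slots` are all invented for the example.

```python
import os
import struct
import zlib

SLOT = 2048                    # hypothetical fixed slot size, in bytes
HEADER = struct.Struct("<IH")  # CRC32 of the URL bytes, then URL length


def write_slot(path: str, slot_no: int, url: str) -> None:
    """Overwrite a single slot in place; other slots are not touched."""
    data = url.encode("utf-8")
    if len(data) > SLOT - HEADER.size:
        raise ValueError("URL too long for one slot")
    record = HEADER.pack(zlib.crc32(data), len(data)) + data
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        f.seek(slot_no * SLOT)
        f.write(record.ljust(SLOT, b"\0"))
        f.flush()
        os.fsync(f.fileno())


def read_slots(path: str) -> dict[int, str]:
    """Return every slot whose checksum verifies; damaged slots are skipped."""
    urls = {}
    with open(path, "rb") as f:
        slot_no = 0
        while chunk := f.read(SLOT):
            if len(chunk) >= HEADER.size:
                crc, length = HEADER.unpack_from(chunk)
                data = chunk[HEADER.size:HEADER.size + length]
                if length and len(data) == length and zlib.crc32(data) == crc:
                    urls[slot_no] = data.decode("utf-8")
            slot_no += 1
    return urls
```

A damaged slot then costs one URL instead of the whole file. An in-place overwrite is still not atomic, so a crash mid-write can lose the slot being written.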
> Maybe saving just the tabs and URLs separately from "everything session" would be a good step to the robustness (since it would be much less data and much less chance to get corrupted)
Yes, that would be very, very useful. I can get by if the tab's scroll position and favicon and DOM-embedded images--and even formdata--are lost. But if the tab itself is lost, and it was a page I needed to do something with, I may never even realize it is gone...
They already have Google's Brotli imported, they'd only need a small tweak to also include the encoder. Or use Snappy which is also in the codebase already.
Add the code that's able to load compressed session backups and leave it in for a couple versions.
Once enough versions have passed enable the code that writes compressed session backups.
It's really not that hard to do unless you want to enable it now.
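The read-first, write-later rollout can be sketched like this in Python. gzip stands in for Brotli or Snappy because it is in the standard library, and `WRITE_COMPRESSED`, `load_session` and `dump_session` are hypothetical names.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"
WRITE_COMPRESSED = False  # hypothetical switch, flipped a few releases later


def load_session(blob: bytes) -> dict:
    """Accept both formats: plain JSON or gzip-compressed JSON."""
    if blob[:2] == GZIP_MAGIC:
        blob = gzip.decompress(blob)
    return json.loads(blob.decode("utf-8"))


def dump_session(session: dict, compressed: bool = WRITE_COMPRESSED) -> bytes:
    """Keep writing plain JSON until the switch is turned on."""
    raw = json.dumps(session).encode("utf-8")
    return gzip.compress(raw) if compressed else raw
```

Because the reader ships several versions before the writer changes, a user who downgrades by a version or two can still open whatever the newer build wrote.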
Broke some add-ons that read or, worse, write data behind the browser's back, you mean?
Also, what do these add-ons do? The only use case I can think of is figuring out whether the user has a tab open to a given site, and that's going around the browser's security model, so breaking that would be a good thing.
You also need to handle all corner cases where one of the intermediary diffs gets corrupted (you can't generally assume in a program like Firefox that data you write out is going to be readable in the future because lol common hardware). Or where the diffs are larger than a fresh snapshot. And you absolutely can't get it wrong.
It's something that sounds easy until you actually try to get it coded up and shipping.
I explained below that this thing isn't a factor for Android because the program gets suspended.
In the general case that's a huge pain. In this case, where you're writing a single blob to disk, you can stick a checksum at the end and you're good to go.
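Something like this, as a minimal Python sketch. SHA-256 is picked only because it is in the standard library (a faster non-cryptographic hash would do for detecting corruption), and `seal` and `unseal` are made-up names.

```python
import hashlib
import json
from typing import Optional

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes


def seal(session: dict) -> bytes:
    """Serialize the session and append a checksum of the serialized bytes."""
    raw = json.dumps(session).encode("utf-8")
    return raw + hashlib.sha256(raw).digest()


def unseal(blob: bytes) -> Optional[dict]:
    """Return the session, or None if the blob is truncated or corrupted."""
    if len(blob) < DIGEST_LEN:
        return None
    raw, digest = blob[:-DIGEST_LEN], blob[-DIGEST_LEN:]
    if hashlib.sha256(raw).digest() != digest:
        return None
    return json.loads(raw)
```

A `None` result is the signal to fall back to an older backup file instead of restoring garbage.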
It doesn't talk about the difficulties in getting data safely to disk. There's just a worry that taking a hash of the entire session state is expensive. I'm skeptical that a fast hash would take long compared to the time spent serializing to JSON in the first place, let alone time spent diffing.
Firefox does generally assume that the filesystem is reliable (as long as you use fsync properly, etc.). Witness, e.g., its reliance on SQLite for data storage.
I'm curious, could you expand on this comment? Do you mean that you run Firefox from a profile loaded over SSHFS? I can only imagine that being unbearably slow because of I/O latency. And shouldn't disabling caching in SSHFS make it much, much worse?
As Microsoft have discovered with the binary patches they ship through Windows Update. If you've ever run sfc /scannow and then dism /online /cleanup-image /scanhealth, then you are almost certainly seeing corrupted binary patches.