I cannot speak for eanzenberg but I think his comment was less about his personal justification and more about the rationalizations that have been used in the history of stats.
Gauss quite openly admitted that the choice was borne out of convenience. The justification via the Normal (Gaussian) distribution came later, and the Gauss-Markov result came later still.
Even at the time Gauss proposed the loss, many of his peers (and perhaps Gauss himself) noted that other loss functions seemed more appropriate if one goes by empirical performance, in particular the L1 distance.
Now that we have the compute power to deal with L1, it has come back with a vengeance, and people have been researching its properties with renewed earnestness. In fact, there is a veritable revolution going on right now in the ML and stats world around it.
Just as minimizing the squared loss gives you the conditional expectation, minimizing the L1 error gives you the conditional median. The latter is to be preferred when the distribution has fat tails or is corrupted by outliers. This knowledge is nowhere close to being new; Gauss' peers knew it.
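To make that concrete, here's a quick numpy/scipy sketch (toy data of my own, not anyone's real pipeline): fit a single location parameter under each loss and watch what a few gross outliers do to the fit.

    # The L2 minimizer is the mean and gets dragged around by outliers;
    # the L1 minimizer is the median and barely moves. Requires numpy + scipy.
    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    # 200 well-behaved points plus a few gross outliers (think data entry errors).
    data = np.concatenate([rng.normal(loc=5.0, scale=1.0, size=200),
                           [500.0, -300.0, 1000.0]])

    # Minimize each loss over a single location parameter c.
    l2_fit = minimize_scalar(lambda c: np.sum((data - c) ** 2)).x
    l1_fit = minimize_scalar(lambda c: np.sum(np.abs(data - c))).x

    print(f"L2 minimizer: {l2_fit:.2f}  (matches the mean:   {data.mean():.2f})")
    print(f"L1 minimizer: {l1_fit:.2f}  (matches the median: {np.median(data):.2f})")
    # The L2 fit is pulled far away from 5 by three outliers; the L1 fit stays put.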
3 times yes: "The latter is to be preferred when the distribution [...] is corrupted by outliers."
I work in chemoinformatics. The main methods academics use to regress parameters have not changed in the past 40 years, even though we went from small, carefully assessed data sets (think 200 experimental points) to larger ones (10,000 points, sometimes millions) with a lot of outliers from data entry errors, experimental errors, etc.
The end result is that when I see models of interest without the raw data, I re-regress the parameters using my own datasets, because most of the time you can barely trust them (even when they come from well-known research centres).
> Gauss quite openly admitted that the choice was borne out of convenience.
That's quite interesting. Do you have a reference for that?
From my understanding, the popularity of the least squares method came (at least in part) from Gauss' successful prediction of the position of Ceres. Was this just because people not using least squares were not able to calculate it?
It's in the original paper in which he derives the normal distribution. Well worth a read. I last had a copy of it in the fourth basement down in the university library about fifteen years ago - it might still be there.
Not disagreeing with your points about L1, but I want to point out that you can also do things to make L2 more robust to outliers (and get better empirical performance), such as winsorizing the data.
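For example, something along these lines (a toy sketch of my own; scipy ships a winsorize helper in scipy.stats.mstats):

    # Winsorizing: clip the data at chosen percentiles before fitting with L2,
    # so a few wild values stop dominating the squared loss.
    import numpy as np
    from scipy.stats.mstats import winsorize

    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(5.0, 1.0, 200), [500.0, -300.0, 1000.0]])

    # Clip the bottom and top 5% of values to the 5th/95th percentile values.
    clipped = winsorize(data, limits=(0.05, 0.05))

    print(f"raw mean:        {data.mean():.2f}")     # dragged off by the outliers
    print(f"winsorized mean: {clipped.mean():.2f}")  # back close to the true 5.0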
I think it's just historically interesting. Every statistician used OLS before computers because they could solve it with pen and paper, so when computers came along it was ported over. But with a computer you can minimize any loss function.
However, it is useful to have a closed-form solution, because it guarantees you actually minimized the loss. Other strategies for minimizing functions don't offer that guarantee, but they're still extremely useful.
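To illustrate the contrast (again a toy sketch with made-up data): the OLS coefficients drop straight out of the normal equations, pure linear algebra, while an arbitrary loss like L1 has to be handed to an iterative solver whose convergence is the solver's problem.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(2)
    X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept + one feature
    y = X @ np.array([2.0, 3.0]) + rng.normal(scale=0.5, size=100)

    # Closed form: beta = (X^T X)^{-1} X^T y -- pen-and-paper-era algebra.
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # Iterative: minimize the (non-smooth) L1 loss numerically; Nelder-Mead
    # is derivative-free, so it copes with the kinks in |residual|.
    beta_l1 = minimize(lambda b: np.sum(np.abs(y - X @ b)),
                       x0=np.zeros(2), method="Nelder-Mead").x

    print("closed-form OLS:", beta_ols)
    print("iterative L1:   ", beta_l1)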