Yes, the answers were distilled from a much stronger model. On the one hand, you can argue that this is exactly what the LMSYS, WildBench, etc. datasets are for: improving performance/alignment on real-world use cases. On the other hand, most of these questions are used repeatedly by Chatbot Arena users (who are largely not representative of the general population) to compare and test models. Training on them clearly makes Chatbot Arena Elo less useful as a model comparison tool and artificially elevates Gemma 2's Arena score relative to its OOD performance.
At the end of the day, optimizing for leaderboard scoring makes the leaderboard ranking less useful as a benchmark (Goodhart's law strikes again). The Gemma team obviously isn't the only one doing it, but it's important to be clear-eyed about the consequences.
On all flights I have taken, the flight attendants count the passengers on board and check the lavatories before taxiing. This method wouldn't have worked even with free seats available.
Just had to go back to the gate the other day because the manifest didn’t match the count. I can see this getting overlooked by flippant flight attendants, especially on smaller flights, but on my flight we were next in line for takeoff when they sent us back to the gate. So we almost made it to the sky. Also, I don’t know what happened: they checked one seat and then we were off. Missed my connecting flight because of it.
This is also for weight and balance / performance calculations. If the calculations are done for 150 passengers and you count only 149 (or 151), you can't legally take off without new paperwork.
Zopf means “braid”, and it also denotes a medium-sized loaf of bread made with some milk, glazed with egg yolk, shaped like a braid, and traditionally eaten on Sunday.