Hi, I am the first author of this paper, and I am happy to answer any questions. You can find the technical paper here: https://arxiv.org/abs/2205.09665.
Hey, this is cool! I do the NYT Crossword every day. A few questions:
1. You mention an 82% solve rate. The NYT puzzle gets "harder" each day from Monday through Saturday. Do you track the days separately? If so, I'd be curious how much of the unsolved 18% ends up on Fridays and Saturdays. (For anyone who doesn't know: the Sunday puzzle is outside the Monday-Saturday range since it's a bigger puzzle.)
2. Related to the above, Thursday puzzles usually have "tricks" in them (skipped letters and whatnot) or require a rebus (multiple letters in one space). Do you handle these at all?
3. Is this an ongoing model that keeps getting better at solving, or did you have to seed it with a set of solved puzzles and clues?
2. Our current system doesn't have any handling for rebuses or similar tricks, although Dr. Fill does. I think this is part of why Thursday is the hardest day for us, even though Saturday is usually considered the most difficult.
3. We trained it on 6.4M clues. As new crosswords get published, we could theoretically retrain our model on more data, but we aren't currently planning to do that.
I don't suppose you gave more weight to more recent puzzles? Is there a time period or puzzle setter that was harder to solve because they favored an unusual clue type?
We didn't give more weight to recent puzzles. In fact, we trained on pre-2020 data, validated on data from 2020, and evaluated on post-2020 data.
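To make that split concrete, here is a minimal sketch of the kind of year-based partition I mean. The record fields and example entries are just for illustration, not our actual data schema.

```python
from datetime import date

def split_by_year(examples):
    """Partition (clue, answer, date) records into train/val/test by puzzle year.

    Mirrors the split described above: pre-2020 for training,
    2020 for validation, post-2020 for evaluation.
    """
    train, val, test = [], [], []
    for ex in examples:
        year = ex["date"].year
        if year < 2020:
            train.append(ex)
        elif year == 2020:
            val.append(ex)
        else:
            test.append(ex)
    return train, val, test

# Example usage with made-up records
examples = [
    {"clue": "Capital of France", "answer": "PARIS", "date": date(2018, 3, 4)},
    {"clue": "Infectious disease expert Anthony", "answer": "FAUCI", "date": date(2021, 5, 9)},
]
train, val, test = split_by_year(examples)
```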
Our model seems to perform well despite this "time generalization" split, but there are a couple of instances where it struggled with new words. For example, we got the answer "FAUCI" wrong in a puzzle from May 2021. Even though Fauci was in the news before 2020, I guess he wasn't famous enough to show up in crosswords, so his name wasn't in our training data.
I think evaluating performance by constructor would be really interesting! But we haven't done that.
For handling cross-reference clues, do you think it would be feasible in the future to feed the QA model a representation of the partially-filled puzzle (perhaps only in the refinement step, since it's hard to do in the first pass before you have any answers!), in order to give it a shot at answering clues that require looking at other answers? (There's a rough sketch of what I mean at the end of this comment.)
It feels like the challenges might be that most clues are not cross-referential, and even for those that are, most information in the puzzle is irrelevant - you only care about one answer among many, so it could be difficult to learn to find the information you need.
But maybe this sort of thing would also be helpful for theme puzzles, where answers might be united by the theme even if their clues are not directly cross-referential, and could give enough signal to teach the model to look at the puzzle context?
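To sketch roughly what I mean, here's one hypothetical way the partially-filled grid could be flattened into text and handed to the QA model as extra context in a refinement pass. The function, input format, and example fills are entirely made up for illustration.

```python
def build_refinement_input(clue, pattern, crossing_fills):
    """Serialize a clue plus partial-grid context into one text input.

    clue           -- the clue text, e.g. "See 17-Across"
    pattern        -- current letters for this slot, with '.' for blanks
    crossing_fills -- other slots' current best answers, keyed by clue number

    This is just one hypothetical way the partially filled puzzle could
    be represented as text for a second QA pass.
    """
    grid_context = " | ".join(f"{num}: {ans}" for num, ans in crossing_fills.items())
    return f"clue: {clue} pattern: {pattern} grid: {grid_context}"

# Example usage with made-up fills
print(build_refinement_input(
    clue="See 17-Across",
    pattern=".....",
    crossing_fills={"17A": "MOONLANDING", "24D": "APOLLO"},
))
```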
One thing I was curious about: the ACPT is a crossword speed-solving competition, with time spent solving being a major component of the total score. How did you approach leveling the playing field between the human and computer competitors?