Thanks for the details.
Concerning matching for diffing vs for merging, the differences I can think of are:
- for diffing, the matching of the leaves is what matters the most, for merging the internal nodes are more important,
- for diffing, it feels more acceptable to restrict the matching to be monotonic on the leaves, since it's difficult to visually represent moves even if you can detect them. For merging, supporting moves is more interesting as it lets you replay changes on the moved element,
- diffing needs to be faster than merging, so the accuracy/speed tradeoffs can be different.
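
To make the second point concrete, here is a minimal sketch, assuming leaves are compared as plain token strings. A monotonic matching is essentially a longest common subsequence, so a moved leaf stays unmatched, while a matching without the order restriction can pair it up. Both function names are made up for illustration and are not taken from either tool.

```python
def monotonic_match(old, new):
    """Order-preserving matching of leaves: a longest common subsequence."""
    n, m = len(old), len(new)
    # lcs[i][j] = length of the LCS of old[i:] and new[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table to recover one optimal set of (old_index, new_index) pairs.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if old[i] == new[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def unrestricted_match(old, new):
    """Greedy matching that ignores order, so moved leaves can be paired."""
    used = set()
    pairs = []
    for i, leaf in enumerate(old):
        for j, other in enumerate(new):
            if j not in used and leaf == other:
                used.add(j)
                pairs.append((i, j))
                break
    return pairs
```

With `old = list("abcd")` and `new = list("dabc")`, the monotonic version leaves `d` unmatched (a deletion plus an insertion in a diff), whereas the unrestricted version also pairs `d` at index 3 with index 0, which is the information a merge tool needs to replay edits on the moved element.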
Packaging parsers into separate executables seems like hard work indeed! I assume you also considered fixing the tree-sitter grammars (vendoring them as needed, if the fixes can't be upstreamed)? Tree-sitter parsers are being used for a lot more than syntax highlighting these days (for instance GitHub's "Symbols" panel) so I would imagine maintainers should be open to making grammars more faithful to the official specs. I'm not particularly looking forward to maintaining dozens of forked grammars but it still feels a lot easier than writing parsers in different languages. I guess you have different distribution constraints also.
> - for diffing, the matching of the leaves is what matters the most, for merging the internal nodes are more important,
The leaves are the ones that end up being highlighted in the diff, but the inner nodes play an important role as well. We try to preserve as much of the code structure as possible when mapping the nodes. A developer is unlikely to change the structure of the code just for fun. A mapping with a larger number of structural changes is therefore more likely to be incorrect.
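
As a rough illustration of that last point (a hypothetical scoring function, not the metric we actually use), one can count the mapped nodes whose parents are not mapped to each other and prefer the candidate mapping with the lower count. Node names and the dictionary-based tree encoding are assumptions made for the sketch.

```python
def structural_changes(mapping, old_parent, new_parent):
    """Count mapped nodes whose parents are not mapped to each other."""
    changes = 0
    for old_node, new_node in mapping.items():
        old_p = old_parent.get(old_node)
        new_p = new_parent.get(new_node)
        if old_p is None and new_p is None:
            continue  # both nodes are roots, nothing to compare
        if old_p is None or new_p is None or mapping.get(old_p) != new_p:
            changes += 1
    return changes


# Two functions, each containing one textually identical return statement.
old_parent = {"fnA": "root", "fnB": "root", "ret1": "fnA", "ret2": "fnB"}
new_parent = {"fnA'": "root'", "fnB'": "root'", "ret1'": "fnA'", "ret2'": "fnB'"}

base = {"root": "root'", "fnA": "fnA'", "fnB": "fnB'"}
straight = {**base, "ret1": "ret1'", "ret2": "ret2'"}
crossed = {**base, "ret1": "ret2'", "ret2": "ret1'"}

candidates = [crossed, straight]
best = min(candidates, key=lambda m: structural_changes(m, old_parent, new_parent))
```

Both candidates match the same leaves by text, but `crossed` implies that each return statement moved to the other function, so `straight` is picked as the more plausible mapping.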
> - for diffing, it feels more acceptable to restrict the matching to be monotonic on the leaves, since it's difficult to visually represent moves even if you can detect them. For merging, supporting moves is more interesting as it lets you replay changes on the moved element,
We use a pipeline-based approach, and visualizing the changes is the last step. For some types of changes (e.g. moves within the same line) we don't have a way to visualize them yet, so we ignore that part of the mapping when rendering. We are still trying to get the mapping right though :)
We upstreamed a few bug fixes for tree-sitter itself. The grammars were a bit more complicated because we were just using them as a starting point. We patched tree-sitter, added our own annotations to the grammars and restructured them to help our matching algorithm achieve better results and improve performance. In the end there was not much to upstream any more.
Using a well-tested parsing library, such as Roslyn for C#, and writing some code to integrate it into our existing system aligned more with our goals than tinkering with grammars. Context-sensitive keywords in particular were a constant source of annoyance. The grammar looks correct, but parsing still fails because of the way the lexer works. You don't want your tool to abort just because someone named their parameter "async".
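
The "async" problem can be reproduced with a toy example. This is only a sketch under simplifying assumptions: the lexer and parser below are invented for illustration and do not reflect how tree-sitter or Roslyn work internally. A lexer that always classifies `async` as a keyword makes an otherwise reasonable parameter-list parser reject valid input, while a lexer that looks at the following token accepts it.

```python
import re

CONTEXTUAL_KEYWORDS = {"async", "await"}
PUNCTUATION = {"(", ")", ","}
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|[(),])")


def lex(source, contextual):
    """Split source into (kind, text) tokens.

    With contextual=False a contextual keyword is always a keyword.
    With contextual=True it is demoted to an identifier when it is
    directly followed by ',' or ')', i.e. where a parameter name ends.
    """
    texts = TOKEN.findall(source)
    tokens = []
    for index, text in enumerate(texts):
        following = texts[index + 1] if index + 1 < len(texts) else ""
        if text in PUNCTUATION:
            kind = "punct"
        elif text in CONTEXTUAL_KEYWORDS and not (
            contextual and following in {",", ")"}
        ):
            kind = "keyword"
        else:
            kind = "ident"
        tokens.append((kind, text))
    return tokens


def parameter_names(source, contextual):
    """Parse "(type name, type name)" and return the parameter names."""
    tokens = lex(source, contextual)
    if len(tokens) < 2 or tokens[0] != ("punct", "(") or tokens[-1] != ("punct", ")"):
        raise SyntaxError("expected a parenthesized parameter list")
    names = []
    body = tokens[1:-1]
    # Each parameter is `type name`, separated by commas: groups of 3 tokens.
    for start in range(0, len(body), 3):
        group = body[start:start + 3]
        kinds = [kind for kind, _ in group[:2]]
        if kinds != ["ident", "ident"]:
            raise SyntaxError(f"cannot parse parameter: {group}")
        if len(group) == 3 and group[2] != ("punct", ","):
            raise SyntaxError("expected ','")
        names.append(group[1][1])
    return names
```

Calling `parameter_names("(int async, string name)", contextual=True)` returns the two names, while the same call with `contextual=False` raises `SyntaxError`, even though the parser's rule for a parameter is identical in both cases.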