wetneb · 2024-11-09T20:01:11 1731182471

In Mergiraf, as soon as there is a parsing error in any of the revisions, it falls back on line-based merging, even though tree-sitter is generally good at isolating the error. It felt like the safest thing to do (maybe we detected the language wrong), but I'm definitely open to reconsidering…

wetneb · 2024-11-09T19:02:27 1731178947

Yeah at the moment it just supports whatever the tree-sitter parser accepts, period. A bring-your-own-grammar version could be interesting, I don't see why it couldn't work. Do you have any Rust crates to recommend, to do parsing according to a grammar supplied by the user at run time? It's likely to be slower, but maybe not prohibitively so…

Another approach would be for the tool to accept doing structured merging even if there are error nodes in the parsed tree. If those errors span the parts of the file where the extended language is used, then the tool could still help with merging the other parts, treating the errors as atomic blocks. I'd be a bit reluctant to do that, because there could be errors for all sorts of other reasons.
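To make the "errors as atomic blocks" idea concrete, here is a minimal sketch. It uses a toy tree type rather than tree-sitter's real API, and `Node`, `source_text`, `has_error` and `freeze_errors` are names invented for illustration. The only thing borrowed from tree-sitter is that unparseable regions show up as nodes of kind `ERROR`.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy syntax-tree node (hypothetical, not tree-sitter's API)."""
    kind: str
    text: str = ""  # source text, only set on leaves
    children: list = field(default_factory=list)


def source_text(node):
    """Concatenate the leaf texts below a node, in order."""
    if not node.children:
        return node.text
    return "".join(source_text(child) for child in node.children)


def has_error(node):
    """True if the node or any descendant is an ERROR node."""
    return node.kind == "ERROR" or any(has_error(c) for c in node.children)


def freeze_errors(node):
    """Copy the tree, collapsing each ERROR subtree into one opaque leaf."""
    if node.kind == "ERROR":
        return Node("ERROR", source_text(node))
    return Node(node.kind, node.text, [freeze_errors(c) for c in node.children])


tree = Node("program", children=[
    Node("function", "fn ok() {}"),
    Node("ERROR", children=[Node("identifier", "my_macro!"), Node("ERROR", "{ ... }")]),
])
frozen = freeze_errors(tree)
```

The strict policy is "if `has_error(tree)` on any revision, fall back to line-based merging". The lenient one would merge `frozen` instead, so the error region can only be kept or replaced as a whole. It does nothing about the risk that the error comes from a wrongly detected language.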

papashell · 2024-11-10T00:12:18 1731197538

Since tree-sitter grammars compile to a C library, you could load one dynamically.

The Rust bindings themselves are a thin FFI wrapper.

If you wanted to make it a little smoother than requiring users to compile the tree-sitter grammar themselves, you could compile and bundle grammars as WASM, so it's sandboxed and cross-platform.

Edit: found this VS Code extension that dynamically loads grammars compiled to WASM. You should be able to do the same thing in Rust: https://github.com/selfint/vscode-tree-sitter
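A rough sketch of the dynamic-loading route, in Python with `ctypes` rather than Rust to keep it dependency-free. It relies on generated parsers exporting a C function named `tree_sitter_<language>` that returns a pointer to the language definition. `load_grammar` and the library path are hypothetical, and handing the pointer to a tree-sitter binding is left out.

```python
import ctypes


def load_grammar(lib_path, language_name):
    """Load a compiled grammar and return its raw language pointer as an int."""
    lib = ctypes.CDLL(lib_path)  # raises OSError if the library cannot be loaded
    # Raises AttributeError if the library does not export the expected symbol.
    entry = getattr(lib, f"tree_sitter_{language_name}")
    entry.restype = ctypes.c_void_p
    entry.argtypes = []
    return entry()


# Hypothetical usage, assuming a grammar was compiled to this path:
# pointer = load_grammar("./libtree-sitter-go.so", "go")
```

Loading native code this way runs it unsandboxed in the host process, which is the trade-off the WASM route avoids.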

wetneb · 2024-11-09T16:00:37 1731168037

Lombok is an interesting example, but yes, just with reflection you can already get order-dependent behaviors as the docs note. I've been thinking about giving users more control over this commutativity, but it's not clear to me what it should look like. A strict mode where commutativity is disabled entirely? The ability to disable certain commutative parents?
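To illustrate the second option, here is a sketch of what a per-parent switch could look like. This is not Mergiraf's actual algorithm or configuration: `merge_commutative`, `commutative_kinds` and `disabled` are names made up for the example, and children are compared as plain strings.

```python
def merge_commutative(parent_kind, base, left, right,
                      commutative_kinds, disabled=frozenset()):
    """Three-way merge of children whose order is assumed not to matter.

    Returns the merged list, or None when the parent must be treated as
    ordered (not commutative, or explicitly disabled by the user).
    """
    if parent_kind not in commutative_kinds or parent_kind in disabled:
        return None
    # A child deleted on either side is dropped from the result.
    removed = {c for c in base if c not in left or c not in right}
    merged = [c for c in base if c not in removed]
    # Children added on either side are appended, skipping duplicates.
    for side in (left, right):
        for child in side:
            if child not in base and child not in merged:
                merged.append(child)
    return merged


kinds = {"import_list"}
base = ["import a;"]
left = ["import a;", "import b;"]
right = ["import c;", "import a;"]

merged = merge_commutative("import_list", base, left, right, kinds)
opted_out = merge_commutative("import_list", base, left, right, kinds,
                              disabled={"import_list"})
```

A strict mode would then just be the special case where `commutative_kinds` is empty, so every parent falls through to ordered merging.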

wetneb · 2024-11-09T15:41:10 1731166870

Thanks for the insightful comments! You surely have a lot more experience than me there, but my impression was that producing visual diffs and merging files are tasks that put different requirements on the tree matching algorithms, and Dijkstra-style approaches felt more fitting for diffs than for merging, so that's why I went for GumTree as it seemed to be the state of the art for merging. Does SemanticDiff offer a merge driver? I could only find documentation about diffing on the website.

As to mismatches: yes, they are bound to happen in some cases. Even for line-based diffing, Git uses rather convoluted heuristics to avoid them (with the "histogram" diff algorithm), but they can't be completely ruled out there either. I hope that with enough safeguards (helper to review merges, downstream consistency checks with local fall-back to line-based diffing) they can be lived with. I'm happy to try other matching algorithms if they are more promising though (there isn't much coupling with the rest of the pipeline).

Concerning tree-sitter, I have noticed some small issues, but nothing that was a show-stopper so far. I actually like it that it's designed for syntax highlighting, because it's really helpful that the representations it gives stay faithful to the original source, to avoid introducing reformatting noise in the merging process. Parsers written for a specific language can sometimes be too zealous (stripping comments out, doing some normalizations behind your back). That's a problem in Spork (which uses Spoon, a pretty advanced Java parser). And the uniform API tree-sitter offers over all those parsers is just too good to give up, in my opinion.

DarkPlayer · 2024-11-09T16:48:44 1731170924

I don't think that different algorithms are better for merging or diffing. In both cases, the first step is to match identical nodes, and the quality of the final result depends heavily on this step. The main problem with GumTree is that it is a greedy algorithm. One incorrectly matched node can completely screw up the rest of the matches. A typical example we encountered was adding a decorator to a function in Python. When other functions with the same decorator followed, the algorithm would often map the newly added decorator to an existing decorator, causing all other decorator mappings to be "off-by-one". GumTree has a tendency to come up with more changes than there actually are.
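The off-by-one effect can be reproduced with a deliberately simplified first-fit matcher. This is not GumTree itself, only the greedy failure mode in miniature. Leaves are `(label, owner)` pairs, where `owner` is the function the leaf belongs to. All names are made up for the example.

```python
def greedy_match(old, new):
    """Map each new leaf to the first unused old leaf with the same label."""
    used = set()
    mapping = {}
    for j, (label, _owner) in enumerate(new):
        for i, (old_label, _old_owner) in enumerate(old):
            if i not in used and old_label == label:
                used.add(i)
                mapping[j] = i
                break
    return mapping


# Before: only b and c are decorated. After: a decorator was added to a.
old = [("def a", "a"),
       ("@cache", "b"), ("def b", "b"),
       ("@cache", "c"), ("def c", "c")]
new = [("@cache", "a"), ("def a", "a"),
       ("@cache", "b"), ("def b", "b"),
       ("@cache", "c"), ("def c", "c")]

mapping = greedy_match(old, new)
# Pairs (new owner, old owner) where a leaf was matched across functions.
shifted = [(new[j][1], old[i][1]) for j, i in mapping.items()
           if new[j][1] != old[i][1]]
# New leaves left without a partner, i.e. reported as insertions.
inserted = [j for j in range(len(new)) if j not in mapping]
```

Here the new decorator on `a` grabs the one that belonged to `b`, `b`'s grabs `c`'s, and the decorator on `c` ends up reported as the insertion.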

We are trying to really nail down the diff quality before going after merges. We don't have merge functionality in SemanticDiff yet.

The main issue we have with tree-sitter is that the grammars are often written from scratch and not based on the upstream grammar definition. Sometimes they only cover the most likely cases, which can lead to parsing errors or incorrectly parsed code. When you encounter parsing errors, it can be difficult to fix them, because the upstream grammar is structured completely differently. To give you an example, try to compare the tree-sitter Go grammar for types [1] with the upstream grammar [2]. It is similar, but the way the rules are structured is somewhat inverted.

We use separate executables for the parsers (this also helps to secure them using seccomp on Linux), and they all use the same JSON schema for their output. This allows us to write the parser executable in the most appropriate language for the target language. Building them all statically and cross-platform for our VS Code extension isn't easy though ;)

[1]: https://github.com/tree-sitter/tree-sitter-go/blob/master/gr...

[2]: https://go.dev/ref/spec#Types

wetneb · 2024-11-09T17:13:02 1731172382

Thanks for the details. Concerning matching for diffing vs for merging, the differences I can think of are:

- for diffing, the matching of the leaves is what matters the most, for merging the internal nodes are more important,

- for diffing, it feels more acceptable to restrict the matching to be monotonic on the leaves, since it's difficult to visually represent moves even if you can detect them. For merging, supporting moves is more interesting as it lets you replay changes on the moved element,

- diffing needs to be faster than merging, so the accuracy/speed tradeoffs can be different.

Packaging parsers into separate executables seems like hard work indeed! I assume you also considered fixing the tree-sitter grammars (vendoring them as needed, if the fixes can't be upstreamed)? Tree-sitter parsers are being used for a lot more than syntax highlighting these days (for instance GitHub's "Symbols" panel) so I would imagine maintainers should be open to making grammars more faithful to the official specs. I'm not particularly looking forward to maintaining dozens of forked grammars but it still feels a lot easier than writing parsers in different languages. I guess you have different distribution constraints also.

DarkPlayer · 2024-11-09T19:11:11 1731179471

> - for diffing, the matching of the leaves is what matters the most, for merging the internal nodes are more important,

The leaves are the ones that end up being highlighted in the diff, but the inner nodes play an important role as well. We try to preserve as much of the code structure as possible when mapping the nodes. A developer is unlikely to change the structure of the code just for fun. A mapping with a larger number of structural changes is therefore more likely to be incorrect.

> - for diffing, it feels more acceptable to restrict the matching to be monotonic on the leaves, since it's difficult to visually represent moves even if you can detect them. For merging, supporting moves is more interesting as it lets you replay changes on the moved element,

We use a pipeline-based approach, and visualizing the changes is the last step. For some types of changes we don't have a way to visualize them yet (e.g. moves within the same line), so we ignore that part of the mapping. We are still trying to get the mapping right though :)

We upstreamed a few bug fixes for tree-sitter itself. The grammars were a bit more complicated because we were just using them as a starting point. We patched tree-sitter, added our own annotations to the grammars and restructured them to help our matching algorithm achieve better results and improve performance. In the end there was not much to upstream any more.

Using a well tested parsing library, such as Roslyn for C#, and writing some code to integrate it into our existing system aligned more with our goals than tinkering with grammars. Context-sensitive keywords in particular were a constant source of annoyance. The grammar looks correct, but it will fail to parse because of the way the lexer works. You don't want your tool to abort just because someone named their parameter "async".

wetneb · 2024-11-09T14:39:52 1731163192

It's based on tree-sitter indeed: https://mergiraf.org/adding-a-language.html

wetneb · 2024-11-09T14:26:49 1731162409

Out of curiosity, which languages would you be interested in?

donatj · 2024-11-09T15:23:27 1731165807

Off the top of my head, PHP, TypeScript and INI. JavaScript might cover TypeScript, I don't know.

wlll · 2024-11-09T14:42:26 1731163346

For me, Ruby.

wetneb · 2024-11-09T13:47:43 1731160063

Yes, see this example: https://mergiraf.org/conflicts.html#line-based-merges

wetneb · 2024-11-09T13:40:15 1731159615

There are attempts to do that, such as https://arxiv.org/abs/2111.11904 or https://arxiv.org/abs/2109.00084. I couldn't find any open source implementation though. I'm personally not so keen to use LLMs for merging: I want it to be quick and predictable.

wetneb · 2024-11-09T13:04:57 1731157497

I tried your example but git does create a conflict in my case - but maybe I misunderstood the scenario. Python support can likely be done (I would be thrilled if someone made a PR for it), but I don't know if there is a lot of potential for solving conflicts there: imports can have side effects, function arguments are complicated with the mixture of positional and keyword arguments, decorators are effectful… it seems to me that there is a lot of sensitivity to order in many places.

wetneb · 2024-11-09T12:35:45 1731155745

It's definitely something I would recommend in general, but I'm not sure if it would solve this particular problem (reordering blocks is perhaps a bit bold for a prettifier).

andybak · 2024-11-09T23:07:55 1731193675

Maybe prettifier was the wrong word. I've definitely used code formatting tools that offer sorting of certain syntax elements as a feature. (Sorting Python imports in PyCharm springs to mind.)