Searching a Million Lines of Lisp

alphapapa · on Sept 3, 2017

Wilfred is a genius. The OP article is from about a year ago, but it laid the foundation for the Emacs package he just released: http://www.wilfred.me.uk/blog/2017/08/30/helpful-adding-cont...

sooheon · on Sept 3, 2017

Ah this is where I've been seeing that domain before. Yeah, the other package suggest.el is also a gem.

olewhalehunter · on Sept 3, 2017

Tim Teitelbaum, one of the first researchers on IDEs, summarized problems like this in one of his papers:

"Programs are not text; they are hierarchical compositions of computational structures and should be edited, executed, and debugged in an environment that consistently acknowledges and reinforces this viewpoint."

It is unfortunate that most major code editors and IDEs today do not store units of code (like Lisp SEXPs) as databases to be updated as you edit, which would make problems like this or other operations like metaprogramming, formal analysis, or documentation much easier to solve.
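As a rough illustration of keeping code as queryable units rather than lines, here is a minimal sketch in Python. Everything in it is hypothetical: `read_sexps`, `callers`, and the two sample `defun` forms are made up for this example, and the toy reader ignores strings, comments, and quoting, so it is nowhere near a real Lisp reader.

```python
def read_sexps(text):
    """Parse a string of s-expressions into nested Python lists (toy reader)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # skip the closing paren
            return items
        return tok

    forms = []
    while pos < len(tokens):
        forms.append(read())
    return forms


def symbols(form):
    """Yield every atom inside a nested form."""
    if isinstance(form, list):
        for item in form:
            yield from symbols(item)
    else:
        yield form


def callers(index, name):
    """Names of indexed functions whose body mentions `name`."""
    return sorted(fn for fn, form in index.items() if name in symbols(form[3:]))


source = "(defun add (a b) (+ a b))\n(defun twice (x) (add x x))"

# The "database": one entry per top-level defun, keyed by function name.
index = {form[1]: form for form in read_sexps(source) if form[0] == "defun"}

print(callers(index, "add"))  # ['twice']
```

An editor that kept such an index up to date as you type could answer "who calls this?" without scanning text at all.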

skrebbel · on Sept 3, 2017

There are editors that do this. Notably, JetBrains MPS comes to mind. It looks like a text editor but once you use it you quickly notice that you're actually editing the abstract syntax tree directly.

It's cool, but it has some major downsides too. For example, MPS stores the source as XML, not text (since it isn't text, it's a tree). This makes lots of basic tools we've taken for granted a lot harder, such as git merging etc. They've had to make a custom mergetool just to make basic collaborative coding feasible.

I bet there are other ways around that; all I'm saying is that text has major, major upsides because of the enormous ecosystem support.

db48x · on Sept 3, 2017

> For example, MPS stores the source as XML, not text

If only there were a way to write your code using a uniform tree syntax in the first place...

coldtea · on Sept 3, 2017

>It's cool, but it has some major downsides too. For example, MPS stores the source as XML, not text (since it isn't text, it's a tree). This makes lots of basic tools we've taken for granted a lot harder, such as git merging etc. They've had to make a custom mergetool just to make basic collaborative coding feasible.

Doesn't solving this just require a text-to-AST, AST-to-text input and output step?

Anywhere outside the editor, the programmer just sees regular text.
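A minimal sketch of that round trip, using Python's standard `ast` module as a stand-in (MPS itself works differently, and the sample `source` string is invented here). It assumes Python 3.9+ for `ast.unparse`. Note what the round trip costs: the original spacing and the comment do not survive.

```python
import ast

source = "x = ( 1+2 )   # add\n"

tree = ast.parse(source)          # text -> AST (input step)
regenerated = ast.unparse(tree)   # AST -> text (output step)

print(regenerated)  # x = 1 + 2

# The regenerated text parses back to the same tree, so nothing the
# parser cares about was lost; only layout and comments were.
same_tree = ast.dump(ast.parse(regenerated)) == ast.dump(tree)
print(same_tree)  # True
```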

ilikebits · on Sept 6, 2017

The issue with this is that operations on text don't necessarily preserve a valid AST. Doing `git merge` on the plain text of a source file may result in invalid code, at which point you have other annoying questions to answer about how to handle text that doesn't parse into a valid AST.
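A small sketch of the failure mode, again using Python's `ast` module purely for illustration. The `merged` string is a hand-written imitation of what a conflicted text merge leaves behind, not real `git` output, and `parses` is a hypothetical helper.

```python
import ast


def parses(text: str) -> bool:
    """Return True if the text parses into a valid Python AST."""
    try:
        ast.parse(text)
        return True
    except SyntaxError:
        return False


ours = "def f():\n    return 1\n"
theirs = "def f():\n    return 2\n"

# Imitation of a conflicted merge result: both sides plus conflict markers.
merged = (
    "def f():\n"
    "<<<<<<< ours\n"
    "    return 1\n"
    "=======\n"
    "    return 2\n"
    ">>>>>>> theirs\n"
)

print(parses(ours), parses(theirs), parses(merged))  # True True False
```

Both inputs are valid trees, yet the merged text is not, so a tree-based editor has to decide what to show the user at that point.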

sitkack · on Sept 3, 2017

It would be nice to have a VCS that could work on the native MPS data structures.

Text is probably the next biggest mistake in programmer productivity after null.

icebraining · on Sept 3, 2017

But why? Tools can parse text just fine, and create their MPS structures in the background to do whatever they need. Why expose this to the programmer?

sitkack · on Sept 4, 2017

Rich structured editors free one from text and are able to encode other information that is not currently recorded in text formats. Directly operating on structures would free languages from parsing; correctness checking and compiling could occur at every semantically correct operation.

With a rich structure editor that can do merges, the undo history of edit and refactor operations could be persisted and merged into the VCS. Currently this isn't possible. Text is a projection for the page and a lowest common format.

icebraining · on Sept 4, 2017

> Directly operating on structures would free languages from parsing; correctness checking and compiling could occur at every semantically correct operation.

Directly operating on structures would mean that you'd have to write an editor, which would have to enforce correctness as well. And then you'd have to write a generator to save those structures in some kind of format that could be written to a file and passed around, and a parser to read that format. And check for correctness again, since who knows what generated that file.

As for constant compilation, that already exists; many IDEs have it. That's because parsing text is not actually hard; the other stages are.

> With a rich structure editor that can do merges, the undo history of edit and refactor operations could be persisted and merged into the VCS. Currently this isn't possible.

Of course it is: you could write a plugin for any IDE that would record edits and refactor operations and save those in or alongside the text (much as they'd have to be saved alongside the AST). Of course, that doesn't help if the user does a manual refactor, but that's no different from them choosing a node in the rich editor, deleting it, then manually recreating it in its refactored form.

shalabhc · on Sept 4, 2017

Instead of retrofitting a structured format on top of the current text-centric world, can we imagine whether a structure-centric world would be better? A large number of tools would exist to operate semantically on the same structured format, including editors, versioning systems, grep, etc. Diffs and merges would work better. Languages would define the syntax in terms of a tree input instead of text input, and so on.

icebraining · on Sept 5, 2017

We could, I'm just not convinced it would actually be better than text. Parsing is not a difficult problem.

shalabhc · on Sept 5, 2017

It's not about parsing being difficult or easy (e.g. you would still have to parse an abstract structure into a syntax tree specific to your language semantics). It's about making a structured form be the canonical baseline (instead of the canonical being a 'sequence of lines' i.e. text).

Consider that every programming language and every config language first invents a new syntax to encode a tree-like structure (typically using a combination of curly braces, other brackets, keywords, indentation etc.) but the code itself is saved as 'text'. This is a lossy encoding: all a generic reader such as `git` or `grep` can now infer is that the file contains a 'sequence of lines', so it can only offer line-based operations (git diffs are line based, grep searches are line based, etc.), when in fact a more meaningful operation would be based on the tree structure.

If a tree based format was the canonical baseline, diffs could display the location of the node added (e.g. 'Added <Class X> -> <Function Y>'), without having language specific parsing knowledge. Similarly, most editors could provide 'tree view' and 'jump-next', 'jump-up' etc based on context, again without knowing language specific details. Further, many internal representations of programs (e.g. intermediate representations in compilers) also use trees, and could potentially be exported into one of these forms, to make the plethora of tools work with them.

(BTW, I'm not saying a tree is the best generic structure to replace text, but just using it as an example to argue for advantages of a generalized extensible structure over plain text.)
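The 'Added <Class X> -> <Function Y>' style of diff can be sketched roughly as follows. This is only an approximation of the idea: it leans on Python's own `ast` module, which is exactly the language-specific parsing a canonical tree format would make unnecessary, and `definition_paths`, `tree_diff`, and the sample sources are hypothetical names invented for the example.

```python
import ast

DEFINITIONS = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def definition_paths(source: str) -> set:
    """Return paths like 'X -> y' for every class and function in source."""
    paths = set()

    def walk(node, prefix):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, DEFINITIONS):
                path = f"{prefix} -> {child.name}" if prefix else child.name
                paths.add(path)
                walk(child, path)
            else:
                walk(child, prefix)

    walk(ast.parse(source), "")
    return paths


def tree_diff(old: str, new: str) -> list:
    """Describe added and removed definitions by their position in the tree."""
    old_paths, new_paths = definition_paths(old), definition_paths(new)
    added = [f"Added {p}" for p in sorted(new_paths - old_paths)]
    removed = [f"Removed {p}" for p in sorted(old_paths - new_paths)]
    return added + removed


old = "class X:\n    def a(self):\n        pass\n"
new = old + "\n    def y(self):\n        return 1\n"

print(tree_diff(old, new))  # ['Added X -> y']
```

A line-based diff of the same change would report three added lines at some offset, with no indication of which class they belong to.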

sitkack · on Sept 5, 2017

Given the number of security vulns that boil down to broken parsing, I don't think this is true. Maybe it isn't difficult _for you_. It is still a difficult problem. By moving to structured editors, many more dimensions of data can be encoded into a program than can be cleanly represented by text.

Why do you argue so vehemently against someone pursuing an avenue of research?

noir_lord · on Sept 3, 2017

I thought about this a while ago, it would be nice if you could have all code stored as the AST and the formatting handled on ingestion/export.

It would finally settle the formatting arguments and all the tooling would be able to leverage each other's projects.

Even PHP has an internal AST representation these days.

shalabhc · on Sept 4, 2017

Take a look at Smalltalk, where you edit the live in-memory objects directly.

skybrian · on Sept 3, 2017

I'm not sure why you think they don't? Most IDEs do typically build a language-specific search index and keep it up to date as you edit. (That's the main difference between an IDE and a text editor, though the line is blurrier these days.)

Having taken Teitelbaum's compiler class as an undergrad, I was happy to ditch the IDE he inflicted on us and go back to text editors. Structured code editing is a tricky UI problem and I didn't find a really good IDE until many years later.

There's no one way to edit code. Sometimes refactoring tools work well, but typing text can be quite efficient too. Getting locked into a tree editor at the expression level is no fun.

olewhalehunter · on Sept 3, 2017

>I'm not sure why you think they don't?

If they are, they either aren't offering user/programmer access to that database, aren't doing semantic/type binds, or aren't advertising those features well.

>I was happy to ditch the IDE he inflicted on us and go back to text editors. Structured code editing is a tricky UI problem

It's reconcilable with normal text editing: just update the structure once it's valid. I agree that things like block/visual programming can be absurd.

skybrian · on Sept 3, 2017

I'm not sure what you mean by "semantic/type binds", but if you're writing a plugin, IDEs like Eclipse and IDEA do give you access to program syntax via a Java API, and for many languages there is also type-aware indexing. Typically this is exposed to the user as specific queries (such as "go to definition" or "find all usages") and updates ("rename method"). From a UI perspective, more features can be added by writing more plugins and/or improving them.

But this indexing is only on one user's workstation and tends not to scale up well. Updating dependencies or switching to a different branch means rebuilding large parts of the index.

Also, part of the problem is that there is little standardization. Many ecosystems are language, platform, build tool, and/or editor-specific. When you do something new you end up reinventing the wheel.

imtringued · on Sept 3, 2017

Especially in statically typed programs you want to "break" your program for refactoring purposes. Usually I look at a function or class and create new functions/classes and then rename the old one and then fix all the errors one by one by using the new code.

sitkack · on Sept 3, 2017

For those looking for the Teitelbaum paper mentioned above [0]

[0] The Cornell Program Synthesizer: A Syntax Directed Programming Environment https://core.ac.uk/download/pdf/21750999.pdf?repositoryId=14...

pkaye · on Sept 3, 2017

I think Microsoft did that in their Roslyn .NET compiler. There is a server that is fed source code changes and it updates the AST internally and incrementally. The text editor can then make queries against that AST. I believe Microsoft also took a portion of this and added it to Visual Studio Code.

tyingq · on Sept 3, 2017

GitHub's search went the other direction, where it's decidedly less useful than grep. It's case insensitive, and it drops search string characters that provide context, like quotes, =, $, etc.

https://help.github.com/articles/searching-code/#considerati...

vcdimension · on Sept 3, 2017

He should try speeding it up further by compiling to C-code using this: https://github.com/tromey/el-compilador