I've played with it the whole day (so take it with a grain of salt). My gut feel...

I've played with it the whole day (so take it with a grain of salt). My gut feeling is that it can produce a bigger ... "thing". I am calling it a "thing", because it looks very much as what you want, but the bigger it is - the more the chances of it being subtly (or not) wrong.

I usually ask the models to extend a small parser/tree-walking interpreter with a compiler/VM.

Up until Claude 3.7 the models would propose something lazy and obviously incomplete. 3.7 generated something that looks almost right, mostly works, but is so overcomplicated and broken in such a way, that I rather delete it and write it from scratch. Trying to get the model to fix it resulted in running in circles, spitting out pieces of code that didn't fit the existing ones etc.

Not sure if I prefer the former or the latter tbh.