That code, starting on line 41, has convinced me to continue avoiding Python. Python has gone from a clear, concise language with one right way to do things, to outpacing C++ when it comes to adding esoteric language features.
that code is pattern matching on an ast node; the complexity is from the data structure involved, not the language. there is no more concise way to match a specific node and get at the relevant fields, just various different forms of verbose code.
If anything, this matching on an AST is very readable and elegant, and it just convinced me that Python's syntax is really well suited to writing a parser.
I’m actually genuinely curious what alternative you might present from any language. I always thought that was relatively concise for such a complex data structure.
The reason for using a match statement, even though there is only “one” case, is because it’s matching against a complex and (as you mentioned) deeply nested pattern. Lines 43-56 match the current AST node against the pattern (shown here as pseudo-python):
with open(_) as f:
    _ = f.read()
(Where _ can be anything, and f can be any name). The traditional alternative to lines 43-56 would be dozens of lines that check the top level of the AST, and if that matches what the top level of the pattern is expecting, diving one layer deeper and checking that level against the next level of the pattern, and so on, until you reach the leaves of your pattern. Once you're familiar with pattern matching syntax, this is much easier to quickly read/grok/audit.
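For what it's worth, here is a self-contained sketch of the same idea using toy dataclasses in place of the real mypy node types (all class and field names below are made up for illustration; Python 3.10+ for match):

from dataclasses import dataclass, field

@dataclass
class NameExpr:            # a bare name, e.g. `open` or `f`
    name: str

@dataclass
class MemberExpr:          # attribute access, e.g. `f.read`
    obj: NameExpr
    attr: str

@dataclass
class CallExpr:            # a call, e.g. `open(path)` or `f.read()`
    callee: object
    args: list = field(default_factory=list)

@dataclass
class AssignmentStmt:      # e.g. `contents = f.read()`
    lvalue: NameExpr
    rvalue: CallExpr

@dataclass
class WithStmt:            # e.g. `with open(path) as f: ...`
    ctx: CallExpr
    target: NameExpr
    body: list

def matches_open_read(node) -> bool:
    match node:
        # One pattern describes the whole shape:
        #     with open(<anything>) as f:
        #         <anything> = f.read()
        # binding `f` on the way down so the guard can check that `.read()`
        # is called on that same name.
        case WithStmt(
            ctx=CallExpr(callee=NameExpr(name="open")),
            target=NameExpr(name=f),
            body=[AssignmentStmt(
                rvalue=CallExpr(callee=MemberExpr(obj=NameExpr(name=obj), attr="read"), args=[]),
            )],
        ) if obj == f:
            return True
    return False

node = WithStmt(
    ctx=CallExpr(NameExpr("open"), [NameExpr("path")]),
    target=NameExpr("f"),
    body=[AssignmentStmt(NameExpr("contents"), CallExpr(MemberExpr(NameExpr("f"), "read")))],
)
print(matches_open_read(node))  # True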
One of the projects I occasionally poke at is a BURS (bottom-up rewrite system) generator used to do cost-based tree rewriting. I think this is a concept that might make sense in that context, just need to find the time to play…
Does "code smell" just mean "style I don't like"? Most of your concerns have nothing to do with what the code is actually doing.
Lines 43-55 are the pattern of the case clause (the scrutinee is the `node` being matched). It only binds values to the given variables and enters the block if `node` matches the quite complex structure shown there.
The alternative would be doing something repetitive like this:
if not isinstance(node, WithStmt):
    return
expr = node.expr
if len(expr) != 1:
    return
expr = expr[0]
if not isinstance(expr, CallExpr):
    return
callee = expr.callee
if not (isinstance(callee, NameExpr) and callee.name == 'open'):
    return
... etc ...
You could do the same thing with duck typing and try/except, but the semantics would be different (sometimes you DO want nominal typing, even in python).
Why not just something similar to (forgive the pascal)
if isinstance(node, WithStmt) AND
   (len(node.expr) = 1) AND
   isinstance(node.expr, CallExpr) AND
   isinstance(node.expr.callee, NameExpr) AND
   (node.expr.callee.name = 'open') then
begin
  do stuff here
end;
> I can't even figure out what lines 43-55 do, but that's a lot of nesting
It's a very elegant pattern match on a deeply nested structure. It's declarative, not imperative, so if you are concerned about "nesting complexity", you need not be.
I honestly don't understand what's going on... but in Pascal I'd use short circuit expression evaluation to make the comparisons as I go into the depths.
There are too many layers of nesting, in too foreign a syntax, for me to even parse.
I suspect it could be done in a clearer manner in Pascal.
In addition to some of the other answers about the match statement, mCoding has this excellent video[0] explaining the match statement, specifically using it to match complex AST trees, like Refurb does. Note, not all of these checks use the match statement[1], sometimes it is more verbose to use a match statement vs a normal if/else statement.
From reading these comments, I get the impression that python developers don't really want to modernize their codebases.
This fits with my experience of python developers resisting new language features like types and even python v 3.
I'm an ex-Python developer. (I didn't survive 2 to 3.)
What I see happening is Python getting "improved to death". As people keep adding "features", aka complexity, to Python it's moving away from its niche and coming into more direct competition with other languages and ecosystems (e.g. Rust, Go, Nim, Haskell, OCaml, even Ada and Java!)
From my POV this tool, though technically very neat, does unnecessary work to make things harder to understand.
Try to see it this way: all this Red Queen's Race, this running hard to stay in the same place, the endless "improving" of languages like Python and JS, are an attack on your knowledge and skills: just sitting there you are becoming obsolete not because your knowledge is degrading but because the youngsters keep changing the tune forcing you to learn new dances just to stay relevant.
There is very little new under the sun in IT: things like type checking and inference are decades old.
As someone new to python, I don’t understand this attitude.
I have decades of experience, having shipped code in Swift, Obj-C, Java, Scala, Clojure, OCaml, C, C++.
A few weeks ago I was asked to write something substantial in python. I have used python in the past for scripts of < 10 lines, so the syntax is not alien.
I have been extremely impressed by the experience. Async code just works and is very efficient. The match statement is very nice. Interacting with the OS is intuitive and well supported. I’m hard pressed to think of a language that would be more concise for what I’m doing.
I’ve had the same experience with JS recently too. It’s a good language now, and it decidedly wasn’t prior to ES6.
If I wind back 10 or 20 years, there is just no way the languages as they were then were as capable as they are now. I can do much more with much less code.
The languages may be more complex, but they enable my code to be less complicated, and me to get more done.
It sounds like you have the experience to use Python well, and haven't (yet) had to work with problematic Python code. I don't mean this dismissively, I think a lot of things are "trajectory dependent" if you will. E.g. I'm just starting out with OCaml after years of mostly Python and I'm pretty stoked, but I haven't yet had time to find the rough spots and warts, eh?
It sounds as if you’ve had to maintain a bad codebase in Python. I don’t think any language will save you from that. OCaml (a contender for my favorite language) certainly won’t. My comment is about what the language does for me when I write my own code.
Dropbox is the only one I know of but as far as I know they have started migrating to Go. I guess they'll never fully migrate because they have a lot of code.
To the example, why recommend replacing a list with a tuple when it is about to be iterated through? I’d choose a tuple if it was to be destructured, but not for this. I know it works either way, but this one feels off.
Because tuples cannot change over time, they are (slightly) faster to create compared to lists. Like you mentioned, they are being created just to be iterated over. It is a small enough performance improvement that it's more of a style choice than anything.
Bonus fact: You can use set iteration in for loops as well! This has the added benefit of sorting the values as well:
>>> for x in {2, 1, 3}:
...     print(x)
1
2
3
You didn't ask for that, but I felt like sharing it, so there you go.
Sets are unordered. It doesn't even make logical sense when you think about it: it would make set insertion O(log n), which would absolutely wreck the performance. Your example just happens to accidentally work. Here's an example where it doesn't:
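For instance (observed on CPython; the exact order is an implementation detail and could differ):

>>> for x in {8, 1, 2}:
...     print(x)
8
1
2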
Fun fact: not only is the iteration order of sets in Python unspecified, it's also nondeterministic! A set of the same values, iterated in two Python programs, could iterate in a different order.
I learned this when a unit test that iterated over a set started failing sporadically :P super confusing bug!
You're right! I guess you can't rely on sets being iterated over in order (not that you ever should rely on that). Mostly I was trying to show that you can use lists, tuples, and sets as temporary iterable objects, though no-one (that I have seen) uses sets in that way.
Sets are more useful when you want to dedupe unknown data. Defining a set with constants to use immediately in a for loop is not very helpful compared to a tuple or a list.
Also, you can make anything iterable in python if you implement __iter__.
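A minimal sketch of that (Countdown is a made-up class, just to show the hook):

class Countdown:
    # Anything whose __iter__ returns an iterator can be used directly in a for loop.
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return iter(range(self.start, 0, -1))

for n in Countdown(3):
    print(n)   # prints 3, 2, 1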
I didn't claim it wasn't O(1). I just said that for sets to automatically be sorted like the parent comment claimed, it would have to be O(log n) insertion, since that's the fastest you can insert into a sorted data structure. Obviously sets are not sorted, and thus are able to maintain O(1) insertion.
My guess is that it is because, for ints, `x == hash(x)`. This means that if your integers are smaller than the number of buckets in the hashtable, they will fall into buckets in order, and the iteration is likely just over the buckets.
For example, by adding a large power of two to the numbers you can see that they are ordered by the remainder (as it seems that the default number of buckets is a power of two):
>>> for x in {2048 + 1, 256 + 3, 1024 + 2, 512 + 4}: print(x)
2049
1026
259
516
I don't think this is related. `set()` uses `hash(x)`; it doesn't care what `id(x)` is. The example I provided works equally well with small numbers and large numbers.
> Because tuples cannot change over time, they are (slightly) faster to create compared to lists. Like you mentioned, they are being created just to be iterated over. It is a small enough performance improvement that it's more of a style choice than anything.
But that's just wrong. Tuples are not just immutable lists. They're for heterogeneous collections and are meant to be destructured or otherwise have their elements acted on individually. Lists are for homogeneous collections and they're meant to be looped over. Immutability is neither here nor there. The performance benefit is negligible and irrelevant.
If you don't believe me, look at how type hints work: an arbitrary-length list of integers is typed like `list[int]`, but you have to write an arbitrary-length tuple of integers like `tuple[int, ...]`. It looks different -- why? Because the way you're meant to use tuples is like `tuple[int, int, str, datetime]`, where there's a fixed number of elements and each one has a distinct meaning, where it usually doesn't make sense to loop over and treat them the same.
It's also why there's list comprehension but no tuple comprehension.
The suggestion in the example is just plain incorrect.
>But that's just wrong. Tuples are not just immutable lists. They're for heterogeneous collections and are meant to be destructured or otherwise have their elements acted on individually. Lists are for homogeneous collections and they're meant to be looped over. Immutability is neither here nor there. The performance benefit is negligible and irrelevant.
I've never read this opinion before, so I don't think it was intended from the get-go or became a commonly adopted standard. I can see some sense to it, though.
But lists aren't just for looping. They often get looped over, but that's true for any collection, even dictionaries.
>It's also why there's list comprehension but no tuple comprehension.
That's because (x for x in y) was chosen for making generators. Just stick tuple in front and you've got a tuple comprehension.
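For instance, a quick illustration:

squares = tuple(x * x for x in range(5))   # generator expression handed to tuple()
print(squares)  # (0, 1, 4, 9, 16)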
>Tuples are immutable sequences, typically used to store collections of heterogeneous data (such as the 2-tuples produced by the enumerate() built-in). Tuples are also used for cases where an immutable sequence of homogeneous data is needed (such as allowing storage in a set or dict instance).
>Tuples are for heterogeneous data, list are for homogeneous data.
>Tuples are not read-only lists.
Immutable sequence is a use-case, but it's a secondary use-case. The primary use-case is for record-type data.
>But lists aren't just for looping. They often get looped over, but that's true for any collection, even dictionaries.
My point was more that tuples are not primarily suited for looping over, not that lists exclusively are.
>That's because (x for x in y) was chosen for making generators. Just stick tuple in front and you've got a tuple comprehension.
But list comprehensions were in the language for several years before generators came around. There was a time when [x for x in foo] worked but (x for x in foo) was a syntax error. Why didn't they make it a tuple at the time? Because that's not what tuples are for.
If you look at everywhere tuples show up in the standard library, it's almost always in some context where looping over the collection is not a primary consideration. The clearest example of this dichotomy is in the string methods: the str.split method gives you a list of strings, because the result can be of arbitrary length and you're likely going to loop over it. But str.partition gives a tuple because there are always exactly 3 elements in the result and you're meant to destructure it.
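A quick illustration of that contrast:

>>> "a,b,c".split(",")          # arbitrary length, loop over it -> list
['a', 'b', 'c']
>>> "key=value".partition("=")  # always exactly 3 parts, destructure it -> tuple
('key', '=', 'value')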
Note that the current official documentation (which you cite), unlike Guido’s 2003 email (which you also cite), explicitly supports tuples as both heterogeneous records and immutable homogeneous sequences.
They aren’t just immutable lists logically, but since Python (while it can use a static typechecker for verifying correctness to some extent) doesn’t use type information about what is stored in containers for storage implementation, effectively at runtime they are just immutable lists.
But, even logically, and apart from that implementation consideration, immutable lists are one of the things they are, are intended to be, and are designed to be used for.
> They're for heterogeneous collections and are meant to be destructured or otherwise have their elements acted on individually.
That is another use case, yes. The existence of that use case does not negate the immutable list use case.
> It looks different -- why?
tuple type hints have a form (that you point out) for homogeneous arbitrary-length immutable lists as well as one for fixed-length, potentially-heterogeneous ordered collections, because tuples are intended to serve both purposes, not just the latter.
> It's also why there's list comprehension but no tuple comprehension.
No, genexps using () and having broader utility is why there are no tuple comprehensions. Heck, if genexps were around first, we might not have list, set, or dictionary comprehensions, either.
Here's another bit of syntactic trivia - you can leave the brackets out entirely to (arguably) make it clearer that you don't care about the data structure, and it will implicitly use a tuple:
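Something like this (a minimal illustration):

>>> for x in 1, 2, 3:   # the bare literals form a tuple
...     print(x)
1
2
3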
>... This has the added benefit of sorting the values as well
I don't think you can rely on this? Happy to be proven wrong, but set does not guarantee iteration order. There may be a distinction if all elements are available at set construction, but that seems like a fiddly rule I would rather avoid.
You are confusing the two different uses of `in`. There's no reason to use a set for a temporary iteration container; but using a set is ideal and - in theory, for large n or expensive equality check - more performant as a temporary containment container:
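A minimal sketch of the containment case (the names here are made up for illustration):

ALLOWED_MODES = {"read", "write", "append"}   # built once; average O(1) membership checks

def is_allowed(mode: str) -> bool:
    return mode in ALLOWED_MODES              # `in` here means containment, not iteration

print(is_allowed("write"))   # True
print(is_allowed("delete"))  # False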
If you look at the disassembly, there's no difference in python 3 (I only tested 3.7+), so it's a matter of style.
There used to be a difference in python2 -- the list option would turn into n LOAD_CONSTs and a BUILD_LIST, whereas the tuple option would just emit one LOAD_CONST.
import dis

def with_tuple():
    for i in (1, 2, 3):
        print(i)

def with_list():
    for i in [1, 2, 3]:
        print(i)

print("with_tuple")
dis.dis(with_tuple)
print("with_list")
dis.dis(with_list)
Lists can be mutated and tuples can't, so when you want a default iterable parameter in a function, for example, you would want to use a tuple. I assume this particular rule is about trying to stay consistent with that practice, even if you only access the iterator of the list and not the list itself.
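A small illustration of why the tuple default is the safer habit (function names are made up):

def risky(items=[]):    # one shared list object is reused across every call
    items.append(1)
    return items

def safe(items=()):     # a tuple default can't be mutated in place
    return [*items, 1]

print(risky())  # [1]
print(risky())  # [1, 1] -- state leaked from the previous call
print(safe())   # [1]
print(safe())   # [1]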
I'm not sure I understand what value pathlib is providing here. The explanation is to avoid the context manager / multi-line.
Why though? What is saving you one line really getting? It's not really improving readability in any way.
An alternative one liner could be
contents = open("foo").read(100)
No need to import another library and use its features.
Added to which, pathlib has some speed issues. Pathlib has some really neat advantages and should be favoured over most of the os library's file handling, but not as a replacement for a simple open.
Python is reference counted and it is very clear in this code that a reference to the file isn't leaked, so the file will be reliably closed at the end of the statement.
Some people don't like to rely on the reference counter but I think in simple cases like this where only a temporary is created it is quite safe.
> it is very clear in this code that a reference to the file isn't leaked
Is it? What if an exception is thrown? See e.g. this SO answer[1]:
> However if an exception is thrown [...] then a reference to the file will be kept as part of the stack trace in the traceback object and the file will not be closed, at least until the next exception is thrown.
So your code, if called from somewhere else inside a try block, leaks a filehandle potentially indefinitely.
> the file will be reliably closed at the end of the statement.
The only guarantee the Python language makes is that the file may eventually be closed once all references are gone. The fact that CPython closes it immediately and reliably in this specific case is an implementation detail of its reference counting and should not be relied upon.
Use context managers when things need to be closed and save yourself trouble down the line.
I looked at the example for some time now, and I don't understand what exactly the problem is. Is it the missing `import pathlib` and then `pathlib.Path...`? Is using `read_text()` bad style?
I'm by no means a Python expert, but as far as I can tell, those two snippets do wildly different things. `.read(100)` would read 100 bytes from the file, while `.read_text()` takes the entire file and returns a decoded string. Running the former on a 100GB file will work OK; running the latter on the same file will lead to OOM (unless you actually have 100+ GB of free memory).
As I understand, the snippets in the suggestions are not supposed to be applied verbatim, but as generic proposals that need to be tweaked for the specific example (or ignored completely).
Note that it doesn't say to replace `.read(100)` with `.read_text()`; it uses the default version (without arguments) for both. (That being said, `read_text` doesn't have a limit argument, but that is a completely different issue.)
So you're right - if you're applying the suggestion verbatim without thinking, you will get incorrect results; but the suggestions were not intended for that.
The example shown on the Refurb GitHub page is a little unfortunate. As you say, there is also a potential scalability problem with the original design if the input file is too large to hold in memory in its entirety.
In the example, that issue could be avoided with minimal cost by reading the file lazily:
with open(filename) as f:
    for line in f:
        do_same_things_with(line)
The specific change Refurb suggests instead doesn’t have this benefit.
However, in general switching to a lazy design won’t always be as safe and easy as it is here, so I’m not sure what I’d ideally want a tool to say in this situation. You’d need much more sophisticated analysis of the original code’s design if you wanted to do something like warning about unnecessarily reading an entire file and suggesting the lazy alternative. It doesn’t seem like a good idea to warn about all uses of the functions that will read the whole file at once, because often with Python scripts that’s a simple and effective way to get the required job done, and the added complexity of a more scalable design isn’t needed or justified.
The open() variant reads the first 100 text characters, not 100 bytes, because open() works in “text mode” by default, so the two are not _that_ different. Your general point still stands, though: the memory issue is the main problem with the “suggestion” here.
Yes, I completely missed the `100` parameter to `read`. And as a sibling comment said, I also understand them to be suggestions, but in this case, I think `read()` and `read(param)` should be treated as two separate cases, or at least this caveat made clear.
Also, I don't know if this was part of your critique, but are you also mentioning that the "a" variable is not in the suggestion? The reason behind that is that long names, such as "super_special_file", would greatly increase the length of the error messages, making it hard to see the "meat" of what is being changed. Hence the x and y naming.
Perhaps I should add a flag to allow verbatim (or close to it) errors. Recreating the source lines exactly will be pretty hard, but I think we can do better here.
Perhaps substitute the actual variables if they are simple and short, otherwise do something like:
> main.py:1:1 [FURB101]: Use `y = pathlib.Path(x).read_text()` instead of `with open(x, ...) as f: y = f.read()`, where y=my_important_data and x=super_special_file
Btw: thanks, your tool suggested a handful of things I didn't know about, and pushed me towards using pathlib instead of relying on my muscle memory to do the laborious `os.path.join(..., ...)`.
One request I would have is to spell out the full name of suggested imports. I had never heard of `contextlib.suppress`. And in the above message, I think it will be obvious that you could do `from pathlib import Path` instead of the literal text in the suggestion.
I think this would be a good change. I agree that the imported names should be spelled out: You shouldn't have to go hunting down the docs before you go and change anything. Again, this would increase horizontal space, but I think it is for the better (the errors are already pretty long anyways).
I understand that some of these suggestions have been made possible by recent changes to Python 3. But how do I know whether a specific change suggestion will genuinely improve my codebase vs. just confuse Python developers who are used to seeing things implemented a certain way (e.g., pathlib vs open())? It would certainly make sense to add more information on the tradeoffs and what the long-term benefits of writing it a certain way will be.
I agree, better clarification/classification of errors would be nice, as well as the different tradeoffs you should expect from using a particular check. In particular with the Path() related checks, they can seem contrived when used in isolation (open() is a pretty common idiom, like you said), but they become really powerful when you chain them together:
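For example, a few pathlib calls composed together (the paths and names here are hypothetical, just to show the chaining):

from pathlib import Path

config = Path.home() / ".config" / "myapp" / "settings.toml"  # build paths with /
config.parent.mkdir(parents=True, exist_ok=True)              # ensure the directory exists
if not config.exists():
    config.write_text("[defaults]\n")                         # write in one call
backups = sorted(config.parent.glob("*.bak"))                 # query related files with the same API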
Similar to black, Refurb is opinionated. There are probably going to be checks which you will ignore. Somewhat of a side goal with Refurb is bringing to light some of the different ways of writing Python code, and the pathlib module is, IMHO, a very underutilized part of the stdlib.
I agree, pathlib is really powerful, and I can see myself using it for new developments.
May I then assume that Refurb is better suited to work on new code? To get back to open(), I don’t know if I would want to go through my company’s legacy code (which runs on Python 3.10) and start replacing battle-tested code everywhere. Same would go for tuple literals over lists in certain circumstances.
I think this will go really well with choosing an "acceptable" subset of Python for your project, similar to how some companies choose a subset of C++ and then stick with that throughout a project.
I think that using Refurb on new projects would be a great use case, but even for existing/old/legacy projects, Refurb can still be a good option, it really just depends on what is already there.
I agree that going and updating your company's codebase would probably be a bad idea, considering that there are some kinks here and there. Refurb tries its best, but it is very early on in its development, so errors that are emitted should be taken as suggestions, not gospel.
Have to check it out. I wish there was a reasoning behind the suggestions though. Why choose a tuple instead of a list? I won't learn much from just following along. I am also much more likely to change something when I know the reason.
Yes, there is a --explain flag for explaining the checks in more detail. I thought about adding a "use --explain ERR for more info" note at the bottom if there are 1 or more errors; it probably would be best to add this in.
Are the authors aware of Sourcery[^1]? I've been using it for a long time to clean up and modernize my Python codebases, and am wondering how Refurb can either supplant Sourcery or augment my usage of it?
I have never heard of Sourcery, but I will take a look at it. From what I can tell, Refurb is a CLI only tool (at least, for now), whereas Sourcery is more aimed at IDEs. Refurb should play along nicely with most other linters, though you may need to configure things first.
I got a good chuckle out of them having a PyCharm plugin. I'm not going to create an account just to kick the tires on that, but I'd be stunned if they out-magick PyCharm
I'm just getting into the python ecosystem in a professional capacity. How does refurb compare to the landscape of other python linters and static analyzers? Is it meant to be used in concert with others or to be more of a one stop shop?
It looks like the checks are in the category of standardizing on modern python syntax features, sort-of one step up from an autoformatter like yapf or black. Is this correct?
Yes, I would say that is correct! Like I mention at the very bottom of the README, Refurb is not meant to be an all-in-one linter or formatter: Tools like black, flake8, mypy, etc. already exist, so trying to recreate what they do would be a lot of work. Refurb is more of an addition on top of the tools I previously mentioned, and less of an altogether replacement.
Interesting, I have never heard of this project before! From looking at the README, Refurb is more focused on simplifying codebases and making them more expressive, whereas pyupgrade seems to have the goal of upgrading you from version A to version B (specifically Python 2 to 3). Near the bottom there are some newer Python version (3.6+) related upgrades, which I was considering adding to Refurb. I expect there to be some overlap between my project and some of the existing projects out there, so I will have to weigh the costs of adding these in vs leaving them out.
The reason I didn't just create a pylint/flake8/mypy plugin is because they all use a very specific methodology, whereas I wanted the freedom to architect it the way I wanted. Also, I really needed good type information, which is why I used mypy[0] as the AST parser/type checker.
I don't know what it takes to make a pylint plugin, but maybe it's cleaner to have a standalone `refurb` and then a `pylint-refurb` could be very simple?
I didn't know about read_text() either and I was even somewhat embarrassed when I looked it up in the docs and discovered that it (and its bytes/write counterparts) were added in 3.5 which is basically forever ago.
why replace open() with the Path().read_text()? the open() API transcends Python and is almost ubiquitous. that suggestion seems frivolous.
while we are talking modern Python, the type hints fever seems to be receding. such a good thing in my opinion. i expected this tool to bark about types but am glad the author didn't go that far.
Something that I forgot to mention in the other answers similar to this is that the Path().read_text() check only applies if the open() block is a single line, and that line is only used to read the contents of the file. If you are doing other things in that block, no error will be emitted. I agree that open() is pretty ubiquitous, but in certain cases (i.e., if I am already using a Path object), using read_text() is a nice one-liner.
read_text() is simpler and _much_ clearer (and thus, Pythonic) when you're doing trivial file operations like reading the contents into a variable. Opening up a whole context manager is still an option when you have to do more complicated stuff to the file.
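Roughly the split I'd draw, as an illustration ("notes.txt" is a hypothetical file):

from pathlib import Path

# Trivial read: one expression, nothing to close manually.
contents = Path("notes.txt").read_text()

# More involved, line-by-line work still reads naturally with a context manager.
with open("notes.txt") as f:
    for line in f:
        print(line.rstrip())   # stand-in for more involved per-line processing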