That code, starting on line 41, has convinced me to continue avoiding Python. Python has gone from a clear, concise language with one right way to do things, to outpacing C++ when it comes to adding esoteric language features.
that code is pattern matching on an ast node; the complexity is from the data structure involved, not the language. there is no more concise way to match a specific node and get at the relevant fields, just various different forms of verbose code.
If anything, this matching on an AST is very readable and elegant, and it just convinced me that Python's syntax is really well suited to writing a parser.
I’m actually genuinely curious what alternative you might present from any language. I always thought that was relatively concise for such a complex data structure.
The reason for using a match statement, even though there is only “one” case, is because it’s matching against a complex and (as you mentioned) deeply nested pattern. Lines 43-56 match the current AST node against the pattern (shown here as pseudo-python):
with open(_) as f:
    _ = f.read()
(Where _ can be anything, and f can be any name). The traditional alternative to lines 43-56 would be dozens of lines that check the top level of the AST, and if that matches what the top level of the pattern is expecting, diving one layer deeper and checking that level against the next level of the pattern, and so on, until you reach the leaves of your pattern. Once you're familiar with pattern matching syntax, this is much easier to quickly read/grok/audit.
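For what it's worth, here is a self-contained sketch of the same idea using toy dataclasses in place of the real mypy node types (all class and field names below are made up for illustration; Python 3.10+ for match):

from dataclasses import dataclass, field

@dataclass
class NameExpr:            # a bare name, e.g. `open` or `f`
    name: str

@dataclass
class MemberExpr:          # attribute access, e.g. `f.read`
    obj: NameExpr
    attr: str

@dataclass
class CallExpr:            # a call, e.g. `open(path)` or `f.read()`
    callee: object
    args: list = field(default_factory=list)

@dataclass
class AssignmentStmt:      # e.g. `contents = f.read()`
    lvalue: NameExpr
    rvalue: CallExpr

@dataclass
class WithStmt:            # e.g. `with open(path) as f: ...`
    ctx: CallExpr
    target: NameExpr
    body: list

def matches_open_read(node) -> bool:
    match node:
        # One pattern describes the whole shape:
        #     with open(<anything>) as f:
        #         <anything> = f.read()
        # binding `f` on the way down so the guard can check that `.read()`
        # is called on that same name.
        case WithStmt(
            ctx=CallExpr(callee=NameExpr(name="open")),
            target=NameExpr(name=f),
            body=[AssignmentStmt(
                rvalue=CallExpr(callee=MemberExpr(obj=NameExpr(name=obj), attr="read"), args=[]),
            )],
        ) if obj == f:
            return True
    return False

node = WithStmt(
    ctx=CallExpr(NameExpr("open"), [NameExpr("path")]),
    target=NameExpr("f"),
    body=[AssignmentStmt(NameExpr("contents"), CallExpr(MemberExpr(NameExpr("f"), "read")))],
)
print(matches_open_read(node))  # True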
One of the projects I occasionally poke at is a BURS (bottom-up rewrite system) generator used to do cost-based tree rewriting. I think this is a concept that might make sense in that context, just need to find the time to play…
Does "code smell" just mean "style I don't like"? Most of your concerns have nothing to do with what the code is actually doing.
Lines 43-55 are the pattern of the case clause (the scrutinee is the `node` being matched). It only binds values to the given variables and enters the block if `node` matches the quite complex structure shown there.
The alternative would be doing something repetitive like this:
if not isinstance(node, WithStmt):
    return
expr = node.expr
if len(expr) != 1:
    return
expr = expr[0]
if not isinstance(expr, CallExpr):
    return
callee = expr.callee
if not (isinstance(callee, NameExpr) and callee.name == 'open'):
    return
... etc ...
You could do the same thing with duck typing and try/except, but the semantics would be different (sometimes you DO want nominal typing, even in python).
Why not just something similar to (forgive the pascal)
if isinstance(node, WithStmt) AND
   (len(node.expr) = 1) AND
   isinstance(node.expr, CallExpr) AND
   isinstance(node.expr.callee, NameExpr) AND
   (node.expr.callee.name = 'open') then
begin
  do stuff here
end;
> I can't even figure out what lines 43-55 do, but that's a lot of nesting
It's a very elegant pattern match on a deeply nested structure. It's declarative, not imperative, so if you are concerned about "nesting complexity", you need not be.
I honestly don't understand what's going on... but in Pascal I'd use short circuit expression evaluation to make the comparisons as I go into the depths.
There are too many layers of nesting, in too foreign a syntax, for me to even parse.
I suspect it could be done in a clearer manner in Pascal.
In addition to some of the other answers about the match statement, mCoding has this excellent video[0] explaining the match statement, specifically using it to match complex AST trees, like Refurb does. Note, not all of these checks use the match statement[1], sometimes it is more verbose to use a match statement vs a normal if/else statement.
From reading these comments, I get the impression that python developers don't really want to modernize their codebases.
This fits with my experience of python developers resisting new language features like types and even python v 3.
I'm an ex-Python developer. (I didn't survive 2 to 3.)
What I see happening is Python getting "improved to death". As people keep adding "features", aka complexity, to Python it's moving away from its niche and coming into more direct competition with other languages and ecosystems (e.g. Rust, Go, Nim, Haskell, OCaml, even Ada and Java!)
From my POV this tool, though technically very neat, does unnecessary work to make things harder to understand.
Try to see it this way: all this Red Queen's Race, this running hard to stay in the same place, the endless "improving" of languages like Python and JS, are an attack on your knowledge and skills: just sitting there you are becoming obsolete not because your knowledge is degrading but because the youngsters keep changing the tune forcing you to learn new dances just to stay relevant.
There is very little new under the sun in IT: things like type checking and inference are decades old.
As someone new to python, I don’t understand this attitude.
I have decades of experience, having shipped code in Swift, Obj-C, Java, Scala, Clojure, OCaml, C, C++.
A few weeks ago I was asked to write something substantial in python. I have used python in the past for scripts of < 10 lines, so the syntax is not alien.
I have been extremely impressed by the experience. Async code just works and is very efficient. The match statement is very nice. Interacting with the OS is intuitive and well supported. I’m hard pressed to think of a language that would be more concise for what I’m doing.
I’ve had the same experience with JS recently too. It’s a good language now, and it decidedly wasn’t prior to ES6.
If I wind back 10 or 20 years, there is just no way the languages as they were then were as capable as they are now. I can do much more with much less code.
The languages may be more complex, but they enable my code to be less complicated, and me to get more done.
It sounds like you have the experience to use Python well, and haven't (yet) had to work with problematic Python code. I don't mean this dismissively, I think a lot of things are "trajectory dependent" if you will. E.g. I'm just starting out with OCaml after years of mostly Python and I'm pretty stoked, but I haven't yet had time to find the rough spots and warts, eh?
It sounds as if you’ve had to maintain a bad codebase in Python. I don’t think any language will save you from that. OCaml (a contender for my favorite language) certainly won’t. My comment is about what the language does for me when I write my own code.
Dropbox is the only one I know of but as far as I know they have started migrating to Go. I guess they'll never fully migrate because they have a lot of code.
To the example, why recommend replacing a list with a tuple when it is about to be iterated through? I’d choose a tuple if it was to be destructured, but not for this. I know it works either way, but this one feels off.
Because tuples cannot change over time, they are (slightly) faster to create compared to lists. Like you mentioned, they are being created just to be iterated over. It is a small enough performance improvement that it's more of a style choice than anything.
Bonus fact: You can use set iteration in for loops as well! This has the added benefit of sorting the values as well:
>>> for x in {2, 1, 3}:
...     print(x)
1
2
3
You didn't ask for that, but I felt like sharing it, so there you go.
Sets are unordered. It doesn't even make logical sense when you think about it: it would make set insertion O(log n), which would absolutely wreck the performance. Your example just happens to accidentally work. Here's an example where it doesn't:
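For instance (observed on CPython; the exact order is an implementation detail and could differ):

>>> for x in {8, 1, 2}:
...     print(x)
8
1
2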
Fun fact: not only is the iteration order of sets in Python unspecified, it's also nondeterministic! A set of the same values, iterated in two Python programs, could iterate in a different order.
I learned this when a unit test that iterated over a set started failing sporadically :P super confusing bug!
You're right! I guess you can't rely on sets being iterated over in order (not that you ever should rely on that). Mostly I was trying to show that you can use lists, tuples, and sets as temporary iterable objects, though no-one (that I have seen) uses sets in that way.
Sets are more useful when you want to dedupe unknown data. Defining a set with constants to use immediately in a for loop is not very helpful compared to a tuple or a list.
Also, you can make anything iterable in python if you implement __iter__.
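A minimal sketch of that (Countdown is a made-up class, just to show the hook):

class Countdown:
    # Anything whose __iter__ returns an iterator can be used directly in a for loop.
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return iter(range(self.start, 0, -1))

for n in Countdown(3):
    print(n)   # prints 3, 2, 1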
I didn't claim it wasn't O(1). I just said that for sets to automatically be sorted like the parent comment claimed, it would have to be O(log n) insertion, since that's the fastest you can insert into a sorted data structure. Obviously sets are not sorted, and thus are able to maintain O(1) insertion.
My guess is that it is because, for ints, `x == hash(x)`. This means that if your integers are smaller than the number of buckets in the hashtable, they will fall into buckets in order, and the iteration is likely just over the buckets.
For example, by adding a large power of two to the numbers you can see that they are ordered by the remainder (as it seems that the default number of buckets is a power of two):
>>> for x in {2048 + 1, 256 + 3, 1024 + 2, 512 + 4}: print(x)
2049
1026
259
516
I don't think this is related. `set()` uses `hash(x)`; it doesn't care what `id(x)` is. The example I provided works equally well with small numbers and large numbers.
> Because tuples cannot change over time, they are (slightly) faster to create compared to lists. Like you mentioned, they are being created just to be iterated over. It is a small enough performance improvement that it's more of a style choice than anything.
But that's just wrong. Tuples are not just immutable lists. They're for heterogeneous collections and are meant to be destructured or otherwise have their elements acted on individually. Lists are for homogeneous collections and they're meant to be looped over. Immutability is neither here nor there. The performance benefit is negligible and irrelevant.
If you don't believe me, look at how type hints work: an arbitrary-length list of integers is typed like `list[int]`, but you have to write an arbitrary-length tuple of integers like `tuple[int, ...]`. It looks different -- why? Because the way you're meant to use tuples is like `tuple[int, int, str, datetime]`, where there's a fixed number of elements and each one has a distinct meaning, where it usually doesn't make sense to loop over and treat them the same.
It's also why there's list comprehension but no tuple comprehension.
The suggestion in the example is just plain incorrect.
>But that's just wrong. Tuples are not just immutable lists. They're for heterogeneous collections and are meant to be destructured or otherwise have their elements acted on individually. Lists are for homogeneous collections and they're meant to be looped over. Immutability is neither here nor there. The performance benefit is negligible and irrelevant.
I've never read this opinion before, so I don't think it was intended from the get-go or became a commonly adopted standard. I can see some sense to it, though.
But lists aren't just for looping. They often get looped over, but that's true for any collection, even dictionaries.
>It's also why there's list comprehension but no tuple comprehension.
That's because (x for x in y) was chosen for making generators. Just stick tuple in front and you've got a tuple comprehension.
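For instance, a quick illustration:

squares = tuple(x * x for x in range(5))   # generator expression handed to tuple()
print(squares)  # (0, 1, 4, 9, 16)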
>Tuples are immutable sequences, typically used to store collections of heterogeneous data (such as the 2-tuples produced by the enumerate() built-in). Tuples are also used for cases where an immutable sequence of homogeneous data is needed (such as allowing storage in a set or dict instance).
>Tuples are for heterogeneous data, list are for homogeneous data.
>Tuples are not read-only lists.
Immutable sequence is a use-case, but it's a secondary use-case. The primary use-case is for record-type data.
>But lists aren't just for looping. They often get looped over, but that's true for any collection, even dictionaries.
My point was more that tuples are not primarily suited for looping over, not that lists exclusively are.
>That's because (x for x in y) was chosen for making generators. Just stick tuple in front and you've got a tuple comprehension.
But list comprehensions were in the language for several years before generators came around. There was a time when [x for x in foo] worked but (x for x in foo) was a syntax error. Why didn't they make it a tuple at the time? Because that's not what tuples are for.
If you look at everywhere tuples show up in the standard library, it's almost always in some context where looping over the collection is not a primary consideration. The clearest example of this dichotomy is in the string methods: the str.split method gives you a list of strings, because the result can be of arbitrary length and you're likely going to loop over it. But str.partition gives a tuple because there are always exactly 3 elements in the result and you're meant to destructure it.
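A quick illustration of that contrast:

>>> "a,b,c".split(",")          # arbitrary length, loop over it -> list
['a', 'b', 'c']
>>> "key=value".partition("=")  # always exactly 3 parts, destructure it -> tuple
('key', '=', 'value')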
Note that the current official documentation (which you cite), unlike Guido’s 2003 email (which you also cite), explicitly supports tuples as both heterogeneous records and immutable homogeneous sequences.
They aren’t just immutable lists logically, but since Python (while it can use a static typechecker for verifying correctness to some extent) doesn’t use type information about what is stored in containers for storage implementation, effectively at runtime they are just immutable lists.
But, even logically, and apart from that implementation consideration, immutable lists are one of the things they are, are intended to be, and are designed to be used for.
> They're for heterogeneous collections and are meant to be destructured or otherwise have their elements acted on individually.
That is another use case, yes. The existence of that use case does not negate the immutable list use case.
> It looks different -- why?
tuple type hints have a form (that you point out) for homogeneous arbitrary-length immutable lists as well as one for fixed-length, potentially-heterogeneous ordered collections, because tuples are intended to serve both purposes, not just the latter.
> It's also why there's list comprehension but no tuple comprehension.
No, genexps using () and having broader utility is why there are no tuple comprehensions. Heck, if genexps were around first, we might not have list, set, or dictionary comprehensions, either.
Here's another bit of syntactic trivia - you can leave the brackets out entirely to (arguably) make it clearer that you don't care about the data structure, and it will implicitly use a tuple:
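Something like this (a minimal illustration):

>>> for x in 1, 2, 3:   # the bare literals form a tuple
...     print(x)
1
2
3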
>... This has the added benefit of sorting the values as well
I don't think you can rely on this? Happy to be proven wrong, but set does not guarantee iteration order. There may be a distinction if all elements are available at set construction, but that seems like a fiddly rule I would rather avoid.
You are confusing the two different uses of `in`. There's no reason to use a set for a temporary iteration container; but using a set is ideal and - in theory, for large n or expensive equality check - more performant as a temporary containment container:
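A minimal sketch of the containment case (the names here are made up for illustration):

ALLOWED_MODES = {"read", "write", "append"}   # built once; average O(1) membership checks

def is_allowed(mode: str) -> bool:
    return mode in ALLOWED_MODES              # `in` here means containment, not iteration

print(is_allowed("write"))   # True
print(is_allowed("delete"))  # False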
If you look at the disassembly, there's no difference in python 3 (I only tested 3.7+), so it's a matter of style.
There used to be a difference in python2 -- the list option would turn into n LOAD_CONSTs and a BUILD_LIST, whereas the tuple option would just emit one LOAD_CONST.
import dis

def with_tuple():
    for i in (1, 2, 3):
        print(i)

def with_list():
    for i in [1, 2, 3]:
        print(i)

print("with_tuple")
dis.dis(with_tuple)
print("with_list")
dis.dis(with_list)
Lists can be mutated and tuples can't, so when you want a default iterable parameter in a function, for example, you would want to use a tuple. I assume this particular rule is about trying to stay consistent with that practice, even if you only access the iterator of the list and not the list itself.
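A small illustration of why the tuple default is the safer habit (function names are made up):

def risky(items=[]):    # one shared list object is reused across every call
    items.append(1)
    return items

def safe(items=()):     # a tuple default can't be mutated in place
    return [*items, 1]

print(risky())  # [1]
print(risky())  # [1, 1] -- state leaked from the previous call
print(safe())   # [1]
print(safe())   # [1]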
I'm not sure I understand what value pathlib is providing here. The explanation is to avoid the context manager / multi-line.
Why though? What is saving you one line really getting? It's not really improving readability in any way.
An alternative one liner could be
contents = open("foo").read(100)
No need to import another library and use its features.
Added to which, pathlib has some speed issues. Pathlib has some really neat advantages and should be favoured over most of the os library's file handling, but not as a replacement for a simple open.
Python is reference counted and it is very clear in this code that a reference to the file isn't leaked, so the file will be reliably closed at the end of the statement.
Some people don't like to rely on the reference counter but I think in simple cases like this where only a temporary is created it is quite safe.
> it is very clear in this code that a reference to the file isn't leaked
Is it? What if an exception is thrown? See e.g. this SO answer[1]:
> However if an exception is thrown [...] then a reference to the file will be kept as part of the stack trace in the traceback object and the file will not be closed, at least until the next exception is thrown.
So your code, if called from somewhere else inside a try block, leaks a filehandle potentially indefinitely.
> the file will be reliably closed at the end of the statement.
The only guarantee the Python language makes is that the file may eventually be closed once all references are gone. The fact that CPython closes it immediately and reliably in this specific case is an implementation detail of its reference counting and should not be relied upon.
Use context managers when things need to be closed and save yourself trouble down the line.
I looked at the example for some time now, and I don't understand what exactly the problem is. Is it the missing `import pathlib` and then `pathlib.Path...`? Is using `read_text()` bad style?
I'm by no means a Python expert, but as far as I can tell, those two snippets do wildly different things. `.read(100)` would read 100 bytes from the file, while `.read_text()` takes the entire file and returns a decoded string. Running the former on a 100GB file will work OK; running the latter on the same file will lead to OOM (unless you actually have 100+ GB of free memory).
As I understand, the snippets in the suggestions are not supposed to be applied verbatim, but as generic proposals that need to be tweaked for the specific example (or ignored completely).
Note that it doesn't say to replace `.read(100)` with `.read_text()`; it uses the default version (without arguments) for both. (That being said, `read_text` doesn't have a limit argument, but that is a completely different issue.)
So you're right - if you're applying the suggestion verbatim without thinking, you will get incorrect results; but the suggestions were not intended for that.
The example shown on the Refurb GitHub page is a little unfortunate. As you say, there is also a potential scalability problem with the original design if the input file is too large to hold in memory in its entirety.
In the example, that issue could be avoided with minimal cost by reading the file lazily:
with open(filename) as f:
    for line in f:
        do_same_things_with(line)
The specific change Refurb suggests instead doesn’t have this benefit.
However, in general switching to a lazy design won’t always be as safe and easy as it is here, so I’m not sure what I’d ideally want a tool to say in this situation. You’d need much more sophisticated analysis of the original code’s design if you wanted to do something like warning about unnecessarily reading an entire file and suggesting the lazy alternative. It doesn’t seem like a good idea to warn about all uses of the functions that will read the whole file at once, because often with Python scripts that’s a simple and effective way to get the required job done, and the added complexity of a more scalable design isn’t needed or justified.
The open() variant reads the first 100 text characters, not 100 bytes, because open() works in “text mode” by default, so the two are not _that_ different. Your general point still stands, though: the memory issue is the main problem with the “suggestion” here.
Yes, I completely missed the `100` parameter to `read`. And as a sibling comment said, I also understand them to be suggestions, but in this case, I think `read()` and `read(param)` should be treated as two separate cases, or at least this caveat made clear.
Also, I don't know if this was part of your critique, but are you also mentioning that the "a" variable is not in the suggestion? The reason behind that is that long names, such as "super_special_file", would greatly increase the length of the error messages, making it hard to see the "meat" of what is being changed. Hence the x and y naming.
Perhaps I should add a flag to allow verbatim (or close to it) errors. Recreating the source lines exactly will be pretty hard, but I think we can do better here.
Perhaps substitute the actual variables if they are simple and short, otherwise do something like:
> main.py:1:1 [FURB101]: Use `y = pathlib.Path(x).read_text()` instead of `with open(x, ...) as f: y = f.read()`, where y=my_important_data and x=super_special_file
Btw: thanks, your tool suggested a handful of things I didn't know about, and pushed me towards using pathlib instead of relying on my muscle memory to do the laborious `os.path.join(..., ...)`.
One request I would have is to spell out the full name of suggested imports. I had never heard of `contextlib.suppress`. And in the above message, I think it will be obvious that you could do `from pathlib import Path` instead of the literal text in the suggestion.
I think this would be a good change. I agree that the imported names should be spelled out: You shouldn't have to go hunting down the docs before you go and change anything. Again, this would increase horizontal space, but I think it is for the better (the errors are already pretty long anyways).
I understand that some of these suggestions have been made possible by recent changes to Python 3. But how do I know whether a specific change suggestion will genuinely improve my codebase vs. just confuse Python developers who are used to seeing things implemented a certain way (e.g., pathlib vs open())? It would certainly make sense to add more information on the tradeoffs and what the long-term benefits of writing it a certain way will be.
I agree, better clarification/classification of errors would be nice, as well as the different tradeoffs you should expect from using a particular check. In particular with the Path() related checks, they can seem contrived when used in isolation (open() is a pretty common idiom, like you said), but they become really powerful when you chain them together:
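For example, a few pathlib calls composed together (the paths and names here are hypothetical, just to show the chaining):

from pathlib import Path

config = Path.home() / ".config" / "myapp" / "settings.toml"  # build paths with /
config.parent.mkdir(parents=True, exist_ok=True)              # ensure the directory exists
if not config.exists():
    config.write_text("[defaults]\n")                         # write in one call
backups = sorted(config.parent.glob("*.bak"))                 # query related files with the same API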
Similar to black, Refurb is opinionated. There are probably going to be checks which you will ignore. Somewhat of a side goal with Refurb is bringing to light some of the different ways of writing Python code, and the pathlib module is, IMHO, a very underutilized part of the stdlib.
I agree, pathlib is really powerful, and I can see myself using it for new developments.
May I then assume that Refurb is better suited to work on new code? To get back to open(), I don’t know if I would want to go through my company’s legacy code (which runs on Python 3.10) and start replacing battle-tested code everywhere. Same would go for tuple literals over lists in certain circumstances.
I think this will go really well with choosing an "acceptable" subset of Python for your project, similar to how some companies choose a subset of C++ and then stick with that throughout a project.
I think that using Refurb on new projects would be a great use case, but even for existing/old/legacy projects, Refurb can still be a good option, it really just depends on what is already there.
I agree that going and updating your company's codebase would probably be a bad idea, considering that there are some kinks here and there. Refurb tries its best, but it is very early on in its development, so errors that are emitted should be taken as suggestions, not gospel.
Have to check it out. I wish there was a reasoning behind the suggestions though. Why choose a tuple instead of a list? I won't learn much from just following along. I am also much more likely to change something when I know the reason.
Yes, there is a --explain flag for explaining the checks in more detail. I thought about adding a "use --explain ERR for more info" note at the bottom if there are 1 or more errors; it probably would be best to add this in.
Are the authors aware of Sourcery[^1]? I've been using it for a long time to clean up and modernize my Python codebases, and am wondering how Refurb can either supplant Sourcery or augment my usage of it?
I have never heard of Sourcery, but I will take a look at it. From what I can tell, Refurb is a CLI only tool (at least, for now), whereas Sourcery is more aimed at IDEs. Refurb should play along nicely with most other linters, though you may need to configure things first.
I got a good chuckle out of them having a PyCharm plugin. I'm not going to create an account just to kick the tires on that, but I'd be stunned if they out-magick PyCharm
I'm just getting into the python ecosystem in a professional capacity. How does refurb compare to the landscape of other python linters and static analyzers? Is it meant to be used in concert with others or to be more of a one stop shop?
It looks like the checks are in the category of standardizing on modern python syntax features, sort-of one step up from an autoformatter like yapf or black. Is this correct?
Yes, I would say that is correct! Like I mention at the very bottom of the README, Refurb is not meant to be an all-in-one linter or formatter: Tools like black, flake8, mypy, etc. already exist, so trying to recreate what they do would be a lot of work. Refurb is more of an addition on top of the tools I previously mentioned, and less of an altogether replacement.
Interesting, I have never heard of this project before! From looking at the README, Refurb is more focused on simplifying codebases and making them more expressive, whereas pyupgrade seems to have the goal of upgrading you from version A to version B (specifically Python 2 to 3). Near the bottom there are some newer Python version (3.6+) related upgrades, which I was considering adding to Refurb. I expect there to be some overlap between my project and some of the existing projects out there, so I will have to weigh the costs of adding these in vs leaving them out.
The reason I didn't just create a pylint/flake8/mypy plugin is because they all use a very specific methodology, whereas I wanted the freedom to architect it the way I wanted. Also, I really needed good type information, which is why I used mypy[0] as the AST parser/type checker.
I don't know what it takes to make a pylint plugin, but maybe it's cleaner to have a standalone `refurb` and then a `pylint-refurb` could be very simple?
I didn't know about read_text() either and I was even somewhat embarrassed when I looked it up in the docs and discovered that it (and its bytes/write counterparts) were added in 3.5 which is basically forever ago.
why replace open() with the Path().read_text()? the open() API transcends Python and is almost ubiquitous. that suggestion seems frivolous.
while we are talking modern Python, the type hints fever seems to be receding. such a good thing in my opinion. i expected this tool to bark about types but am glad the author didn't go that far.
Something that I forgot to mention in the other answers similar to this is that the Path().read_text() check only applies if the open() block is a single line, and that line is only used to read the contents of the file. If you are doing other things in that block, no error will be emitted. I agree that open() is pretty ubiquitous, but in certain cases (i.e., if I am already using a Path object), using read_text() is a nice one-liner.
read_text() is simpler and _much_ clearer (and thus, Pythonic) when you're doing trivial file operations like reading the contents into a variable. Opening up a whole context manager is still an option when you have to do more complicated stuff to the file.
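Roughly the split I'd draw, as an illustration ("notes.txt" is a hypothetical file):

from pathlib import Path

# Trivial read: one expression, nothing to close manually.
contents = Path("notes.txt").read_text()

# More involved, line-by-line work still reads naturally with a context manager.
with open("notes.txt") as f:
    for line in f:
        print(line.rstrip())   # stand-in for more involved per-line processing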