
The basic imperative Python version is much easier to remember and read though, even for not-that-experienced Python programmers. I would expect laypeople to be able to more-or-less figure out what it is supposed to do.

  seen = set()
  with open(filename, "r") as file:
    for line in file:
      if line not in seen:
        print(line, end="")  # line already ends with its own newline
        seen.add(line)
Often (at least in my experience) this kind of operation is either (a) part of some larger automated data processing pipeline for which it’s really nice to have version control, tests, ... or (b) part of some interactive data exploration by a programmer sitting at a repl somewhere, not just a one-off action we want to apply to one file from the command line.

In those contexts, the Python (or Ruby or Clojure or whatever general-purpose programming language) version is easy to type out more-or-less bug-free from memory, debug when it fails, slot into the rest of the project, modify as part of a team with varied experience, etc. etc.




One advantage is that

  seen.add(line)
can be changed to

  seen.add(hash(line))
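which can be significantly more memory-efficient for files with long lines, as long as the membership test changes to match (if hash(line) not in seen). The trade-off is that two different lines whose hashes collide would be treated as duplicates.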
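
A rough sketch of that change, wrapped in a generator so it can be tried on any iterable of lines (the name unique_lines and the sample list are made up for illustration, not part of the original snippet):

```python
def unique_lines(lines):
    """Yield each line the first time it appears, remembering only its hash."""
    seen = set()
    for line in lines:
        key = hash(line)  # a fixed-size int, however long the line is
        if key not in seen:  # the test must use the hash too, not the line
            seen.add(key)
            yield line


# Hypothetical sample input standing in for the lines of a file.
sample = ["a\n", "b\n", "a\n", "c\n", "b\n"]
result = list(unique_lines(sample))
```

An open file object can be passed in place of the sample list, since iterating over it yields lines in the same way.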


Or perhaps better: if needs change, the seen = set() object can be swapped out for any alternative object seen = foo that provides foo.__contains__ and foo.add methods.

This could involve saving previously seen lines in a radix tree, adding multiple layers of caching, saving infrequently seen lines to disk or over the network, etc. as appropriate for the use case.
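
To make the swap concrete, here is a minimal sketch of one such replacement object. The class name DigestSet, the function name dedupe, and the choice of a 16-byte BLAKE2b digest are all assumptions for illustration; a radix tree or a disk-backed store would plug in the same way, as long as it supports `in` and `.add()`.

```python
import hashlib


class DigestSet:
    """Hypothetical set-like object that keeps a 16-byte digest per line."""

    def __init__(self):
        self._digests = set()

    def _digest(self, line):
        # Fixed-size fingerprint of the line's UTF-8 bytes.
        return hashlib.blake2b(line.encode("utf-8"), digest_size=16).digest()

    def __contains__(self, line):
        return self._digest(line) in self._digests

    def add(self, line):
        self._digests.add(self._digest(line))


def dedupe(lines, seen):
    """Yield first occurrences; `seen` only needs `in` support and .add()."""
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


# Hypothetical sample input; the loop is identical for both containers.
sample = ["x\n", "y\n", "x\n"]
with_set = list(dedupe(sample, set()))
with_digests = list(dedupe(sample, DigestSet()))
```

Passing a plain set() reproduces the original behavior, so the loop itself never has to change when the storage strategy does.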



