Python displaced a lot of very expensive proprietary software in the biosciences. Ease of use was a major factor, since many bioscientists have relatively little programming background, but the ability to escape expensive, restrictive software licenses was also very attractive to a scientific community whose historical norms emphasize the open sharing of methods and results:
> "A program that performs a useful task can (and, arguably, should) be distributed to other scientists, who can then integrate it with their own code. Free software licenses facilitate this type of collaboration, and explicitly encourage individuals to enhance and share their programs. This flexibility and ease of collaborating allows scientists to develop software relatively quickly, so they can spend more time integrating and mining, rather than simply processing, their data."
Now there isn't any area of molecular biology and biochemistry that doesn't have a host of Python libraries available to assist researchers, with tasks ranging from designing PCR strategies and searching for nearest matches up to X-ray crystallography of proteins.
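For readers who haven't touched this ecosystem, here is a hedged sketch (not from the original post, primer sequence invented) of the kind of routine task meant, assuming Biopython is installed:

```python
from Bio.Seq import Seq

primer = Seq("ATGGCCATTGTAATGGGCCGC")          # hypothetical primer sequence
print(primer.reverse_complement())             # antisense strand, reversed
gc = (primer.count("G") + primer.count("C")) / len(primer)
print(f"GC fraction: {gc:.2f}")                # crude GC content check for primer design
```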
C, C++, and Fortran are still used; most Python users just don't see them because they're hidden away underneath the calling function.
I've been surprised by the rise of Python in some ways although not at all in others. Languages like C, C++, Fortran, and dare I say it Rust are too low-level in their raw state for numerical computing. You had the US federal government funding language competitions because of this (see: Chapel). Languages like Python and R (and before that things like Lisp) came along and gave people a taste of something different, and it's obvious what people migrated to.
Part of it is timing: multivariate computational statistics (ML/data science/DL/whatever you want to call it) just sort of started taking off in computer science communities before LLVM-based languages like Julia or Nim could get a foothold. OCaml might have fit that niche but never got there because of a desire to take a different path, or take the path more slowly.
So people looked for a nice expressive language, found it in Python, and buried all the messy stuff behind wrapper functions and called it a day. It was furthered along by Matlab being another comparison on the other side -- Python looks kludgy compared to modern Fortran or C, but not compared to Matlab.
All that wrapper time in Python has its costs, so I suspect as limits get pushed further we'll eventually see a migration to something else like Julia or Nim, or something else not on anyone's radar.
One moral to this story is that expressiveness matters. People will go out of their way to avoid talking directly to machines at a low level.
> People will go out of their way to avoid talking directly to machines at a low level
I would put it differently. At 30 bugs per kLOC, I'd prefer my codebase express a problem and its solution, and as little below that level as possible.
Each well-vetted layer of abstraction between a scientific programmer and the machine's low level interface eliminates whole classes of bugs that are irrelevant to the problem that user is actually working on.
> but rather the discovery that the ratio is pretty stable
The thing is, it isn't stable. It just doesn't depend on the language, which is very surprising. But it varies enormously from one study to another and, AFAIK, nobody has a good set of factors explaining it.
I don't find it that surprising. I think what programming languages (and styles) do is fill up each line of code with information until a roughly constant level of cognitive effort is required to process that line.
At that constant level of effort, we make a certain constant number of mistakes. And that's what I think these studies show.
Some languages are very dense, others break things down in more lines. Some languages care about hard to control details of your computer's working, others handle that automatically. Some languages come with builtin validators, others let you write any kind of trash and try to make sense of it.
Personally, I suspect the number of bugs per line is defined by social and psychological factors, and what changes from one language to the other is the amount of effort one has to put into testing and debugging. But well, none of this is obvious to me.
> Python looks kludgy compared to modern Fortran or C
I’m not sure I can agree with this. Both Python and Matlab provide very nice, high level ways to interact with multidimensional data using simple syntax. Under the hood, both will wind up using fast algorithms to implement the operations. C and Fortran require much more low-level considerations like manually managing memory, futzing with pointers or indices, and generally writing a lot more boilerplate code to shuffle data around.
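A minimal sketch (assuming NumPy) of the high-level array style being contrasted here; the point is that scaling, transposition, and matrix products need no explicit loops or index bookkeeping:

```python
import numpy as np

a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)

# whole-array arithmetic: no manual loops, pointers, or bounds checks
c = 2.0 * a + b.T @ b        # scale, transpose, and matrix-multiply in one line
row_means = c.mean(axis=1)   # reduce along an axis without writing the loop yourself
```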
Matlab, despite all its quirks, could probably have won if it was open source. It’s got a very long history of use in scientific computation and a large user base despite its high price.
Matlab works fine for anything purely "numerical" but fails hard as soon as you need to do more "general computing". Just string handling for example. Or, as far as I know, it's still not possible to implement a custom CLI interface in a matlab script, like you would with argparse in python.
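For the record, this is roughly what the argparse comparison refers to; a hedged sketch with made-up script and option names:

```python
import argparse

parser = argparse.ArgumentParser(description="Process a measurement file")
parser.add_argument("input", help="path to the input data file")          # positional argument
parser.add_argument("--threshold", type=float, default=0.5,
                    help="detection threshold")                           # optional flag
args = parser.parse_args()
print(args.input, args.threshold)
```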
Matlab also historically was really bad for abstraction and code architecture in general. For example, the hard "1 function per file" rule, which encouraged people to not use functions at all, or if you really had to, write 2 or 3 really huge functions (in separate files). Only in recent years (the past 5 or 10 years) did matlab get OOP stuff (classes) and the option for multiple (private) functions in a single script file (still only one public/exported function is possible per file, because the file name is the function name and matlab uses path-based resolution).
Fortran does not require (nor has much available for) manual memory management, and its array syntax is more convenient than Numpy (and far more convenient than Python without Numpy), obviating any futzing around with pointers or indices.
Fortran is definitely much more high-level than C, and it is way easier to write performant numerics in Fortran than in the C family of languages.
I don’t follow why its array syntax is easier than python though. They’re mostly very similar and the numpy developers seem to come from a Fortran background.
You may know this, but since you mentioned Nim & Julia together, it might confuse passers-by. Nim does not, in fact, need LLVM (though there is a hobby side project using that). Mainline Nim compiles directly to C (or C++ or Javascript) and people even use it on embedded systems.
What seems to attract scientists is the REPL and/or notebook UI style/focus of Matlab/Mathematica/Python/Julia/R/... As projects migrate from exploratory to production, optimizing for interactivity becomes a burden -- whether it is Julia Time To First Plots or dynamic typing causing performance and stability/correctness problems in Python code or even just more careful unit tests. They are just very different mindsets - "show me an answer pronto" vs. "more care".
"Gradually typed" systems like Cython or Common Lisp's `declare` can sometimes ease the transition, but often it's a lot of work to move code from everything-is-a-generic-object to articulated types, and often exploratory code written by scientists is...really rough proof of concept stuff.
The time to first plots in Julia is drastically lower now. And still, it was something you only paid once per session, due to JIT.
Julia is the first language I find truly pleasant to use in this domain. I am more than happy to pay a small initial JIT overhead in exchange for code that looks like Ruby but runs 1/2 the speed of decent C++.
Plus, lots of libraries are really high quality and composable. Python has exceptionally good libraries, but they tend to be big monoliths. This makes me feel Julia or something like Julia will win in the long run.
Sorry I meant 1/2 the speed or 2x the time, edited :)
Consider that BLAS written in pure Julia has very decent performance. If you are into numerical computing, you will quickly understand this is crazy.
Carefully written Julia tends to be surprisingly fast. Excessive allocations tend to be a bigger performance problem than raw speed. Of course excessive allocations eventually have an impact on speed as well. There are some idiomatic ways to avoid this.
Having taught a number of scientists both pre and post grad, I agree with your take on notebooks/REPLs. Data-scientists are not generalist programmers, in some cases, they are hardly more advanced than some plain end-users of operating systems. They shy away from the terminal, they have fuzzy mental models of how the machine operates.
Being a generalist programmer that sometimes deploys the work that data-scientists craft, I'd really like an environment for this that can compile to a static binary.
Having to compile a whole machine with all the right versions of shared libraries is a terrible experience.
That's a good point about Nim. Nim has a nice set of compilation targets, which I tend to forget.
You might be right about the REPL aspect of things. On the other hand, R took off with a pretty minimal REPL, and my first memories of Python didn't involve a REPL. I think as the runtime increases a REPL becomes less relevant, and it seems like most languages with significant numerical use eventually get a REPL/notebook style environment even if it wasn't there initially.
R had a REPL from day one (or at least near it) because the S it was copying did. You could save your "workspace" or "session" and so on. Just because it was spartan compared to Jupyter or just because that might be spartan compared to MathWorks' GUI for Matlab doesn't alter "waiting/Attention Deficit Disorder (ADD)" aspects.
When you are being exploratory even waiting a half second to a few seconds for a build is enough time for many brains to forget aspects/drift from why they pressed ENTER. When you are being careful, it is an acceptable cost for longer term correctness/stability/performance/readability by others. It's the transition from "write once, never think about it again" to "write for posterity, including maybe just oneself"..between "one-liners" and "formatted code". There are many ways to express it, but it underwrites most of the important "contextual optimizations" for users of all these software ecosystems - not just "speed/memory" optimization, but what they have to type/enter/do. It's only technical debt if you keep using it and often you don't know if/when that might happen. Otherwise it's more like "free money".
These mental modes are different enough that linked articles elsewhere here talk about typeA vs typeB data science. The very same person can be in either mode and context switch, but as with anything some people are better at/prefer one vs. the other mode. The population at large is bimodal enough (pun intended) that "hiring" often has no role for someone who can both do high level/science-y stuff and their own low-level support code. I once mentioned this to Travis Oliphant at a lunch and his response was "Yeah..Two different skill sets". It's just a person in the valley between the two modes (or with coverage of both, or able to switch "more easily" or at all). This is only one of many such valleys, but it's the relevant one for this thread. People in general are drawn away by modes and exemplars and that represents a big portion of "oversimplification in the wild".
This separation is new-ish. At the dawn of computing in the 50s..70s when FORTRAN ruled, to do scientific programming you had to learn to context switch or just be in the low-level work mode. Then computers got a million times faster and it became easier to have specialized roles/exploit more talent and build up ecosystems around that specialization.
FWIW, there was no single cause for Python adoption. I watched it languish through all of the 90s, largely viewed as too risky/illegitimate. Then in the early noughties a bunch of things happened all at once - Google blessing it right as Google itself took off, numpy/f2py/Pyrex/Cython (uniting rather than dividing, unlike the py2/py3 split that came soon after), a critical mass of libs - not only scipy, but Mercurial etc., latter-day deep learning toolkits like tensorflow/pytorch and the surrounding neural net hype, and, compared to Matlab/etc., generally low cost and simplicity of integration (command, string, file, network, etc. handling as well as graphics output) - right up until dependency graphs "got hard" (which they are now), driving Docker as a near necessity. These all kind of fed off each other in spite of many deep problems/shortcuts with CPython design that will cause trouble forever. So, today Python is a mess and getting worse, which is why libs will stay monoliths as the easiest human way to fight the chaos energy.
Nim is not perfect, either. For a practicing scientist, there is probably not yet enough "this is already done for me with usage on StackOverflow as a one-liner", but the science ecosystem is growing [1], and you can call in/out of Python/R. I mean, research statisticians still tell you that you need R since there is not enough in even Python...All software sucks. Some does suck less, though. I think Nim sucks less, but you should form your own opinions. [2]
It’s because Matlab (and Mathematica, etc) is proprietary, and therefore you always have to pay the Danegeld. So we use numpy instead because it’s extensible, it uses all the super fast C/C++/FORTRAN stuff on the backend, and is fairly easy to learn.
I actually still would prefer Matlab as the syntax is more compact and natural than numpy (which is like a matlabified Python), but that’s probably just due to more experience in Matlab.
Octave is free software with Matlab syntax and Matlab-style interactivity (autoreload, etc.) I'm not a huge fan of the language (Matlab/Octave) but it certainly does make it quick to whip things up.
It sucks compared to Matlab, though. Unfortunately. (Scilab is better, although not compatible.) But I have also used it in a pinch. Size of the community means Matlab or Numpy are your best options. If you aren't happy with Matlab due to cost or licensing stuff, numpy is really good. Also integrates with a lot of Python stuff like machine vision, machine learning, etc, which have expensive or nonexistent packages in Matlab.
I used Octave for a year when my institution's Matlab license servers were being improperly administered. (I had a lot of project code written in Matlab, but the license server going down on the weekend before a conference deadline with nobody available to reboot it until Monday was a dealbreaker.) The biggest stumbling block was that Matlab has a huge and heavily used proprietary package library, and a lot of my existing code, (official) tutorials and Stack Overflow code assumed these libraries were available. In Octave I found myself reimplementing the newer parts of the Matlab image processing libraries. This led to the discovery that the Matlab and Octave builtins for handling image data are subtly different, so I ended up having to run tests in the code and write different conditional flows to make it cross-compatible. There are also subtle differences in basic behaviors (was it variable scoping? file handling?) which resulted in some surprise and frustration.
Following the licensing and Octave debacle, all my latest code is written in numpy.
Yep. When I was in grad school all the labs were furiously migrating away from matlab because of its costs and confusing licensing around running multiple replicas.
I'd definitely recommend checking out Julia for this usecase. You get code that looks pretty much like matlab, but which runs like fortran/C++. (Also there is very solid and fast interop with python, so you can call anything you need from the python side).
What does a Julia environment look like, in practice? Is it anything like the Matlab environment, where not only is there a console and integrated editor and super easy to use debugging/performance measurement, but also all the variables are visible in the GUI?
If so, I'd consider switching (as Matlab does that better than vanilla numpy). Julia is pretty great in theory. It is still a very new language for my uses, which means the documentation and community are orders of magnitude smaller than Matlab or Numpy/python.
Yes(-ish). You can use Julia in a Jupyter (the Ju- is Julia) notebook, just like Python. This is a pretty user-friendly experience for students, academics and data scientists.
Jupyter notebook is a lot like Mathematica. I’m wondering if visibility to variables is similar to Matlab? With Matlab, I have a list of all the variables in a nice little box, with summary of their contents (byte size, dimensions, type, contents if it’s small enough to be displayed, etc), and to see the full contents, I just double click on it.
Looks like Jupyter Lab and Atom/Juno is what I’m looking for. Still not as well-integrated as Matlab is.
I suppose that’s my attraction to Matlab. There’s not a bunch of different programs/environments to juggle to get a good, consistent experience for rapidly developing scientific code (for simulation, modeling, etc). All still possible.
I have the same experience, but it's more than just syntax. The Matlab IDE pulls together so much in a polished and robust product. Python notebooks and IDEs (Spyder, Jupyter, PyCharm, VSCode among a few others I've tried) are frustrating to use in comparison.
Yup, I agree 100%. I've been trying to use just vanilla python because of interdependency hell (and changing terms of service for anaconda), and I've been succeeding, but it's a LOT more work and less clear what's going on.
Python was pragmatic and adopted changes that numpy needed and advocated for. Maybe Julia is the only other worthy comparison?
Also, dynamic typing is a boon - and default & keyword arguments are a great feature for complicated, versatile, useful algorithm implementations and interfaces to them. Both of these features have a cost in bigger programs, but they really make Python stand out.
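A hedged sketch of the kind of interface meant here (the function and parameter names are invented): defaults let casual callers ignore the knobs, while keyword arguments let experts override only what matters.

```python
def fit_curve(x, y, model="linear", max_iter=1000, tol=1e-8, verbose=False):
    """Fit y = f(x); most callers only ever pass x and y."""
    if verbose:
        print(f"fitting {model} model, max_iter={max_iter}, tol={tol}")
    # ... the actual fitting code would go here ...
    return model

fit_curve([0, 1, 2], [0, 1, 4])                                  # simple call
fit_curve([0, 1, 2], [0, 1, 4], model="quadratic", tol=1e-10)    # expert call
```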
> C, C++, Fortran are still used, most Python users just don't see it because it's hidden away underneath the calling function.
Yes, the article talks about this: Python is a glue language and the actual heavy duty computation is being done inside an extension module like numpy that's written in a faster language.
A single-threaded, non-vectorized for loop in C++ runs faster than calling BLAS from Python with Numpy, though. The Python glue makes everything slow. For example, if you read out the camera in Python using OpenCV, you'll get a way lower framerate than if you do the same from C++, even if in both cases you use OpenCV, which is C++.
As a physicist, having spent eight years in academia, Python did not win by beating Fortran. Nor did it beat C++. It didn't really compete with Ruby or Lisp, although Lua (Torch) was a briefly serious competitor before everyone realized that a language developed by four people, one of whom doesn't get along with the others, couldn't be responsive to users' needs.
Python defeated Matlab. I know because I cheered it on. I was there. I watched my roommates and friends struggle with introductory scientific computing in Matlab and I joined the chorus that was practically begging for Python, even though I didn't really like it. I can't even begin to explain how awful it is to try to teach programming concepts in Matlab. But something like Python or Matlab had to be the choice because the schools wanted to teach programming through a language where you could just call "graph" and the computer would display a graph.
Python's team, unlike Lua's, aggressively courted educational institutions by offering scientific, numerical and graphical libraries within a programming language that works like a programming language, not a glorified computer algebra system. They even added a dedicated operator for matrix multiplication. It's a great example of finding a niche and filling it: I still don't like using Python, but I can't dispute that no other language/ecosystem comes close to offering what we need to teach programming to physics students.
You want to beat Python? Build a type system that can capture dimensional analysis. Warning: it won't be easy.
I'm in engineeering at a major engineering company historically using simulink and matlab. Python took over here in large part because matlab licensing caused so much friction, and we wanted to scale the simulink and matlab models up to run on a cluster of machines. We wanted to give scripts to people without matlab licenses quickly. etc. It was not the cost per-se, but the red tape.
We also ditched simulink because it is very difficult to version control and collaborate with a graphical interface.
Matlab is pushed heavily in the schools so all the engineers knew it and were comfortable with it. Matplotlib and numpy mimicing matlab very closely allowed the transition to be easy. We're not looking back. Only a handful of people still use matlab for their individual work because the python camp hit critical mass and the transition is not hard.
Matlab working to control serial ports, ethernet, visa/gpib instruments, all without the friction of getting extra licenses was icing on the cake. Matlab has a buy the cadillac model: the wheels, doors, hood, gas cap, mirrors are all optional add-ons. Each point causes friction, as only a few people had the whole tool, and therefore nobody could reliably share code.
> You want to beat Python? Build a type system that can capture dimensional analysis. Warning: it won't be easy.
Curious about your thoughts on pint and Unitful.jl — pint doesn’t really go all the way to a full type system, and Unitful.jl doesn’t work with everything (autograd is a problem still I think). But Unitful.jl is super cool.
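For anyone curious, a hedged sketch of pint's basic usage (this only shows the units bookkeeping, not the autograd interaction mentioned above):

```python
import pint

ureg = pint.UnitRegistry()
distance = 3.0 * ureg.meter
duration = 1.5 * ureg.second
speed = distance / duration                      # quantity carries units: meter / second
print(speed.to(ureg.kilometer / ureg.hour))

# distance + duration  # would raise pint.DimensionalityError: incompatible dimensions
```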
This is the answer. Scientific Python was originally an alternative to MATLAB. When I was in grad school, I did most of my research in MATLAB. Then we had a visiting student who was doing very similar computations in SciPy, and he assured me performance was not a problem. I migrated my MATLAB scripts to Python and never looked back.
It was only after being a viable alternative to MATLAB did people decide it can be used for much more than what you typically get with MATLAB.
I think a factor in Python vs Matlab is that Python grew into areas where Matlab was not entrenched. Also, students with an aptitude for programming and an eye for the market want to learn languages that are used by software developers. Very few engineers actually want to program in Matlab. If they can program, then they want to market themselves as programmers.
A benefit of Matlab remains that it all comes from one place, with one installer, meaning that you can get a classroom full of students up and running almost instantly. And it offers some relief for students who will never grasp programming, through its collection of pre-written apps.
> where you could just call "graph" and the computer would display a graph.
Hang on! In what world can you just call "graph" in Python and it would display a graph?
In matplotlib on MacOS at least you try that and you get some bizarre shit about how Python isn't a framework, and you google it and find you have to do some obscure import and the import has to be in a particular order relative to other imports (totally unpythonic). https://stackoverflow.com/a/34583958/583763
Jupyter notebook... don't get me started! You do one thing and it starts a "server", and then you use that to start a "kernel" (and if my CS is dodgy and I don't really know wth these things are then I'm not having a great time already). Then this kernel thing is running Python. But oh, what version? And is it using my virtualenv? And then you google some matplotlib imports. And finally, yes you call "graph" and an ugly matplotlib png is displayed rather small in your web browser.
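For reference, this is roughly the minimal incantation being complained about; the explicit backend selection (hedged, and only sometimes needed) is the macOS-specific part:

```python
import matplotlib
matplotlib.use("TkAgg")              # explicit backend; a common workaround on macOS setups
import matplotlib.pyplot as plt      # imported after choosing the backend (matters on older versions)

plt.plot([0, 1, 2, 3], [0, 1, 4, 9])
plt.show()                           # pops up a window; in Jupyter you rely on inline output instead
```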
As a physicist who spent a decade in academia, including a PhD where all the new work was done in Python, it absolutely won in some fields by beating - or rather, by conveniently wrapping - Fortran.
(In particular, that’s how things have gone in the materials physics/solid-state/quantum chemistry field. It absolutely beat out Matlab in other fields. One of the underrated benefits was being a lingua franca across more of physics!)
Always nice to hear an authentic telling of history from someone who was there and had the necessary insight to interpret events and motivations. So much of what we read is "the victors' written revision".
It’s amazing how often the authors point of “agility” arises in real world circumstances. I’m not a programmer, but I use Python a lot in my engineering job. There have been 3 times in the past month where I got an order of magnitude speed up because SciPy implements a very complex but highly efficient algorithm which I would never have had time to deploy.
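A hedged example of the sort of thing meant (not the poster's actual problem): scipy ships a fast k-d tree, so nearest-neighbour lookups that would be slow and tricky to hand-roll become a couple of lines.

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(100_000, 3)              # reference point cloud
queries = np.random.rand(10, 3)

tree = cKDTree(points)                           # build once
dist, idx = tree.query(queries, k=1)             # nearest reference point for each query
```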
> There have been 3 times in the past month where I got an order of magnitude speed up because SciPy implements a very complex but highly efficient algorithm which I would never have had time to deploy.
Yes. I feel like the author conflates the language with the package ecosystem. Pure Python is pretty horrible for scientific computing (3*[3]=[3,3,3] is about as counterproductive to scientific computations as it gets), but Numpy changes the semantics of those operations.
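The semantic difference being described, in two lines:

```python
import numpy as np

print(3 * [3])             # plain Python list: repetition -> [3, 3, 3]
print(3 * np.array([3]))   # NumPy array: elementwise arithmetic -> [9]
```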
In other words, Python has an absolutely stellar package ecosystem. There have been attempts to bring a package ecosystem to C, but it never took off. However, I do wonder how C would fare if it had.
This implies Python's advantage is not having a package manager, but better teachers, or at least teaching better practices, so it isn't even language related.
If you know actual scientists, this isn't counter intuitive at all. My partner is a scientist, so now I know tons of them, and I have done a bunch of Python coding and support for scientists, have been a Python programmer (as well as other languages) since 2005-ish. I saw this coming (as did many) 15 years ago.
Most scientists, and their grad students, are trying to do a whole bunch of things in their research, and programming is just one of them. Field work, experiments, data wrangling, writing papers, defending papers, teaching, etc. And most of them do not have access to budgets for programmers or when they do, it's for a limited amount of time and work, meaning they need to be able to pick up and run with whatever the programmer did. So the fact that with Python they and their grad students (who might be there for only 2 years) can be working productively, and figure out what the hell the code did when they come back to it months later, is HUGE. As in, literally blows every other consideration to smithereens. This has meant that over the last 20 years the scientific libraries in Python got mature faster than in any other language, and this in turn has had a snowball effect. And when speed is necessary, C++ extensions can be written. But honestly, most of the time speed is not the main factor.
The downside of Python in my experience is that junior teams can make heinous atrocities when a project gets really big (I have had to step in as CTO to one of those messes, so much as I love Python, I must admit this is true!) But the stuff the scientists are doing is very rarely that big. It's tools programming, scripting, making utilities, data analysis and so on.
Readability counts. In some fields, it counts more than anything. I've worked in about 10 languages now over the last 20 years, and Python is still the easiest to read when you come back to some old code or have to pick up code for a small job, or hand it to a beginner to extend without having them create an unreadable mess. This is what scientists need to do all the time.
Re other people's comments on Python packaging and setup being hard, well honestly I've had just as much pain with Ruby or Node. The shining exception there is R, which is giving Python a run for its money in many scientific areas. R Studio has the best "hit the ground running" experience out there and is really slick for data programming.
In addition to not having budgets for programmers, we also don't know how to manage them, for instance how to communicate our needs, decide if their implementation plans make sense, or gauge their progress. Nearly half a century after The Mythical Man Month, managing software development is still generally acknowledged to be an unsolved problem.
The other two obstacles are that most programmers hate the scientific work environment, with its ever-changing requirements and frequent dead ends. And, the programmers who can work on math related stuff are in the highest demand.
Spot on with my experience! Much of our work was helping them manage the project and figure out how to work with us. And someone went on sabbatical, and then someone dropped their program, and someone else left for another school, and someone was stuck managing the program for a semester who had literally no time or experience doing that, etc. It's a Dynamic Environment. lol.
There is no other language I have used that makes it as easy to read code from somebody else, especially where that contributor is likely to be a domain expert with very limited programming experience. It's not actually my favourite language anymore (hello Scheme!) but if you want me to do work in that environment, I'll reach for Python first.
Network with scientists. Doing some small jobs or favours for scientists who will tell other scientists about you is the way to go. Universities are a good source of connections.
If labs are really struggling to find math-literate programmers, I would imagine it's in part because the process for matching them with the work is so terrible. Generally speaking, skilled programmers do not want to (and certainly don't have to) shake hands and do favors to find work.
I wonder if there's any concerted effort to fix that for academia, or if the "shortage" of math-literate programmers just isn't a problem worth fixing.
That was kind of my point. What skilled programmers do to find work is one thing, what scientists, who just need some help for a short project, do to find people is another. Assuming you are a programmer who wants to do work for scientists for some reason, you need to go where they are - they won't find you in your regular tech recruiting circles, which tend to be all about full time jobs. I happen to like doing some work for scientists so that my career isn't entirely about making private equity companies richer, but I don't expect them to pay my enterprise rates or find me on Linked In.
To make matters worse, university staff software engineering jobs usually pay 1/3 to 1/2 of comparable jobs in industry (even after excluding FAANG-level outlier salaries), and in most cases offer no meaningful career progression.
I think universities will never be able to compete for engineering talent until they can create attractive career paths for people who aren't professors.
Universities will never pay competitive salaries, because academic research is not supposed to create direct monetary value for the employer. An engineer does not create enough value in academia to justify anything approaching a competitive industry salary.
It's also ethically difficult to advocate for higher salaries in academia if you are already living a comfortable middle-class life. The money would ultimately come from taxes and tuition fees. If you think those should be increased, the money would be better spent on helping your colleagues who are earning poverty-level wages.
Engineering is a support role in academia, because pure engineers don't teach or set research directions. Most labs and most departments are too small to employ more than a handful of engineers, if any. Only large research institutes have enough engineers working on similar topics to justify creating senior engineering roles.
Anybody who wants to do this has to be willing to step off the gravy train. That sounds snarky, but it just reflects the Hard Problem of a skill that's of value in two sectors with vastly different economics.
There are people who have stepped off the gravy train because they don't like it, or they don't fit into the enterprise workplace for whatever reason. I might be one of those people. I work in industry, but in an early-stage R&D team.
Maybe the status quo is a reasonable solution: Find grad students who are willing to do the work in return for a chance to sharpen their programming skills. This process could be improved by providing scientists with training on how to write better code. The result will be a certain amount of attrition of scientists into software development jobs, but we have to get used to the idea that attrition into a more employable field is actually a good thing, and there will be plenty of scientists.
>I've personally done this by finding normal work that is part time, so I can round it out doing work for scientists
That's what I did for the first 10 years after graduating from university. Eventually I transitioned to a full time 'normal' job but that made me unhappy.
Yep, can confirm. Ended up doing some physics simulation work for my PhD (as a computer engineering major), my advisor constantly emphasizes that I focus on picking up the math so he doesn't have to put as much effort into explaining exactly what he wants.
It's pretty fun to do for me, but it's certainly challenging to balance with programming despite my advantage of having taken a lot of extra math classes.
> Python is still the easiest to read when you come back to some old code
Lucky you. You must not have seen the "pythonic" monstrosities I've seen.
Python has such a low barrier for entry that one can "get stuff done" with absolutely atrocious and often very overly complicated OOP-ish code.
Ruby is not my favorite language, but I would bet real money that without dependence on libraries, nobody could show me Python code for which I could not show more logical, consistent, and readable Ruby code that solves the same problem. I say Ruby because it's of the same "type" and follows similar methodologies.
Python suffers from far too many years under the leadership of one odd person. It has a cult-like following, whereby anyone who disagrees is an outcast. Where else could you hear comments like, "why would you ever need a switch statement? if/if else works fine!" That's just the tip of the iceberg.
Python is great for integration glue code, but only because of the libraries it has. But now it is becoming more Javascript like, and the dependencies are multiplying to the point where you're better off writing your own left-pad instead (or even re-evaluating your approach) instead of taking on new duct tape like django-database-view.
Sometimes the bar needs to be high enough to force the juniors to actually learn something before they start building "MVP" startups. On the other hand, who cares if the MVP is a horror show as long as you get that IPO and take your f-u money and leave.
So my real job is technical due diligence on companies being purchased. I get the keys to the kingdom when we do a diligence and trust me, there are just as many people making unmaintainable monstrosities that get bogged down in tech debt in Ruby. Looking at this scenario is literally my job, and the company I work for does more of these than anyone in the world.
Bad coders can make terrible stuff in any language, and with two as similar as Python and Ruby, the minor differences are a drop in the bucket in the grand scheme of things. Both Django-database code and RoR's Active Record have bogged down many a startup when they got big enough that DB size and query performance mattered.
None of which, as I pointed out, is relevant to the vast majority of scientists writing code.
In your experience, what are most buyers looking for when they get you to do technical DD? Is there a specific set of things they are worried about? Specifically looking to confirm, etc?
Typically they are looking to us to surface areas where they will have to make "disproportionate investment" (as they say) to allow the company to support a ramp up in growth. "What will be a problem when you have 2x as many customers? 10x as many DB entries? 10x as many customers?" etc. Because private equity funds buy companies that are growing and (usually) already profitable, this often equates to tech debt that happens once the DB is big. A very common scenario is that reporting, for example, has become a problem and it's time for the target (as we call them) to be using heavier weight architecture patterns like command query segregation or dedicated reporting databases or materialized views and such. So in the case of Ruby and Python shops, we will definitely be asking if their domain layer is working ok and trying to find out if they've written themselves into a corner by having the code assuming it will always be an RoR app or whatever. I have interviewed more than a few that were in serious trouble from not isolating their Active Record dependency and thus got themselves in a situation where efficiently fixing the database was going to be a lot of rewriting. We see this in other languages too, but Active Record is absolutely a smoking gun there.
The takeaway: always, always, have a domain layer that allows you to refactor your model access without changing tons of code. Data load grows in ways people don't predict. If your company succeeds, in five years you'll wish you had it!
Tech debt kills loads and loads of companies and most folks never hear about it because that's not the stuff that gets publicized or written about. We call it the silent killer...
That's really reassuring to hear. Whenever I say we should spend time on tech debt it's always greeted with "we can worry about that later when we're really successful" as if there will suddenly be an opportunity to completely rewrite everything (because that's what it will take).
Yeah, the idea that tech debt doesn't matter and can just be fixed later is the biggest bullshit myth out there in tech land. The thing is, no one notices or writes about when a startup does an "underwater sale", which is typically an investment firm's portfolio company (company they already own) buying up a competitor for less than the company was worth on the last round, done usually in order to buy the customers, staff, or IP. It happens tons. It's a "cut bait" scenario (i.e. let's lose less now instead of everything later) for the selling company/owners and is usually a result of technical debt.
> Ruby is not my favorite language, but I would bet real money that without dependence on libraries, nobody could show me Python code for which I could not show more logical, consistent, and readable Ruby code that solves the same problem. I say Ruby because it's of the same "type" and follows similar methodologies.
I keep hearing all the stories about Ruby supposedly being more logical and sound than Python. I really would love to see actual source code being cited to back those claims.
- The OOP features of Ruby are consistent and ubiquitous (everything in Ruby is an object); Python depends on manual patterns to do OOP (self as the required first argument). Python also depends on special decorators to indicate what functions are instance vs static. Ruby does have a difference in definition, but it's simpler and more obvious (and requires one line fewer of code to define)
- Only recently has Python finally gotten a switch statement, and surprisingly it has adopted some Elixir-ish pattern matching features (see the sketch after this list). Incidentally, some in the Python community are strongly against this new thing. "Why would you need that!?" Prior to 3.10, you would need more complicated if/elif structures in Python to do the same thing you could do in a concise and clear Ruby case (switch).
- Operations on collections: this is often described as functional programming, but it really is just "doing stuff on collections of data". And in that story, Python's list comprehensions are arguably less readable and less logical than Ruby's. Many of the tools you need in Python must be explicitly imported from the functools module.
- Ternary operator: many languages have `expression ? do_true_path : do_false_path`. It's a very common pattern which is concise and honestly quite clear. "This thing is true? then do this; else do that". But in Python you break that up into "do_true_path if expression else do_false_path".
- Everything in Ruby has a return value, but not so in Python. So in Ruby you can make assignments (or return values) from the result of if/case. For example, assume you want to return a specific value based on some series of conditions, such as handling an error and returning some enriched data based on the error code. In Python you will have to define a local variable and explicitly set that variable equal to some value in each branch of the conditional. Then afterward, you can use the value of the local variable. Or you would have multiple returns, one in each conditional branch. In Ruby you can simply do x = case ..., or because the last statement of a function is the return value, you wouldn't even have to return it. You just have 'case ...', and the value of the branch is what is returned.
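A hedged sketch of the Python 3.10+ feature referenced in the switch point above, plus Python's conditional-expression form from the ternary point (the status codes are invented):

```python
def describe(status):
    match status:                      # Python 3.10+ structural pattern matching
        case 200 | 201:
            return "ok"
        case 404:
            return "not found"
        case code if code >= 500:      # capture pattern with a guard
            return "server error"
        case _:
            return "unknown"

label = "success" if describe(201) == "ok" else "failure"   # do_true_path if expression else do_false_path
```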
There's a lot more. Some of it is subjective, but my belief is that once someone really knows both, they will prefer the Ruby way. And the more languages you know, the more you develop refined tastes. Ruby still holds up well after knowing 10 languages for me.
(added): the whole whitespace as code thing of Python. It does have one common pitfall, and that is when a line of code gets accidentally indented or unindented below a block which was indented. That line changes scope, likely changing the runtime result; but it may be technically valid, so the developer may not notice the mistake. This is just not a problem with languages that have { } or begin/end delimiters.
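The pitfall described above, in miniature (a contrived, hedged example): an accidental dedent keeps the code valid but silently changes what it does.

```python
samples = ["a", "b", "c"]
for sample in samples:
    print("processing", sample)
print("saving result for", sample)   # meant to be inside the loop; now runs once, for "c" only
```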
> Readability counts. In some fields, it counts more than anything. I've worked in about 10 languages now over the last 20 years, and Python is still the easiest to read when you come back to some old code or have to pick up code for a small job, or hand it to a beginner to extend without having them create an unreadable mess. This is what scientists need to do all the time.
Meh. Python might be readable at the smallest scale, but then COBOL is even more readable. What matters is large-scale development, and your implied point that large Python projects turn into unstructured big-ball-of-mud monstrosities is well taken. A big ball of mud is not surveyable, or "readable".
Which is where other modern languages (e.g. Julia in the scientific programming domain, heck even Go or Rust) will probably have an advantage.
Think what you will; the scientists disagree. Which was the point. I'm not holding my breath to find many scientists matching my description who would rather learn Go or Rust...
I'll be the counter example. PhD in life sciences, but I've also been programming since I was a teen. Rust is by far my most used language for both general fun projects and in my role as a programmer in the life-sciences. Python is OK for ad-hoc analyses, but I cannot stand to use dynamically typed languages for anything "real" given how much difficulty dynamic typing imposes on reading and understanding code.
Sure, but by your description, you aren't really the people I'm describing. If you've been doing this since you were a teen, you're a "Real Programmer". My point was that people who have to do this as item 7 of 10 things in their job description are very much less likely to learn something like Rust than Python. That is undeniably a bigger lift to a non-programmer. Python's success in the sciences is in large part due to how good a fit as a language it is for part-time occasional programmers.
I like all kinds of languages, but the only ones I would encourage my partner to bother with as tools for her science work would be R and Python.
They might even care, they just don't know any better. When the typical scientist "learns to code", there's no one around telling them how to do proper software engineering.
At best, they engage in the polite fiction that it just doesn't matter, because all that code is inherently "throwaway" stuff that's only used for playing with in the context of research. Of course even that is wrong, the code doesn't really disappear like that.
No. Scientists are smart people. It's not that they don't care or don't know better, it's that they have different priorities. Every scientist I know is smart enough to be well aware that they don't have the know how for proper software engineering, but they also do not have the time or resources to learn to write code the way you would for a long term product.
I would not be patronizing to these people, they are very smart. They just live and work in a world that is completely different from technical product firms in pretty much every regard.
You're just rephrasing the "throwaway code" polite fiction. Increasingly, publishable-quality research is expected to be publically reproducible, and that means the code must stay around, potentially in the "long term". Every scientist loves it when their research gets cited a lot, right? Well, those citations become worthless if you can't reproduce the research because the code is an unsurveyable mess relying on bitrotted, unsupported external components.
Some of the best code I’ve worked on has been python. Unfortunately, also some of the worst. 5000 line, single file “modules”, with spaghetti class hierarchies (5+ levels deep) and dynamic method calls making it nearly impossible to debug.
In a relatively terse language like Python, anything beyond a few screenfuls of code is already "large scale" development. It's unwise to keep it all in a single module.
Not to mention, what they are working on is often very abstract compared to the math many programmers are used to doing. I write a lot of boolean; my scientist partner writes regressions, surface transformations, eigenmaps, linear algebra, and so on. Imagine being something other than a programmer by trade, and trying to apply linear algebra to your problem without good tools or libraries.
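To make that concrete with a hedged sketch (invented data): an ordinary least-squares regression is a few lines with NumPy, versus a real project without such libraries.

```python
import numpy as np

x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + np.random.normal(scale=0.5, size=x.size)   # noisy synthetic data

A = np.column_stack([x, np.ones_like(x)])                      # design matrix [x, 1]
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]        # ordinary least squares
print(slope, intercept)                                        # roughly 3 and 2
```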
I'm afraid I don't agree. MicroPython is neat, but if there's one thing Python is not suited for, it's microcontrollers. Coupling one of the slowest scripting languages with low latency near-realtime requirements is not a recipe for success. It might be useful for teaching basic concepts, but it is not going to be useful for real applications. And Arduino already has the teaching of basic concepts nailed very effectively.
I certainly think that MicroPython serves a niche, primarily very simple hobbyist/educational roles. However, I do not regard it as suitable for anything beyond this. It's the wrong tool for the job, and if you want a scripting language for low-latency low-overhead use, there are smaller and more efficient languages which fit better into an embedded role.
You’re an engineer. And engineering programs will continue to use engineering focused tools. The goals of engineering program projects are vastly different than other areas of academia. Developing the tool is the project for engineering.
Bioscience / environmental science programs will find MicroPython good enough for their needs. The tool itself is just the means to the end of real science. MicroPython lets you deploy in lower-power applications without having to learn much beyond what you already know from Jupyter notebooks.
I really don’t know any PhD students or post-docs in microbiology/environmental sciences who have the time to learn embedded C or similar languages.
As a rubyist, it makes me sad that python ended up here rather than ruby. And I sometimes wonder why.
> As the name suggests, numeric data is manipulated through this package, not in plain Python, and behind the scenes all the heavy lifting is done by C/C++ or Fortran compiled routines.
So I wonder, was it easier to write C/C++ or fortran compiled extensions in python than it was in ruby?
Readability, 100%. I have programmed in large projects in both Python and Ruby.
Ruby is very productive to write, because everything and the kitchen sink is at your fingertips at all times.
But because of Ruby's many ways to skin a cat, everyone's code is very different. Add to that the penchant for domain-specific sub-languages in Ruby: new syntaxes that you might have to learn half a dozen of to integrate a large project, all of which end up being more limiting than if you could just, you know, write Ruby.
Contrast with Python, which goes so far at normalizing as to have a language-wide coding standard in PEP8. Python has its problems, package management and distribution is still ugly for example. But I can read any project I find and understand it without loads of context.
My impression is that Perl, Ruby, and Lisp all suffer from this issue.
Even proponents will say things like "this language is so expressive, I feel so productive in it, but people do such idiosyncratic and clever things that it's hard for anyone else but the author to understand".
That sort of "solo rockstar" programming culture doesn't really lend itself to large-scale FOSS projects, which need to be inviting to wide participation.
A language like Ruby can be very productive for someone who has climbed the learning curve to learn all its ins and outs. However, in my experience, this turns into a large productivity drain the moment someone else (who is less of an expert) has to touch it.
For large projects with multiple developers, readability should win over writability every time; most code is read more than it is written (my hypothesis). You can see evidence of this, given the success of languages like Python and Go.
That said, for scientific compute, in a lot of cases, writability matters way more, as your job as a scientist is to produce results as fast as possible, code quality be damned. However, only a small number of scientists are expert developers who have climbed the learning curve and can write code with ease. The vast majority of them are junior at best, and Python's approachability (which is rooted in its readability, of course) wins. With most of the people using Python, the ecosystem develops and there are no other viable alternatives. In the long run, I suspect even languages like MATLAB and Mathematica will die out as the open source stack becomes more mature and (eventually, if not already) significantly more capable. Julia might be a wildcard due to its (potential) performance advantages, but the aesthetics of the programming language is simply not in the minds of 99% of the scientific compute users out there.
I am totally interested in hearing opinions from people who have done serious hours of programming in both ruby and python, as to readabilty comparison.
If it's just people who have done a lot of work in ruby and little in python saying ruby is more readable, and vice versa, I don't find it very useful even anecdotally.
As someone who has spent a lot of time in both Ruby and Python (and specifically a lot of hours in the Python scientific compute stack), I would say that Python is significantly more readable. Python is also significantly easier to teach as opposed to Ruby, especially if the target audience already has a bit of programming experience (from MATLAB, or other courses).
I suspect the main reasons are:
1. Python's guiding philosophy of "There should be one-- and preferably only one --obvious way to do it". With later additions to the language, this is getting less true (3 different ways to format strings, shown in the sketch after this list; asyncio; type hinting; etc.). Some libraries also don't conform to this (matplotlib). That said, it's a lot better than the Ruby code I've encountered, which is like the wild west.
2. Python's syntax is reasonably simple to teach. The object model could be condensed into something very simple if you don't need a lot. With very basic knowledge, you can go a long way. Ruby's a bit more chaotic with things like inheritance, extend, and include; proc, block, lambda; having to use attr_accessor; syntax things like a = b could be a function call or not; if/unless; and many more things that are confusing.
3. Even basic things like loops in Ruby are not idiomatic, as it wants you to apply a function/block instead. Beginners, especially those with a bit of background, like their loops better than functional programming.
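The "3 different ways to format strings" from point 1, side by side:

```python
name, value = "tau", 6.2832
print("%s = %.3f" % (name, value))         # printf-style formatting
print("{} = {:.3f}".format(name, value))   # str.format
print(f"{name} = {value:.3f}")             # f-string (Python 3.6+)
```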
Having spent many years working on a Ruby code base, I still get lost all the time. Python in my experience has been a lot better, although recent Python versions have regressed a bit as they introduced more syntax to do the same things.
>So I wonder, was it easier to write C/C++ or fortran compiled extensions in python than it was in ruby?
Don't know about technical aspects of "easier" but it may have simply been an accident of history.
E.g. in 1995 (before Ruby 1.0 December 1996[0]), David Beazley was already wrapping C Language code to Python. Deep link to presentation:
https://youtu.be/riuyDEHxeEo?t=52m27s
So DB's Python scientific code was released in Feb 1996 and presented in July 1996. Python, released in 1991, was already being talked about in magazines by 1995. David's presentation also references Jim Hugunin[1], who authored the 1995 Numeric package which was the ancestor to NumPy. Once an ecosystem gets started, it can attract more mindshare and snowball into an insurmountable lead that neither Ruby nor Julia will ever catch up to.
In other words... If the opposite timeline happened and Ruby was released earlier in 1991 and Python later in 1996, things may have played out differently.
So folks like David Beazley and Jim Hugunin chose Python as the scripting host language for their C Language code probably because Ruby wasn't mature and well-known back in 1995. Apparently, Ruby didn't widely spread outside of Japan until 1998 when the first documentation in English appeared.[2]
In 2009 I began writing code for a new company doing natural language processing. I was the engineer at the time and got to pick my tools. I started with Ruby because I was sick of C++ and Perl and Ruby looked like the future. But I soon discovered the NLTK and then Numpy, and so I started playing around in Python. I never again wrote a line of Ruby… until the later-hired front-end devs threw a fit about not being able to use Rails.
It was clear at the time that there basically was no non-web Ruby community. Ruby was Rails and Rails was Ruby. Ruby had a nice little niche in 2009, but Python had Numpy and there were a lot of ML people doing lots of math, and Ruby wasn't going to cut it unless they wrote their own libraries, which wasn't worth the effort since Python and Numpy already existed and already had a growing community behind them.
I honestly think it all boils down to numpy being developed long before matrix libraries became a standard part of software development.
Ruby's early "killer app" (remember that term?) was Rails. Even to this day there is almost no major code out there built in Ruby that isn't ultimately related to building CRUD web apps. While Ruby may be losing popularity now, it moved the web-development ecosystem ahead in the same way that Python has moved the scientific computing world ahead.
20 years ago, if you wanted to use open source tools to write performant vector code, there was Python and a handful of OSS clones of commercial products. Given that Python was also useful for other programming tasks in a way that, say, Matlab/Octave is not, it was the choice for more sophisticated programmers who wanted an OSS solution and needed to do scientific computing. This created a positive feedback loop that persists to this day.
Given that Python remains a decent language relative to its contemporary peers and it has a massive and still growing library of numerical computing software, it is extremely unlikely to be dethroned, even by promising new languages like Julia.
Even to this day there is nothing even close to numpy in Ruby. I do DS work in an org that is almost entirely Ruby, but we still use Python without question because we know re-implementing all of our numeric code in Ruby would be a fool's errand.
Had Ruby had early support for matrix math, it wouldn't have surprised me if it had replaced Python.
But that begs the question -- why did numpy develop in python and not ruby?
The rest of the thread offers some suggestions though. One is simply that python was born first, and got the numpy precursor before ruby 1.0 even happened. Which seems like a thing.
Ruby had a numpy style library since the early 00's, I forget exactly when. But it never got the kind of momentum numpy and the Python ecosystem surrounding it did.
Lots of comments in this thread from people who's Ruby experience is only from the post Rails era after ~2008, and don't understand that the post Rails culture wasn't really a thing when Python was first gaining momentum for scientific computing.
Perl was my horse in the race. I attribute it's, lisp's, ruby's, etc loss to
1. "There should be one-- and preferably only one --obvious way to do it" being part of python's ethos.
2. ipython repl
1. pairs with jaimebuelta's artistic vs engineering dichotomy, but also plays into the scientist wearing many more hats than just programmer. Code can be two or more degrees removed from the published paper -- code isn't the passion. There isn't reason, time, or motivation to think deeply about syntax.
2. For a lot of academic work, the programming language is primarily an interface to an advanced plotting calculator. Or at least that's how I think about the popularity of SPSS and Stata. Ipython and then jupyter made this easy for python.
For what it's worth, the lab I work for is mostly using shell, R, matlab, and tiny bit of python. For numerical analysis, I like R the best. It has a leg up on the interactive interface and feels more flexible than the other two. R also has better stats libraries. But when we need to interact with external services or file formats, python is the place to look (why PyPI beat out CPAN is similar question).
Total aside: Perl's built in regexp syntax is amazing and a thing I reach for often, but regular expressions as a DSL are supported almost everywhere (like using languages other than shell to launch programs and pipes -- totally fine but misses all the ergonomics of using the right tool for the job). It'd love to explore APL as an analogous numerical DSL across scripting languages. APL.jl [0] and, less practically april[1], are exciting.
From what I remember, people were actively promoting Python as the first programming language already in the 90s. Many universities started teaching Python, creating a steady supply of non-CS majors who were familiar with Python but no other language. And because the community was there, people started building the ecosystem.
In contrast, I've never really encountered anyone advocating for Ruby outside web development.
van Rossum was one of the implementers of ABC, which indeed was created to experiment with how to develop a programming language for beginning programmers. (Note: he was not one of the designers of ABC.)
While van Rossum drew from that experience when making Python, the initial driving goal was as a scripting language for system admin tasks in the Amoeba distributed operating system.
David Beazley talks about this in a YouTube video somewhere. (Can't find it right now, maybe someone will in the comments.)
It was a lot of serendipity. Python was up and running when the US national labs wanted to collaborate and their tools all sucked. Since they wanted visualization this left only Tcl/Tk or Python/Tk. And Beazley was hanging around as a grad student in a national lab with a connection machine, no real boss, no real oversight, and very little budget. He built stuff out of Python, and it snowballed to other labs.
Timing could be a factor. Python was released in 1991. Numeric, the ancestor of NumPy, followed in 1995, the same year Ruby was released. So Python already had its hooks into scientific computing before Ruby even started.
Fortran interop (f2py in particular) was a significant factor, and as soon as you get one thing (in this case LAPACK and BLAS bindings) it snowballs. Also, Python is significantly more initially familiar for informal programmers and that’s critical; the hard part of learning a language is often believing that you can -and Ruby looks weirder than Python, so it makes people doubt themselves.
I don't know how easy it is in Ruby so I cannot give you a comparison.
However it is very very easy to write Python bindings for a C/C++ library with minimal work. Solutions range from "just works" like ctypes to "actually integrates with the language" like Cython. You also have automated tools for wrapping like pybind11 which does a lot of the heavy lifting for you.
It was multiple things, really. I would attribute ute some of it to Swig, Perl attrition, SCons/Software Carpentry, integration with GUI libraries, good documentation, and various other efforts in the mid 2000s. A lot of those things were solving research problems simply, and Python’s use just kept expanding.
Python was already taking over in many use cases by late 2000s.
Ruby was known, but it didn’t have the following at multiple levels in academia like Python did
You describe what happened, which I saw happen too. The question I have is why though. Right, why did python's use in scientific computing keep expanding, and not ruby's? Why was python already taing over many use cases by the 2000s, but not ruby? Why did python develop the following at multiple levels in academia, and not ruby? (Why is Perl attrition relevant, when ruby was in fact explicitly based on Perl?)
That's the question, not the answer!
It seems like a lot of the answer is NumPy, which makes the question -- why did NumPy happen on python, not ruby?
Certainly one answer could be "nothing having to do with the features of the language, it's just a coincidence, they chose to write it in Python, if those working on numpy had chosen to use ruby instead, history would be different."
But one hypothesis is that maybe NumPy wouldn't have been as easy in ruby as python.
Someone else suggested the first numpy release happened before the first ruby release, so that could also be an answer.
I think the difference is in the community. I've used both Python (extensively) and Ruby (a little bit). While the capacities of the languages are relatively similar, the people around the languages, at least the ones creating packages and driving the discussion in conferences are actually quite different, for some reason.
People attracted to Ruby are mostly of an "artistic mindset", they want to be expressive, write code that doesn't look like programming code and using "magic" like dynamically created methods, monkey-patching, etc is accepted or even encouraged.
On the other hand, Python attracts more people with "engineering mindset", they like straight forward code that's readable, clear and understandable, even if it's not as expressive. "Magic" elements are frowned upon: for example, imports are explicit and always included in each file.
Obviously, I'm exaggerating it, but I think is a clear differentiation between the communities.
My guess is that the "Python mindset" got into creating better integrations for "engineering applications", like NumPy or SciPy, and that created some positive feedback in certain environments. The main strength of Python is its rich ecosystem of third party packages. There's a compounding effect, making it grow faster and faster.
I think that’s exactly it, and that there’s much less understanding required to start reading and writing Python code. Ruby has some beautiful features, but they make it much less clear to newbies who are trying to figure out what on earth’s going on.
Ruby makes it easy to do "magic". Which is fun to write, but painful to read for others.
I've encountered real cases of ruby code where a simple code snippet behaves differently with and without a `require` (IIRC, some utility function added to a class with monkey patching). In another case I've also had to modify (and to some extent maintain) a codebase that relied on overriding `method_missing` in the happy case / normal flow. I was trying to find out where some method was being defined by grepping the whole codebase. It probably cost me half a day of unabridged profanity.
In theory you can do the same thing with python -- thing is it usually doesn't happen for some reason (likely the ones you mentioned). Something about the language features and the culture in the community lead to devs doing different things with the different languages. But the effect is real, and I know which language to avoid if I had the choice.
I was using Perl and Python in the 1990s for scientific work.
Around 1993 I got hooked on Perl. I read the Perl book and it was great. But 1) I couldn't figure out how to handle complex data structures (this was Perl 4), and 2) I couldn't embed it into other projects.
More specifically, worked on a molecular visualization program called VMD. It had its own scripting language. I wanted a language to embed in VMD that was usable by my grad student users. This is when I first learned about Python, but I chose Tcl because it fit the existing command language almost perfectly.
At around the same time, UCSF started embedding Python for their molecular visualization package, Chimera, so it was already making in-roads in structural biology.
I later (1997) went into more bioinformatics-oriented work, where I did a lot of Perl. I tried out one implementation (a Prosite pattern matcher) in Perl - which took me reading an advanced Perl book to learn how Perl 5 objects worked. I then tried the same in Python, a language I wasn't as familiar with. And it was just so much easier!
At this time Perl was THE language for bioinformatics, but I thought it was a difficult language for complex data structures. (Bioinformatics at that time was mostly string related, plus CGI and databases - Perl was a great fit.)
I then moved over (1998) to cheminformatics, working more directly on molecular graphs. Python was a much better fit for those data structures than Perl. I started using Python full-time, and it's been that way since.
We used a third-party commercial package for the underlying cheminformatics called the Daylight toolkit. It had C and Fortran bindings. Someone else had already written the SWIG configuration to generate Perl, Python, and Tcl bindings, but these still meant manual garbage collection.
I was able to use __getattr__, __setattr__, and __del__ to turn these into a natural-feeling high-level API, hooked into (C)Python's reference-counted garbage collector.
I presented a couple of talks about this work, got an article in Dr. Dobb's (!) and got consulting work helping companies which either had existing Python work, or were moving to Python.
By contrast, I don't think I heard about Ruby until 2000 or so, years after Python started entering structural biology/cheminformatics. [1]
I wasn't particularly cutting edge - others had already developed tool like SWIG, which was because Beazley and others were using Python at LANL. Numeric Python started in part because of work at LLNL and other research organizations. The concept already firmly established was that Python would be used to "steer" a high-performance kernel.
And Python in turn changed, to better reflect the needs of numeric computing, in particular, the "..." notation in array slices was added to make matrix operations easier. (This was 20 years before '@@' was added to simplify matrix multiplication.) I believe the needs of numeric computing also influenced the changed to "rich" comparisons.
This all took place around the time Matz started developing Ruby. Python had a clear head-start. And except for bioinformatics, Perl never had much presence in the fields I worked in.
So:
> why did python's use in scientific computing keep expanding, and not ruby's?
Because Python was in-use several years before Ruby, and already rather visible as one of the three main languages to consider in that space (Tcl and Perl being the other two).
> Why was python already taing over many use cases by the 2000s, but not ruby?
Because people didn't really know about Ruby, while Python already had a pretty large user community. Probably also because Python's work was all in English, while a lot of the Ruby community was using Japanese.
> Why is Perl attrition relevant, when ruby was in fact explicitly based on Perl?
Perl attrition started before Ruby was much known. The complexity of the language, and the cumbersome need to roll-your-own OO, made it difficult for me to recommend to the typical software developers I work with - grad students and researchers in the physical sciences with little formal training in CS. Python by comparison which easier to pick.
So a language which explicitly based on Perl also picks up that negative impression.
(FWIW, I think Tcl is an easier language to start with than Python.)
"""In 1995 the special interest group (SIG) matrix-sig was founded with the aim of defining an array computing package; among its members was Python designer and maintainer Guido van Rossum, who extended Python's syntax (in particular the indexing syntax[8]) to make array computing easier."""
"""The first public release of Ruby 0.95 was announced on Japanese domestic newsgroups on December 21, 1995. ... In 1997, the first article about Ruby was published on the Web. ... In 1999, the first English language mailing list ruby-talk began, which signaled a growing interest in the language outside Japan."""
""" I think my criteria for
selecting Python over Perl is still true for Python over Ruby,
in that it has too many special characters (like @ and the
built-in regexpes), features (like continuations and code
blocks) which are hard to explain well (I didn't understand
continuations until the Houston IPC), and 'best practices'
(like modifying base classes like strings and numbers)
which aren't appropriate for large-scale software
development."""
Nitpick: Numpy is the newest, revised and reconciled vector library for Python; The first one was called “Numeric”; then there was “Numarray” which was not fully compatible, which caused a bifurcated ecosystem; and then IIRC it was Travis Oliphant who decided enough is enough, created Numpy which was somehow magically backward compatible with both, and reunited the community.
> It seems like a lot of the answer is NumPy, which makes the question -- why did NumPy happen on python, not ruby?
One of Python's original use cases was as a macro/script language you could import into your C application. Adding python to your C app took a day or so, and a side effect often was you'd make a Python library out of your app's library and suddenly, you could write standalone Python that called your app's code. Because it was so easy to write a python wrapper for an existing C library, by 1995/96 when Ruby hit the scene, Python already had quite a bit of importable functionality. The first serious web framework for Python was Zope, and it came out in 1998. I think Rails was around 2005/2006, and it was really cool, but one of the rubs was Ruby didn't have the libraries that Python did. In reality, it's amazing how good Ruby and Python have been.
It’s all about the community. As soon as a language gets attached to a profession it’s hard to break. Ruby has primarily been a web dev language, also the syntax is bad =P
> As a rubyist, it makes me sad that python ended up here rather than ruby. And I sometimes wonder why.
Work on numerical packages and scientific computing started almost as soon as the language did, for instance the origins of Numpy lie in the Numeric package which was introduced in 1995.
And the core team introduced several niceties at the behest of the scientific community (advanced slicing for instance, more recently the matmul operator).
not sure. there are many factors that contributed to python's success.
i discovered the language in 98 or 99. it came with some obscure linux distribution and the tkinter module stood out for me. it showed pretty scientific graphs and charts. but the language has to reinvent its community many times since then.
my intuition is that it was popular in europe in the scientific community. not sure i can say the same for ruby.
The performance Python is a real problem but Python has succeeded because scientific computing really needs interactive and dynamic programming languages. You need something which lets you easily experiment with data, plot, change code in rapid iterations without constant recompilations and reloading of data.
This has been recognized for some time. The compromise had been to build performance sensitive parts in C/C++ and do the experimental/iteration part in Python.
But today you don’t really have to compromise anymore. We got Julia. It solves the whole problem. You get the interactivity you need combined with the performance.
Of course in my his industry momentum matters. Python has built up the momentum of an oil tanker. Even if you shut off the engines it is going to keep going for many years.
But Julia is the obvious end station. It does all the things HPC and scientific computing needs. But building mains share, documentation, community, polish tools etc will of course take time.
It solves the whole problem. You get the interactivity you need combined with the performance.
There are other aspects to the "whole problem" -- you also need a massive ecosystem with adoption across disparate communities (devops, web development, etc). And decades of momentum.
That's why Python isn't going away anytime soon, despite its slowness and warts.
As many already noticed, the rise of Python is not counter-intuitive at all. (I'm a scientist myself).
Basically modern python offers you a spectrum from easy to understand and quick to write python programs (those will be slow), to purely glue code that connects a lot of high performance c/C++/fortran code.
And many scientists will start from pure python code with the help of numpy. In many cases it will be good enough. But if needed you can always interface with other libraries, or write yourself high performance c/c++/fortran code for the most performance critical bit, and use python to glue it together. That flexibility where you can trade speed of writing the code with the speed of execution is very valuable.
At this point we can say that against the two criteria of a spectrum from prototyping to heavy lifting and ease of embedding external high-performance libraries, Julia is simply better than Python. Julia does have two drawbacks of being tied to the one, rather heavy metal, implementation and lacking the wealth of libraries outside scientific computing.
From just my personal experience, I've had a python code interfaced with C that I rewrote just for fun in pure julia. It was significantly slower then the C code and I couldn't as easily use OPENMP parallelization (although the symbolic derivatives are great). Obviously I know julia much less well, but so far in a few cases I tried, I could not convince myself that julia offers me enough advantages over my current approaches that rely on python & C/C++
As I guess you're hinting at, writing performant pure Julia is a quite different kettle of fish to writing performant C: it's not surprising that a first attempt isn't a rousing success. But there's a spectrum of rewriting possibilities: you can write Julia-interfaced-with-C as a direct analog of the existing code, and you can convert just those parts of the C code that you think would most benefit from Julia's JIT into Julia, leaving most of the heavy lifting in C.
This is changing BTW. There's been lots of improvements in escape analysis and hoisting allocations out of loops in the upcoming v1.9 that will start to make "bad codes" a lot less bad. In fact, it's already starting to impact how to write tutorials on what is a bad code haha.
True, although the Python-Julia interop is surprisingly good. Gradual migration of legacy code from Python to Julia might be a possibility for some of these groups. But I was really thinking about the situation for new projects when writing that comment.
Same thing, you will want to reuse existing code, and you do not want to split knowledge in your student group. There is no such thing as 'gradual' migration: I either have to support Python in my group, or Python and Julia -- until everything is migrated.
If I have a working ecosystem using Python, with students trained in Python, and all previous work in Python, there's a whole lot of opportunity cost associated with me deciding to have the next student use Julia. I'd rather have that student build on existing tools and knowledge and do something new with their time.
This article has been written a hundred times. "We abandoned a fast language for a language that is slow but can use fast libraries, and so the result is fast. It's faster because the programmer discovered existing libraries that do a better job of what they were doing already."
There's so many convolved factors here I don't even know where to begin, so I guess I'll just say that I'm glad Julia exists. The author glosses over many decades of programming language and compiler research -- which makes sense, because this is not their specialty. However, what I see is the field of scientific computing migrating from a dinosaur language (Fortran isn't, actually. It just is used this way) and dinosaur practices of writing everything oneself, to one of the slowest interpreted languages that happens to be the most difficult to JIT, and saying this or that about how interpreted languages are slow but library calls are fast to justify this. At the same time they're learning to build a functioning library ecosystem.
Basically, grad students are learning proper programming practices and collaboration after switching to a more expressive language, they just managed to pick the slowest and most difficult to optimize one. Maybe they just managed to wipe some of the slate clean by switching away from Fortran and its culture (the culture being the bad part), and the culture of Python filled the space, creating a net positive but somewhat unfortunate situation.
Just one more time -- the idea that you can call a "faster" language to do the heavy lifting is true of every language and does not justify the choice of Python in particular. The justification for Python is the momentum, and this is in my opinion the only one.
Your conclusion seems a bit reductive/circular and begs the question "yes but why does Python have momentum?"
Python is/was chosen because the syntax is clean and expressive and obvious if you speak English (which most people in this context do) and because although performance is almost always worse in interpreted languages, there are clear productivity benefits when doing the kind of programming that is demanded from data science / data processing / etc. Same for dynamic vs static typing and a number of other choices made by Python.
Specifically, many (most?) programs are not long-term maintained projects in this space. A lot of them are just little scripts to convert one dataset into another format, or scrape specific content from somewhere, or support a scientific paper that will not get updated after publication.
Python is sufficiently readable, and with the right extension, it is sufficiently fast for vast majority of the purposes. For Julia to truly gain momentum, I think it needs a "killer app/library". However, I'm not sure what it would be that would not already be built for Python.
My personal killer app would be a significantly revamped plotting library/app. While matplotlib is great, it is fundamentally based on imaged-based plotting. The next generation of data visualization, imo, will likely be interactive. Having an interactive plotting library that allows you to produce publication-quality plots faster and simpler (think of all the time spent aligning text manually..) could be a big deal, but it could also not matter as no one else wants the same things I do.
Have a look at Makie.jl[1] in Julia. I've been using it for exploring large data sets recently. Ticks your boxes. Jupyter version is image based though, as Jupyter is inherently static. You could use Pluto.jl[2] to build a reactive page.
Thanks for the link. Makie.jl looks interesting. I didn't find it last time I looked into Julia. I'll get it a shot at some point to see how usable it is.
That is a big part of the author's point: the library ecosystem is here for Python today. While there is a heavy penalty for anything written in Python itself, it doesn't really matter since there isn't much of a penalty once the data is passed to highly optimized libraries and those libraries allow developers to select efficient algorithms rather than implementing their own algorithms (which are likely to be less efficient).
In this thread I am seeing a number of explanations, including:
Ecosystem; mind-share; readability and engineering mind-set; history/Numpy/Matlab; teachability and academic focus.
There are also comments emphasizing the "dynamic" scientific environment and need to just pick up code left by others.
In terms of the latter, could one apparent requirement be this: The main contact should be with top-level code which at least looks like it is interpreted -- even if through compile-with-run-combined and/or memoization? Need part of the user interface, so to speak, be to hide all intermediate artifacts, even the very thought of object code and executables? That such stuff is for, say, "module creators" not primary users?
Do you think Julia will chip away at Python's marketshare, or Fortran's? I thought it was aiming to be more of a replacement for the latter, but I've never written a line of Julia in my life, so I am very uninformed.
Imo, it eats away at both. Julia makes it relatively easy to meet or exceed Fortran performance, but also gives you the high level abstractions and ease of use of a language like python. I think the biggest problem for Julia currently is the difficulty of AOT compilation and the lack of tiered compilation (like Java/Javscript). Making the story for either of these better would be a significant quality of life improvement for Julia, and would make it pretty much unrivaled for scientific computing in my opinion.
Julia advertises itself as solving the "two-language problem". This assumes that people first write exploratory code in python or something similar, and then rewrite it in Fortran etc.
So in this scenario, Julia takes marketshare from both.
Personally, I find that many Fortran codes are still used because they have been build for many years, and they can't be rewritten easily. On the other hand, new data science projects start all the time, and the transition to Julia is easy (and worth it in my opinion). That means that in my experience, Julia is mostly competing for marketshare with NumPy/SciPy/SKLearn/Pandas/R/Matlab.
If you mean whether it's possible, PyCall.jl has existed since nearly the beginning of Julia, and PythonCall.jl [1] is a more recent package for the same core functionality - calling into Python code.
Counter-intuitive? I picked it because it was the closest scripting language to C (see the select and socket APIs for good examples). And it had numeric array support early-on (making it an attractive replacement for matlab).
Python is an API to efficient scientific computing code. It's good for that, assuming you're using old and more verbose languages.
Look into Julia as a promising alternative -- the language itself is superbly fast (aside from initial compilation) and there's an impressive scicomp ecosystem to say the least, all written in native Julia. This allows for program rewriting / metaprogramming more broadly and is insanely powerful once you get a feel for it.
It is really remarkable how much more expressive some languages are over others. If you are satisfied with Python for everything you do, then you are not hitting the limits of its expressiveness. But for more naturally expressive code, other languages may have huge advantages for certain applications.
I feel like python acts like a kind of bus in scientific computing, connecting various high performance libraries and DSLs together.
That said, this article's story of someone using the wrong algorithm is a bad example in my view. Python hasn't succeeded because people are more likely to use more efficient algorithms due to easier experimentation, it has succeeded because the of the size of the ecosystem and the fact such algorithms are easily available.
I recommend one of the recent videos by Dave Beazly [1]. He lived through and contributed to the raise of Python in scientific computing first hand in the 90s, and offers some interesting insights. Plus he's always quite an entertainer.
For those unfamiliar, CERFACS (Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique, i.e. European center for research and advanced training in scientific computing) is a leading research institution, with two main branches: meteorology, and engineering computational fluid dynamics. I am not affiliated and can only evaluate the engineering part, their combustion modeling group is one of the best in the world.
A lot of thanks should go to Oracle. Back in the days Java was go-to language for everything. After Oracle acquired it in 2009, the only respectable languages with good numerical libraries were Python, Julia and R. Unfortunately, Julia’s marketing wasn’t strong enough and R was decisively an ugly thing to work with.
Can you explain what you mean more in detail? Libraries can't change the syntax of the Python language, not in the formal sense.
Is this about things you want to be able to express in syntax but can't? Or the other way around - stuff that uses syntax/operators but should really be methods?
Numpy syntax comes to mind. The extra commas often aren’t valid pure Python but are required for some operations on numpy arrays. I don’t know how this works under the hood, but expect it’s a state machine under the numpy ndarray looking for the extra commas and such.
i.e. some_array[0:5,0] which isn’t valid pure Python notation.
Extra commas are "valid in pure python" in the following sense that I can demonstrate.
Open ipython3
In [3]: class Test:
...: def __getitem__(self, index):
...: print(index)
...:
In [4]: Test()[1, 2, 1:3, ..., :]
(1, 2, slice(1, 3, None), Ellipsis, slice(None, None, None))
It's valid and we get the complicated tuple of integers, slices, ellipsis etc as printed.
Numpy has existed for a long time. Its needs have been taken care of in upstream Python, to a big extent, and other libraries can use the same features.
Interesting! Neither myself nor my coworkers could get the snippet I posted working outside the context of an ndarray, so I had speculated at that time that it there was something else going on under the hood.
You seem to have a much better grasp of Python than us, would you mind posting an example where the snipped I posted successfully accesses data from an array in pure Python? That way I can not only take the L, but correct the record and learn something in the process.
This program is quick & lazy but it uses a 1D python list and pretends it's a 2D list. It implements 2D slicing, giving you a square subset just like ndarray. It doesn't intend to be all correct or nice or useful.
I'm not sure what you're referring to. Nothing you import into Python changes its syntax.
Maybe you're thinking of things like x[:, np.newaxis] where x is a numpy array? This is valid Python code outside of numpy as well, although the built-in data structures like lists and dicts won't know what to do with the :.
To be precise, you could model the same behavior on your custom types by using the dunder method magic. In that case, everything is "valid" Python code.
Numpy and Pandas libraries have some non-standard ways to slice arrays, get the subarrays and the data from them.
What language wouldn't suffer from this, besides APL? Even very recent and well designed libraries like Elixir's Nx look like another APL-like language bolted on. Pipe syntax helps but not much.
I wrote scientific python for several years at a university research project, coming from a statistics background. I wrote a forecasting tool and related plotting, simulation, ML, evaluation etc tools.
The reasons for python’s success are obviously the ecosystem. Numpy is the foundation. On top we have sklearn, statsmodels, pandas, matplotlib. Before our project most work in the department was done in Stata, a proprietary language/tool that works well for some classical regression and stats work but falls apart as soon as things get complicated. Moving to python allowed us, a group of social scientists, to work on some really hard problems.
Now we have boosted tree models and other tools that just can’t be used in the old tools like Stata.
I am really curious how Zig lang eventually does in scientific computing. It's already speedy compiler, language server (zls), and upcoming hot code reloading feature, makes me think that reactive coding and visualization notebooks in Zig should be feasible. Although, Zig has no operator overloading, and no dynamic dispatch though, making it fundamentally pretty different than say, Julia lang. Just as an aside: for my day job, I write Python in a scientific computing (geospatial and ML).
I'm genuinely surprised that no one here is mentioning D language in addition to Nim or Julia for replacing Python. D has already beaten Fortran in speed more than 5 years back, the legendary scientific programming language that's mentioned in the article [1]. The Fortran based libraries that are overcome by the D language apparently are still being used by Python, Nim and Julia for most of their high speed processing until today. As they always said the proof is in the pudding, and compare to all alternative D language is designed to have a similar feel to Python. By default it supports GC for easier and manageable scientific programming that is very attractive for the type A data scientist that are mainly deals with analysis and exploratory programming [2]. The latest D language is now also natively support the C language (lingua franca of scientific programming) in its compiler thus can import and compile C files directly [3].
[1] Numeric age for D: Mir GLAS is faster than OpenBLAS and Eigen:
> Of course, If the best algorithm is known beforehand or the manpower is not a problem, a lower level-language is probably faster, but this is seldom the case in real life.
One is wary of one-dimensional analysis of anything in a software context.
Who cares if the Fortran library runs like the blue blaze, if it cannot be readily maintained?
It is possible to write maintainable modern Fortran without gotos with small functions and subroutines. OOP with inheritance and dynamic polymorphism is possible since the Fortran 2003 standard.
This article expresses the ancient Python(/Matlab) v Fortran argument beautifully ... but it's kind of shocking that the argument is still going on at all. My generation came out of school happy to use FORTRAN indirectly, via a scripting language, for rapid prototyping. That was 30 years ago.
I don't think Python displaced Fortran in HPC as much as it displaced Matlab (and Octave) and R in scientific computing.
Displacing Fortran was a side-effect of that trend, as now it wasn't about productionizing Matlab code into Fortran, but Python could do general purpose computing adequately as well.
Python excels in several domains. For example, the non-speed-critical numerical computing this article is about. It's also nice for backend web development, and scripting. Embedded isn't one of its strengths, and I'm suspicious micropython was an attempt at bringing embedded programming to people who don't want to learn more than one language.
As the article notes, various numerical kernels have been wrapped as Python compiled modules/libraries, and numpy and other systems seem to work OK for many applications.
People always give the argument that python calls c++ libraries, but I use both Python and c++ a lot, and writing c++ directly, calling the the same libraries, is way faster.
I suspect the reason people claim that is they are training ML models, which may take O(hours ~ days) to run anyways. In this usecase, C++ calling C++ is faster than Python calling C++ slightly in O(seconds) is outweighed by the fact that Python is more convenient for the ML practitioner trying different models.
Python is what has been popular for the last 15 years. Scientists are not programing language geeks, they just use whatever is popular, viable, and established.
Harsh isn't the same as shallow. I don't believe snark is ALWAYS uncalled for, only that you deploy it when it's reasonably necessary. And this is a thing that people, especially HERE, get wrong a LOT. That it's unpleasant or hits people where it hurts doesn't mean it's not important, and I think it is.
I wonder how many days have been wasted on non-programmers trying to get their Conda environment up and running or similar. Half the data science stuff isn't reproducible, not because of the science, but because getting the notebooks running with its dependencies is almost impossible.
I think a lot of this has to do with just how bad/incomplete the docs are, how unnecessarily janky the shell integration is, and how the Anaconda launcher itself makes a huge mess and actively works against best practices.
The docs for building your own packages are even worse, to the point where you basically are left copying snippets from Conda Forge to build anything nontrivial.
Basically Conda is a tremendous engineering achievement, but it's very much still a "first draft" in a lot of ways, and Continuum/Anaconda made some weird decisions that work against its user-friendliness. Imagine for example if third-party repos on anaconda.org could have a description box, link to a homepage, etc...
> non-programmers trying to get their Conda environment up and running
I see this issue brought up a lot, but I have yet to see a language that addresses this reliably. By definition setting up an environment for non-programmer is a tall order, what language should they use?
I'm just grumbling because even I as a professional dev can sometimes spend days getting some python project up and running correctly. Then I feel sorry for non-devs for which all this is only a tool.
The simple and easy Java way to do it is to just bundle everything into a Jar. Then it really is a single file "environment". Then you only have the problem of different Java versions rejecting the jar file because it is too new grumble.
I'm increasingly convinced that the majority of so-called "data science" is pure sciencism with little to no actual science for exactly this reason. It's reading correlations from digital tea leaves.
> Half the data science stuff isn't reproducible, not because of the science, but because getting the notebooks running with its dependencies is almost impossible.
As someone who did scientific programming in other languages (Fortran/C++), I can assure you the nonreproducibility was there in those projects as well. Not because of the tech stack but because no one valued reproducibility.
The current situation with notebooks isn't worse. It's more of the same. I think people criticize it more because notebooks are advertised as reproducible research.
Because it's a hard problem and people love hating on Python because it doesn't come with a way to handle all the compiled dependencies that work for every OS.
Data science has a ton of moving library parts, it is genuinely difficult to distribute precompiled libraries for everyone when you have 2-3 actively maintained CUDA version with 2 cuDNN version for accelerators that change every 2 years. Most team fail to standardize on an environment (say Python 3.8, Ubuntu 20.04, CUDA 11.1, and cuDNN 8) and then get hung up on a dependency not building as if it's Python's fault that it does not have control of your entire OS.
But why is it such a big problem in Python compared to other stacks? Why does all python projects end up depending on you having those exact tools of things installed locally and the planets aligned a certain way, when other stacks do not?
It's not a big problem in Python in general, only in scientific computing / number crunching projects, because of the dependencies on huge complex software, some of it ancient, written in C, Fortran, and C++. So why do we hear about this problem in Python a lot? Well, because it's what's used for the glue/frontend, which is what users work with directly. It's selection bias. Sure, another language might fare somewhat better or worse for this or that reason, but at the end of the day it's gonna be a pain in the ass (at least until next-generation, complete, deterministic, language-agnostic solutions like Nix/GUIX really gain traction).
There are 365k projects in the "official" package index. While not all of these are important, it's a tip-off to the magnitude of the problem. The habit of blowing past a problem by grabbing a random library and moving on to the next problem leaves us with a mess of dependencies. And many of those were either written by amateurs like us, not maintained, etc.
Maybe other languages have fewer libraries, or maybe the habit of grabbing libraries at random evolved concurrently with the rise of Python.
My team has a rule that we don't let a project get past a certain stage without proving that it can be installed and run on a clean machine and archiving all of the necessary repo's with the project. It's easily forgotten that testing your installer is part of testing your program.
I'm not sure that it is a problem in Python more than other languages.
It might look worse because many Python projects use tools such as CUDA, which are notoriously dependent on the specific OS, architecture, method of installation etc. But that same issue will exist in most languages - if you're linking against CUDA, you will sometimes have problems with the package installation. Particularly if you try to run the code on a different OS, CPU architecture, using a different GPU, etc.
I don't think it really has anything to do with Python. It just happens that most people doing work that depends on tricky packages such as CUDA also happen to be using Python.
Python is for all intents and purposes a "glue" language. You don't do the heavy computing in Python, you just pull in a C++ library that has a Python interface. This adds a ton of friction because these dependencies will often not be precompiled so you need to have the right system libraries to build the module before using it.
It's not much of a problem for other stacks because they either are fast enough that they have a library written in the same language for problem X (C#/Java/Rust) or they aren't targeting the same type of work (JS, Ruby, etc...). C++ has the exact same problem as Python and I'd argue that it's even worse.
They have: Nix and Guix for example can handle Python dependencies as well as native dependencies, and can build them reproducibly. They just haven’t caught on yet.
As someone who uses numpy almost daily, I think that numpy is "overextended" beyond its core niche, sure. So - making it work with things outside that niche (e.g. streaming, non-rectangular data, non-uniform data, nonhomogeneous data, etc) is painful. However, 1) there's Pandas for that, and 2) I disagree with "misleading" and "surprising". What makes you think that?
Not GP, but I’m using pandas daily to build up a BI platform within a financial institution. Compared to Matlab and even Fortran it has some issues IMO:
* why distinguish between Series and DataFrame? just give me an interface for m x n matrices or even higher dimensions.
* pure vs. in-place operations. not such a big fan of having multiple versions of the same function, e.g. a more pythonic
df[“my_col”] = series
vs. a more functional
df.assign({“my_col”: series})
; I’d rather have everything like the latter to be able to more easily have best practices in place.
That brings me to another point: if we keep everything purely functional, then python’s syntax is making things a bit awkward. Where in something like JS you could just put every function call with its dot on a new line without the need to assign, in Python this requires putting line break characters or wrapping it in round brackets. This is one place where a language with explicit assignment terminators (semicolons) are a bit cleaner to work with.
All that being said scipy is still a great choice to have both system programming and numerical business logic in one language.
Little things, like some functions want to be called with a tuple of dimensions,
np.zeros((rows, cols))
others just want to be called like
np.random.randn(n, m)
The 1d array is a huge, fundamental design flaw in numpy. It makes zero sense that I can do matrix-vector multiplication against both an nx1 2d array as well as a 1d array. The latter is complete nonsense.
When you slice a column from a matrix, and get a not an nx1 vector, but a 1d array, it makes me want to shell out $10,000 for matlab (yes, I know I can get a column vector with the slice A[:, [2]], but I shouldn't have to).
This problem leaks out into the ecosystem. For example, when you try to use scipy to integrate an ODE, and pass it an initial condition vector that is nx1, the scipy integrator will silently coerce your vector to a 1d array, pass it to your RHS function, which then either blows up, or more likely, produces silently wrong result because of numpy's insane array broadcasting rules.
This problem further leaks into the ridiculous function hstack. If you just used the function vstack, which made a 2x3 matrix from 2 1d 3 element arrays, you might imagine that hstack would produce a 3 x 2 matrix. But no. It creates a 1d 6 element array. For what you wanted, you actually need np.column_stack.
I think the way Eigen handles this is the most intuitive. You do linear algebra with 2d objects, and cast to arrays for elementwise operations.
There is also a huge inconsistency between what numpy exposes as an object oriented interface vs a "functional" interface. What I mean by this, is that I can call x.sum() on an array, but not x.diff(). For that, I need np.diff(x). There seems to be no pattern to what is exposed as a method vs a function.
The array slicing api is also really inconsistent. For instance, given a 3 element array x,
a = x[5]
is an IndexError. However, this perfectly fine
a = x[2:5]
I just can't forgive that this is not also an IndexError.
In my opinion and experience, I think you’re right about “there’s Pandas for that” and “that” can be almost anything. It can do almost anything but making it do almost anything requires constant reference to the docs. And I find maintainability difficult. It seems like there’s 50 kwargs for every method. Sometimes things happen in place by default, other times they don’t. Compound indexes still confuse me. But I’m not a data scientist so I don’t do much ad-hoc analysis that seems typical with pandas users.
I actually don't remember the details, I haven't used numpy in 4-5 years. I remember being bitten a few times by some operators that had a different behavior based on how you had arrived to what looked to be the same data. These were issues I don't remember encountering with e.g. Mathematica, MatLab or R, but then, I was manipulating different kinds of data.
Next time I find myself manipulating numerical data, I'll definitely take a look at Pandas!
Pandas is better than nothing, but I would look to R's dplyr/tidyverse for a really well-designed tabular data manipulation ecosystem. Compared to tidyverse, the pandas API feels bloated, obscure, and inefficient. I often see people using very slow apply-based solutions in pandas because the faster solution is so non-obvious.
The tidyverse ironically ends up feeling more Pythonic, with more of a "there is one obvious way to do it" vibe.
Pandas is some probably very nice and clever Cython wrapped up in disastrous Python. As someone says below, doing anything requires constant reference to the docs, unless you did it yesterday. The semantics given originally to square bracket indexing have unacceptable edge cases with weird fallbacks, and instead of fixing it a bunch of other strange indexing syntaxes have been added on (but any Python programmer will use square brackets first). It's basically a distinct language (and a powerful one if you use it regularly).
jupyter notebooks encourage disorganized, unprincipled programming; chaotic re-running of cells in the face of global mutable state; and prevent budding programmers from learning to use version control because the JSON format was designed without version control in mind.
For Jupyter, it depends on the workflow. Especially with data sciences. In data science, you spend a lot of time playing with the data, testing things, drawing charts, computing, etc. When you do that, the cost of starting a python interpreter, loading the imports, loading the (usually big) data becomes a real pain 'cos you iterate like hell. Working in a REPL becomes really important.
But even more, working with Jupyter allows you to work out a very detailed explanation of your thought process, describing your ideas, your experiments. Being able to mix code and explanations is really important (and reminiscent of literate programming). You got the same kind of flow with R.
As data scientist, I'm concerned about data, statistics, maths, understanding the problem (instead of the solution). I don't care about code. Once I get my data understanding right then comes the time of turning all of that into a software that can be used. Before that, Jupyter really gives a productivity boost.
For the code part, yep, you need other principles where Jupyter may not be suitable.
It's interesting, I never feel like I get these exploratory benefits from jupyter notebooks. I just end up feeling like one hand and half my brain is tied behind my back. I'm most productive iterating in a similar way to what you describe, but in an ipython terminal, running a script and libraries that I'm iterating on in a real editor. If there are expensive computations that I want' to check point, I just save and load them as pickle files.
I have to say I think a jupyter notebook format is a 10x improvement in productivity over ipython. It's just so much easier to work with - and a step more reproducible too, my scribbles are all there saved in the notebook, at least!
really interesting. I may have overlooked IPython a bit (I just thought Jupyter was its improved version). For the moment, maybe like you, I prerpocess the data (which takes minutes) into numpy array which then take seconds to load. But once I add imports, everything takes about 5 or 6 seconds to load everything I need. So Jupyter remains a good idea. Moreover, I love (and actually need) to mix math and text, so markdown+latex maths is really a great combo. I dont' know if one can do that in IPython, I'll sure look!
I've programmed in a number of languages over the past 40+ years, starting with BASIC, and every one of them encourages sloppy coding. The good discipline always has to be taught, learned, and willingly practiced. The closest I came to a language designed for teaching good practices was Pascal.
I find it easier to read and understand bad code written in Python, than good code written in the C family languages.
Yes, what I was saying was that writing Python code in files is a better and more educational way to program than writing Python code in a Jupyter Notebook. It wasn't a criticism of Python.
I use Jupyter a lot, but have a personal rule to do "restart kernel and run all cells" once in a while, to scare up any kind of hidden state or out-of-order execution problems. For instance, if I'm about to leave a notebook for a while, I'll make sure it runs without error from top to bottom.
In that sense, I'm making it work like Python code in a file. The advantage of code in files is that I can use all of the slick code analysis tools that will warn me about my mistakes. I wish there were something that would let those tools go through the code in a Python notebook from top to bottom.
papermill is good and ploomber is a thing to watch.
Ploomber makes it systematic - store notebooks as .py (py:percent files for example), parameterize them with papermill and execute as a batch job. One can view the resulting jupyter notebooks as .ipynb later and produce reports as html if wanted. It's really good already, and better if ploomber gets more development.
The whole reason it works is because it's easy to open the .py notebook and work on it, interactively, in jupyter.
The main idea - jupytext for .py notebooks and papermill for parameters & execution - that's already "stable" and easy for anyone to use for their own purposes.
Maybe I haven't come far enough with my ploomber use to tell yet! It works nicely but I know I'll learn more and open my eyes more as I go.
As a first impression, I eventually found meta.extract_upstream = False which I think is important. Reason: The code for each step should be a lego piece, a black box with inputs and outputs. That code should not itself hardcode what its predecessor in the pipeline is - you connect the pieces in pipeline.yaml. (extract_upstream = False is not by itself enough to solve this, since you also need to be able to rename inputs/outputs for a notebook to be fully reusable as a lego piece, but it's good enough for now.)
I also for my own sanity need to know more about how the jupyter extension part works, how it decides to load injected-parameters or not. But maybe I could learn that somehow from docs.
In general I want components that are easy to understand and plug together and less magic (but the whole jupyter ecosystem's source code feels this way to me unfortunately, lots of hard to follow abstractions passing things around). But it's developing rapidly and already very useful, thank you so much!
I'll ensure we display the "extract_upstream" more prominently in the docs, we've been getting this feedback a couple times now :)
Re: the Jupyter extension injects the cell when the file you're opening is declared in the pipeline.yaml file. You can turn off the extension if you prefer.
Feel free to join our community, this feedback helps us make Ploomber better!
Being able to hack out code to explore and experiment with data while not having to reload and reprocess data (thanks to that global mutable state!) saves a hell of a lot of time in the long run.
The one place matplotlib sucks is any kind of interactivity. But other than that, matplotlib has the best, most intuitive interface of all the python plotting libraries I've tried. It's also one of the few libraries that doesn't rely on generating html for a webbrowser, which makes for a miserable workflow.
I still think Matlab's plotting is untouched by open source options.
Matplotlib hides a lot of complexity if you ask me. As soon as you do something in a different way than intended you're off searching stackoverflow for a post that did something similar to what you want. Then you tweak it a little and hope it works.
Not infinity, but yeah it's worth more than people generally think. But in the end you don't really lose many clock cycles anyway because everything actually runs in C/CUDA/etc. behind the scenes
This feels misguided, too. It begs the question that python has good usability. Anyone that has tried managing dependencies in it will know that is mostly a lie.
What python had, was that it was preinstalled on many computers and then had a large cohort of users that are insisting that others use it. And mostly force proclaiming that it is easy and readable.
I'll not claim that it is hard, per se. More that it is not intrinsically easier than any other dynamic language.
For evidence, the main packages that are popular are often clones of packages from other environments that were not widely installed. Jupyter can be seen as free version of many scientific applications. Matlab, Mathematica, etc. Matplotlib is rather direct in it's copy. Pretty sure there are more examples.
Is managing deps in Python a pain? sure is!. Is it a pain in the other contenders of easily available dynamic languages? Yup. So that's a wash. Managing deps in dynamic languages is not a simple problem, I can't say I've tried one that did it super well yet.
I'm... Not sure insulting other languages is a path to victory.
Npm, as much as it annoys me, is light-years ahead of anything in python. Quicklisp is rather pleasant, now. Ruby has had gems for a long time.
I grant that it is a hard problem. I am not griping that it is not solved. More that the option community has largely failed to even pick a direction. The most used dependency methods are, in the modern spirit of python, deprecated already.
> Npm, as much as it annoys me, is light-years ahead of anything in python.
Hence PEP 582:
> This PEP proposes to add to Python a mechanism to automatically recognize a __pypackages__ directory and prefer importing packages installed in this location over user or global site-packages. This will avoid the steps to create, activate or deactivate “virtual environments”. Python will use the __pypackages__ from the base directory of the script when present.
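You can already approximate that lookup by hand today. A sketch of what the PEP describes, with the X.Y/lib layout taken from my reading of the PEP's examples (the real proposal would build this into the interpreter):

    import sys
    from pathlib import Path

    # Prefer a local __pypackages__ over user/global site-packages,
    # roughly what PEP 582 proposes to do automatically.
    pkgs = (Path(__file__).parent / "__pypackages__"
            / f"{sys.version_info.major}.{sys.version_info.minor}" / "lib")
    if pkgs.is_dir():
        sys.path.insert(0, str(pkgs))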
Right. I am confident they will get things in place to solve this. That it took so long for them to even want to strengthens my assertion that the community piggybacked on the operating system's package management for a lot of its initial popularity.
It isn't that similar. Most languages actually have solutions that the community is pushing together. Python alone bungled a major version change and then refused to endorse a package management system for so long.
Yeah it’s counter-intuitive, and it’s because it does not make much sense.
Slowness is one thing, but the tooling is also clearly subpar compared to languages of similar popularity; the dynamic typing makes things difficult to maintain; the 2.7 vs 3 shit show; etc., etc.
The very fact that many smart people have been saying for years that Python is a fairly bad tool for data analysis should at least raise some people’s eyebrows. But no, the entire field of data science has decided that it knows better…
Python won because people who knew math/science domains only knew Python (or it was the best language they knew). And so they made libraries for Python. And it propagated like many other bad ideas based on ignorance.
Python is a miserably bad language for modern times. If you know any of half a dozen other languages, then you understand.
There was a good essay, from Paul Graham?, about the ladder of awareness of programming languages. Unfortunately I can't find it now.
The point is, Python has won and is frankly terrible. It has inconsistent features, an awkward OOP approach (at a time when OOP is finally being recognized as bad in itself), and it seriously lacks basic language features that are only now appearing as of 3.9 and 3.10.
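(My guess at the kind of features meant, not the parent's list: dict union and builtin generics only landed in 3.9, and structural pattern matching only in 3.10.)

    # Requires Python 3.10+
    defaults = {"retries": 3, "timeout": 10}
    overrides = {"retries": 5}
    merged = defaults | overrides        # dict union: new in 3.9 (PEP 584)

    def first(xs: list[int]) -> int:     # builtin generics in annotations: 3.9 (PEP 585)
        return xs[0]

    match ("ok", 200):                   # structural pattern matching: 3.10 (PEP 634)
        case (status, code):
            print(status, code)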
Frameworks like Django and Django Rest Framework expand on these bad ideas, creating monstrosities which make the PHP code of yore look arguably decent.
Sadly, I don't think there's any way to kill this. The only option is to vastly outperform the Python people and produce reliable, readable, performant solutions in half the time and beat them to market. Perhaps someday they will die off.
One of the reasons for them knowing it in the first place was false marketing that python "reads like English" (as if that would be a good thing).
The problem with these really smart people is that they hate not knowing everything... and a lot of them, a decade or two ago, were never exposed to programming before they started doing research... so when they hear that it "reads like English", they feel that they can conquer it... eventually, being the smart people they are, they learn enough to get their jobs done... and some of them learn it quite well, while others write terrible code that somehow works, though they themselves don't quite understand why. But most of them would not take on another language unless someone comes up with a false claim that "it is easier than English" or some BS like that.
> One of the reasons for them knowing it in the first place was false marketing that python "reads like English" (as if that would be a good thing).
Sounds like the story of BASIC (and a bunch of other early languages besides - BASIC was originally a simplified variety of FORTRAN, with a REPL-style, terminal-driven workflow tacked on as a key innovation), except that Python is a lot more semantically complex than BASIC, even at a novice level. Perhaps we could build development tools that make, e.g., Rust read "like English" too. (After all, the Rust compiler's diagnostics literally read like nicely phrased English, so extending the same approach to the rest of the language representation has some meaningful precedent.) Then novice scientists might learn to program by tweaking their code and reading what it actually means, pretty-printed in plain English.
At one of my previous companies we hired a really intelligent (generally) and very likeable data scientist who arrived and wrote the worst Python I've seen. Actually it wasn't the worst, because it was all inline copy/paste code rather than convoluted OOPish code.
He could solve very complex problems, but his tooling was horrible. It wasn't his fault: he had a solid education (with a PhD) from a German STEM uni, but there was a serious lack of programming skill.
It would seem that because Python is "so easy" to get started with, people don't feel they have to bother learning any real programming skills beyond solving their immediate problem.
I don't blame this on the scientists; software is not their domain. The problem is with PHBs who don't know better and who make decisions based on the toolset used by the "special" people.
> Python won because people who knew math/science domains only knew Python.
This doesn't explain why they knew Python in the first place, a pretty critical step. It reached popularity without a platform mandate (JS, Swift) or corporate backing (Java, Go), so there's something going on.
Python has a lot of great libraries. It's the inverse of the chicken-egg problem.
Because there are now some critically important libraries (pandas, numpy), it is the obvious starting place if you want to hit the ground running with minimal effort. I think that's totally fine for uni. But there should be a capstone-level class for data/AI scientists before they can graduate, one which shows other languages and teaches some general best practices of software development.
There are plenty of other languages that can do the same job. And honestly, the algorithms that are available can be recreated where they don't exist. Most of it is not "rocket science".
But the greater problem is that Python itself is a poorly designed and warty language. Whether a scientist or not, choosing Python means fighting these warts. No amount of make-up can cover some of these; and plenty of other languages start with clearer foundations.
>This doesn’t explain why they knew Python in the first place, a pretty critical step.
>>Python has a lot of great libraries. [...] But there should be a capstone level class for data/ai scientists before they can graduate which shows other languages
I didn't downvote your gp reply, but your answer just pushes the question to an earlier point. Why did early (circa 1995) programmers at science labs like David Beazley and Jim Hugunin, who already knew "other languages" such as C, assembly, Fortran, etc., choose Python as the scripting wrapper for their C code? See my other comment about their earlier history: https://news.ycombinator.com/item?id=30813528
The "Python having a lot of great libraries" wouldn't have been a compelling reason for David Beazley since those earlier creators of scientific packages for Python chose Python before it had a lot of scientific libs. They were among the very first.
Here are some bullets from another deep link at a different point in the video[1]:
- David's first attempt was writing his own homegrown scripting language.
- he also looked at alternatives like Tcl/Tk and Perl and they weren't as appealing as Python.
- David mentioned Python had a more powerful REPL.
- Python was also open source C code, so he could easily modify it to run on the Thinking Machines CM-5 computer[2] in the physics lab.
- he wanted a language & runtime that encouraged the wider community to build more science tools.
In your opinion, what was the superior programming language that David Beazley and Jim Hugunin should have chosen in 1995 that checks all the bullet points above?
Python is more accessible in modern times than C, assembly, and Fortran.
But we are SO far past that now. My argument isn't about what should have happened in 1995; it's about the complacency that has allowed Python to become a top 1 or 2 language in 2022. It's like having proximity detectors on the back of your car while still starting the vehicle with a crank at the front. We can do better; we have the technology.
> "A program that performs a useful task can (and, arguably, should) be distributed to other scientists, who can then integrate it with their own code. Free software licenses facilitate this type of collaboration, and explicitly encourage individuals to enhance and share their programs. This flexibility and ease of collaborating allows scientists to develop software relatively quickly, so they can spend more time integrating and mining, rather than simply processing, their data."
https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...
Now there isn't any area of molecular biology and biochemistry that doesn't have a host of Python libraries available to assist researchers with tasks like designing PCR strategies or searching for nearest matches on up to x-ray crystallography of proteins.