
Dear god, so many antipatterns on a single page.

Interleaving configuration management with the business logic code is massive technical debt waiting to engulf future generations in a world of maintenance pain.



The whole Data Science / Jupyter ecosystem reminds me so much of the old days of "MATLAB to C++": the Python ecosystem was supposed to be the best of both worlds and ease the transition from prototyping to production. Laziness won once again.

Data scientists need to be trained with software engineering skills, software engineers need to be trained with data science skills. This is the only way we don't end up with nonsense like this.


> Data scientists need to be trained with software engineering skills, software engineers need to be trained with data science skills.

It's a nice idea, but it turns out to be a pretty big ask. Particularly at my employer, where a large proportion of new hires come straight out of college. Data scientists have usually studied something like economics, math, statistics, or physics; most of them haven't been introduced to software engineering at all. We try to bring new hires up to speed, but there's only so much we can do with a series of relatively short sessions on Python and git.

Similarly, software engineers don't necessarily have the requisite background to understand the kind of work data scientists do. They'll have had a few semesters of calculus but it's likely they won't have had much if any exposure to data analysis or machine learning. They might not have even had a stats course in college. Further, in my experience they have had little inclination to understand how data scientists work, nor how their software products may or may not fit data scientists' needs.

Opining for a moment here ...

I've had the privilege of working for a few years in a position that kind of straddles the line between data scientist and software engineer (though I was technically a data scientist), and part of that job was mentorship and training. Getting good code out of data scientists and software engineers can be tough. I've seen nearly as much messy, uncommented, unformatted, unoptimized code from engineers as I have from data scientists; it's just that when I make recommendations to data scientists, they'll actually listen to me.

I'm just lucky engineers started finally using the internal libraries I maintain rather than their own questionable alternatives (though if I never have the "why aren't you pinning exact dependencies for your library? My code broke!" discussion again it'll be too soon.)
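
(For the curious: the crux of that discussion is that a library should declare compatible ranges and leave exact pinning to the application. A rough sketch, with a made-up package name:)

    # setup.py of a shared library: declare ranges, don't pin
    from setuptools import setup

    setup(
        name="internal-lib",        # hypothetical library
        install_requires=[
            "numpy>=1.21,<2.0",     # compatible range
            "pandas>=1.3",          # floor only
        ],
    )

Exact pins like "numpy==1.21.4" belong in the application's lockfile; if two libraries each pin conflicting exact versions, they can never be installed together.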


Would you have recommendations / pointers for this?

I straddle both worlds - I’m much more on the tech side, but sometimes interact with scientists / MATLAB codebases.

What data science skills / methodologies would be useful for me to learn?

P.S. And what did you mean by MATLAB to C++? That was a specific time frame (I suppose in the early 2000s ish) when C++ was taught to scientists in the hope they’d be able to productionize their MATLAB code? With not great results (i.e. C++ learning curve + lack of software engineering skills…?) Thanks!


From my experience in the fields of DSP / Data / AI during the last 10 years, issues arise when product teams are segregated into jobs (one guy for initial prototype, then one guy to prep the integration, then one guy for the ops side of things, etc.): people need to be interested and involved in the product they are building end-to-end! Yes this is more demanding, yes this requires perpetual training, but gosh it is rewarding!

My take (non-exhaustive) with the current ecosystem is to apply Agile and DevOps methodologies:

- Use Git everywhere, all the time, always

- Use Jupyter early on: great for quick prototypes & demos, keynotes, training material, articles

- Once the initial prototype is approved, archive Jupyter notebooks as snapshots

- Write functional tests, ideally in a TDD fashion (see the sketch below)

- Build and/or integrate the work into a real software product, be it in Python, C++, Java, etc.

- Use tools for deterministic behavior (package manager, Docker, etc.)

- Use CI/CD and GitOps methodologies

- Deliver iteratively, fast and reliably
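
To make the testing bullet concrete, here's a minimal pytest sketch for a function extracted from a notebook (the module and column names are invented):

    # test_features.py
    import pandas as pd
    from features import build_features   # hypothetical module extracted from the notebook

    def test_build_features_fills_missing_age():
        raw = pd.DataFrame({"age": [42.0, None], "fare": [7.25, 8.05]})
        out = build_features(raw)
        assert out["age"].notna().all()   # imputation actually happened
        assert len(out) == len(raw)       # no rows silently dropped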

And by "MATLAB to C++" was a reference to a time (2010's) when corporation were deeply involved with MATLAB licenses, could not afford to switch easily to Python and lots of SWE with applied math background had to deal with MATLAB code written by pure math guys without any consideration for SWE best practices and target product constraints. Nowadays, if the target product is also in Python, there is way less friction, hopefully :)


What’s your recommendation in terms of tooling for cases where it’s not just prototype -> production, but an iterative process? I love notebooks for prototyping, but I find it’s a lot of work to make sure notebook code and prod code are in sync. Maybe just debugging with IPython?


When you've "productionized" a part of your notebook into a Python module, refactor the notebook to use the module instead. Usually the notebook code shrinks by 80%, and what remains becomes model documentation and demo material.
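
Concretely, the first cell of the slimmed-down notebook might look like this (module names are made up):

    # Reload the extracted module on every cell execution, so edits to
    # the .py file are picked up without restarting the kernel.
    %load_ext autoreload
    %autoreload 2

    from myproject.model import train, evaluate   # hypothetical module

    model = train("data/train.csv")
    evaluate(model, "data/test.csv")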


Yeah, that’s basically what I do, but I often find I need to play around with intermediary data within functions.


I create my own classes for this. (Essentially to do the same thing as sklearn pipelines, but I like creating my own classes precisely for this debug-and-slowly-expand-functionality reason.) Something like:

    class mymodel:
        def feature_engineering(self):
            ...
        def impute_missing(self):
            ...
        def fit(self):
            ...
        def predict_proba(self):
            ...

Then it is pretty trivial to test with new data. And you can parameterize the things you want, e.g. init fit method as random forest or xgboost, or stack on different feature engineering, etc. And for debugging you can extract/step through the individual methods.
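
Fleshed out a bit, a sketch of that pattern (the estimator is injected at init time; the feature columns are placeholders):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    class mymodel:
        def __init__(self, estimator=None):
            # inject the fit method: random forest by default, xgboost, etc.
            self.estimator = estimator or RandomForestClassifier()

        def feature_engineering(self, X):
            return X.assign(fare_per_head=X["fare"] / (X["family"] + 1))  # made-up feature

        def impute_missing(self, X):
            return X.fillna(X.median(numeric_only=True))

        def fit(self, X, y):
            X = self.impute_missing(self.feature_engineering(X))
            self.estimator.fit(X, y)
            return self

        def predict_proba(self, X):
            X = self.impute_missing(self.feature_engineering(X))
            return self.estimator.predict_proba(X)

Each method can be called on its own against new data, which is what makes the step-through debugging workable.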


This is a blind guess here, but if you need to inspect the inner data of your function after writing it, it might mean the scope is too broad and your function could be split?

This is where standard SWE guidelines could help (function interfaces, contract definitions, etc.).
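
For instance, instead of one function whose intermediate data is only visible in a debugger, split the stages so every intermediate result is a return value you can poke at in a notebook (names and rules here are illustrative):

    import numpy as np
    import pandas as pd

    def load_raw(path: str) -> pd.DataFrame:
        return pd.read_csv(path)

    def clean(raw: pd.DataFrame) -> pd.DataFrame:
        return raw.dropna(subset=["id"])                     # placeholder rule

    def add_features(df: pd.DataFrame) -> pd.DataFrame:
        return df.assign(log_amount=np.log1p(df["amount"]))  # placeholder feature

    def run_pipeline(path: str) -> pd.DataFrame:
        return add_features(clean(load_raw(path)))

In a notebook you can then stop wherever you like: raw = load_raw("data.csv"); clean(raw).head().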


> Data scientists need to be trained with software engineering skills, software engineers need to be trained with data science skills

That's kind of like saying the solution to liability issues arising in the practice of medicine is for physicians to learn lawyer skills and lawyers to learn physician skills.

It's a great idea, if you ignore the costs to get people trained and the narrowing of the pool for each affected profession.

Heck, it's hard enough to get software engineers who work almost entirely on systems where a major part of the lifting is done by an RDBMS to learn database skills.


Yeah training data scientists seems like the answer but in reality it’s just not feasible most of the time. Data science is really hard, and good engineering is really hard. Very few people can do both well.


Scientific code in the Python ecosystem is horrible in general. Architecture astronauts, stack overflows from recursion, bloat, truncating int64 in casts to double, version-incompatibility messes from chasing the latest features at all costs. I have seen it all. They treat Python as if it were a sound language with the guarantees of Haskell.
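
The int64 one is easy to demonstrate, for anyone who hasn't been bitten:

    import numpy as np

    x = np.int64(2**53 + 1)    # exactly representable as a 64-bit integer
    y = np.float64(x)          # float64 only has a 53-bit significand
    print(int(y) == int(x))    # False -- the cast silently rounds off the low bit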


Are there good examples (established libraries or projects) of scientific code in Python? And more broadly, would you have examples of what good scientific code could/does look like?

Not being facetious! I’m genuinely curious and would love to learn more. Thanks


https://github.com/scverse/scanpy

This is incredibly popular in single-cell analysis.


The SciPy project is a good example, with code from many scientific domains.


What other antipatterns are there? I can understand the criticism of having configuration code with business logic, and think there should be better commenting/docstrings.

But other than that, I thought it was fine. The use of type hints is pretty awesome, in particular. Do you guys just not like higher-order functions?


Type hints in that code are erasing the actual signature of the function, so you won't get type checks on the arguments. For @redirect, you'd be better served by using the logging module.
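
Roughly what the logging version looks like, with a stand-in function (not the article's code): the function just logs, and where the output goes is decided once by handler configuration, not by wrapping callers.

    import logging

    logger = logging.getLogger(__name__)

    def train_model(data):    # stand-in, not from the article
        logger.info("training on %d rows", len(data))

    # In the entrypoint, route output per environment:
    logging.basicConfig(filename="train.log", level=logging.INFO)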

Higher-order functions are OK, but the problem here is what they're using them for. Code that behaves differently based on the environment, without any explicit warning for the caller? That's fairly dangerous.


> Type hints in that code are erasing the actual signature of the function

True enough. For the reader of these comments, it's possible (but non-obvious) to properly respect type hints in a decorator. See here: https://mypy.readthedocs.io/en/stable/generics.html#declarin...
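
For the impatient, the linked approach boils down to ParamSpec (in typing since Python 3.10). A generic sketch, not the article's @redirect:

    import functools
    from typing import Callable, ParamSpec, TypeVar

    P = ParamSpec("P")
    R = TypeVar("R")

    def logged(func: Callable[P, R]) -> Callable[P, R]:
        @functools.wraps(func)
        def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
            print(f"calling {func.__name__}")
            return func(*args, **kwargs)
        return wrapper

    @logged
    def add(a: int, b: int) -> int:
        return a + b

    # add(1, "2")  -> mypy flags this; with a bare Callable decorator it would not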


I'm so glad I'm not the only one. I was reading the code thinking this seems like a different language to the one I use (albeit I'm not too experienced).

That seems obtuse to me in the extreme. Maybe I'm just not used to that style however.


It's example code in a blog post.


Unironically that means it will end up in hundreds if not thousands of production codebases.


I'm not sure it's framed like that; it's specifically talking about production patterns.


'production' for many kinds of data analysis is not the same kind of production as for, say, a web app or service. You can have a production environment for running ad hoc jobs.


That seems a little unlikely since implementing business logic is not typically what a data scientist does.


Not as far as the data scientist in question is concerned, anyway.

I'm reminded of that joke:

Business logic (n): your stuff, which is shit, as opposed to my code, which is beautiful

I say this as a machine learning engineer



