The Open Source Data Science Masters

mailshanx · on Oct 12, 2014

Like many people here have mentioned, mastering everything can take far too long, and is completely unnecessary in practice.

It is much, much more important to master the foundations, which consist of: 1) Probability and statistics 2) Linear algebra, and 3) Programming proficiency

To be more precise here is what you would need to know in each area:

1) Probability and stats: Properties of the most common distributions (Normal, Poisson, Exponential), conditional probability, properties of expectation, Bayes rule and its applications, hypothesis testing, confidence intervals, bootstrap / jackknife, know how to compute the power of a test, be able to run Monte Carlo simulations to model real world phenomena.

2. Linear Algebra: Know enough to understand and compute Eigenvalues / Eigenvectors, SVD and various matrix factorizations.

3. Programming proficiency: Algorithms and design patterns. Its enough to know the typical undergraduate algorithms course content: trees, heaps, hash tables, various sorting algorithms, graph algorithms: shortest paths, spanning trees, max flows.

If you know this, you can easily pickup everything else on a need-to-know basis. Depending on the domain you are working in, that might entail either natural language processing, computer vision, machine learning or big-data tools like Hadoop / Spark etc. All of that is very easy to learn once you have the foundations.

ska · on Oct 12, 2014

One thing I think you are missing here is

4) Elements of numerical analysis

Under standing how matrix inversion works on paper is very different than a computer implementation. If you don't understand how floating point really works, you are bound to make fundamental errors in this kind of work, etc.

jeffreyrogers · on Oct 12, 2014

Are most people doing data analysis actually going to be implementing their own numerical algorithms though? I'd imagine most people just use whatever library is available in R, Python, or whatever other language they're using.

agibsonccc · on Oct 12, 2014

Oh yes. Numerical underflow is such a common problem when you're training machine learning algorithms. If you don't understand floating point numbers not only will you get inaccurate results, you won't be able to troubleshoot why.

Understanding the difference between time and space trade offs for floats and doubles (this is why even something as high level as numpy allows you to specify floats/doubles as a datatype) will not only allow you to trouble shoot things, but it will make your life a lot easier when debugging.

clarecorthell · on Oct 12, 2014

Of people who use these tools, only a very small percentage of them also build said tools. Which makes sense. In industry, companies want to build meaningful products and data sets that provide value which can be exchanged for money. That means delivering some majority of the value in the shortest amount of time possible. For that reason, engineering teams have little room for R&D, including the kind you're hinting at. Scikit, for example, has largely academic contributors (http://scikit-learn.org/stable/about.html). Not surprising. Talents of pedantry, proof-writing, and pushing the boundaries of theory probably lies in academia as opposed to industry. Industry rewards shipping code (useful oversimplification) and academia rewards novel theory (yet another oversimplification).

Solving problems requires understanding what solutions exist (and whether they can be used, must be built upon, could be used in ensemble, etc). Choosing among those solutions requires understanding (to some varying degree) why and how the solution solves the problem. Choosing the correct out-of-the-box solution is not trivial.

One very real danger here is unwittingly lying with statistics. Which is arguably worse than wittingly doing so.

RBerenguel · on Oct 12, 2014

It's important having at least a basic knowledge to avoid common, really stupid pitfalls that libraries can't handle out of the box but humans can (i.e. non-interpolable functions, small number division...)

ska · on Oct 13, 2014

Even if you aren't implementing things yourself (and past pretty basic usage, you'll have to do at least a little), you need to understand how these things work at so you can tell when or why the packages you are using are failing, and what to try after that. Beyond fairly basic modelling, almost none of this stuff "just works"...

geebee · on Oct 13, 2014

These are good suggestions, and you're presenting a good guide for people who need to fill in some gaps prior to "grad school". But to me, this sounds like the standard (in the US) undergraduate curriculum for the general STEM majors - Math, Physics, Operations Research, CS, various branches of engineering. Typically, a student would have done what you're describing prior to grad school.

A Masters degree is typically an opportunity to specialize a bit and have a little more space to pursue an interest.

One suggestion I'd make is to go through the python scikit-learn library, do the examples, and research the math behind the algorithms to make sure you understand what's happening in the background (maybe even implement a bit on your own). I would agree that if you have the background you described, you'll be able to do this (though it'll take some study).

http://scikit-learn.org/stable/

One other thing… I'm a little less enthusiastic about learning purely on a "need-to-know" basis, because sometimes it takes knowledge of an algorithm to recognize the opportunity to apply it. I don't think you can keep all of this loaded in "exam-ready" memory (and I really wish interviewers would understand this), but I think it's important to study and read widely. You can learn (or review) the details in the moment, but you do need broad awareness to approach data with a certain frame of mind.

upquark · on Oct 12, 2014

The list of topics at the end isn't necessarily "very easy to learn", but I do agree that fundamentals are essential.

Depending on the mathematical background of the individual, they may need calculus and general mathematical maturity / ability to solve non-trivial math problems before venturing into the data science field.

Also databases. Absolutely need this in the fundamentals, arguably more so than graph algos etc.

Edit: English

NhanH · on Oct 12, 2014

In that veins, I'd love to see a list that focused more (or even exclusively) on the fundamentals. Do you have any such recommendation?

mailshanx · on Oct 12, 2014

Expanding on my previous comment (https://news.ycombinator.com/item?id=8444705), here are some excellent, tried-and-tested resources to help you learn the fundamentals.

1. Probability and stats: For probability, work through John Tsitsiklis's MIT OCW course 6.041. There is also an archived version of the course available on EDX. Norman Matloff's book (http://heather.cs.ucdavis.edu/~matloff/132/PLN/ProbStatBook....) approaches stats from a CS practitioner's point of view (lots of sample simulation code in R), and has intuitive explanations of many different statistical concepts. For Hypothesis testing, checkout the courses Mathematical Biostatistics - 1 and 2 on Coursera.

2. Linear Algebra: Gilbert Strang's OCW lectures / course notes is hands down the best resource. But depending on your mathematical background, the lectures can be a bit long-winded.

3. Programming proficiency: For algorithms, Robert Sedgewick's algorithms course on Coursera has excellent videos. His textbook is very well written too. I really like his style of writing actual working code for each algorithm he describes (in contrast to say, CLRS): i find his code samples rather elegant and worthy of study. For learning object orientation and design patterns, the best way is to write large programs and read the book "Head First Design Patterns" as you do so.

The aim is to develop an ability to think with a statistical / algorithmic lens. Many real world phenomena are stochastic in nature. Modelling them by simulation and measuring and drawing confidence intervals around your measurements must become second nature. When confronted with an image / text processing problem, representing you data as vectors and manipulating them must be natural. While writing large programs, being aware of design considerations such as preferring composition must come naturally to you. Of course, you should be able to always recognize when a computation is O(n^2) vs O(n log(n)) vs O(2^n).

Once you are done with that, all you need to do is work through Andrew Ng's Machine Learning course and compete in a few Kaggle competitions / build some ML related side projects. Congratulations, you are now a data scientist good enough to work at most places in the tech industry!;)

_qc3o · on Oct 12, 2014

Here's the problem with stuff like this. If you actually learn all the stuff that is listed it will take you at least 3 years if not more. All the subjects listed require prolonged study if you want them to become part of your intellectual repertoire and then it takes some more time until you are comfortable applying that knowledge in non-trivial settings. On top of all that you have to stay motivated to consistently study and make progress alongside whatever else you are doing for your day job.

I program every day and I'm been studying parsers, virtual machines, and compilers for some time now and it still requires hard work to not fall back on bad habits. It takes even more effort to consistently apply techniques I learned while studying those subjects in day to day programming. There is no royal road.

clarecorthell · on Oct 12, 2014

It's worthwhile to explain why many people find the OSDSM useful.

The topics listed do indeed require depth to acquire an adequate proficiency to understand and command them. The OSDSM is most useful for people who don't know where to start, which is often the hardest part. The hardest things are often the most trivial in hindsight.

Typical interaction:

Q: I want to be a Data Scientist. What do you think I should study first?

A: Build a basic proficiency in linear algebra, programming in python, and statistics. Then take cursory classes in the subjects in the OSDSM that interest you. You'll figure out what depth is most meaningful to you from there.

Q: What does a Data Scientist actually need to know?

A: Totally and completely depends on what you want to do. There are people who crunch click logs all day, people who comb the tendrils of search algorithms, and yet others who seek terrorists and criminals in statistical signs among the bits and bytes. See my last answer for relevant guidance.

e.g. https://medium.com/@clarecorthell/the-brief-multi-tweet-enum...

_qc3o · on Oct 12, 2014

In that case I have some feedback. As a beginner I still don't know where to start after reading your post. There are 101* bullet points in your post distributed among IDEs, books, online courses, libraries (the programming kind not the book kind), programming languages, pure math, applied math, database theory, machine learning, natural language processing, visualization, etc. and there is no obvious ordering on how hard or easy things are because the very first link in the math section is another list of bullets tackling some heavy-duty mathematics and I say this as someone who studied mathematics at a graduate school level.

Expanding mailshanx's answer will be much more helpful to beginners than the current 101 bullet points.

*I just counted the bullet points with document.querySelectorAll('li') so there are some false positives.

clarecorthell · on Oct 12, 2014

Start here: http://bit.ly/uwintrodatascience. It's the first bullet.

datashaman · on Oct 12, 2014

Doing something is more productive than doing nothing, regardless of how easy or hard it is.

_qc3o · on Oct 12, 2014

I'm not discouraging you from taking action. My point was the current document is not beginner friendly if that is the intent because it is much too exhaustive for a beginner and is more likely to discourage than encourage them to continue as soon as they hit the first roadblock.

clarecorthell · on Oct 12, 2014

Deterrent roadblocks are a huge problem. In fact, it's the primary reason education can't happen solely through a book or video lecture series. Students need mentorship to gain motivation and remove roadblocks. Teachers, TAs, and cohorts fill this need in the university setting. It makes sense that when you learn something new, you face more unknown unknowns than known unknowns. It's nearly impossible to ask questions about the former because you don't know how to begin formulating a question, while the latter is googleable and likely solvable.

The programmers' standard roadblock remover? google -> "stackoverflow" + problem

dantiberian · on Oct 12, 2014

Here's my personal list for those interested, inspired by the Data Science Masters: http://danielcompton.net/2014/03/23/data-science-curriculum

huherto · on Oct 12, 2014

The field of Data Science is so wide and there are many techniques. How do you decide on what to specialize since you don't know which techniques are going to be needed to solve your next problem?

clarecorthell · on Oct 12, 2014

The truthful meta-answer which you probably don't want to hear: We only have a statistical likelihood (of dubious confidence) of what will transpire in the future. We have no knowledge of the future. None. Same is true of your knowledge of your future problems. Use some basic bayesian logic and guess in an educated manner, like with most prediction problems.

More practical advices: Find out what problems you're interested in working on, then google, read books, engage people working on those problems to tell you more about how they solve them. Another great place to start is to look at your business' biggest inefficiencies and informational gaps, and determine which of them could potentially be addressed with prediction or statistical inference. This used to be called Business Intelligence. Even a coffee shop can benefit from understanding simple seasonality.

akbar501 · on Oct 12, 2014

The list of Python libs is very helpful. Thanks for putting this together.

clarecorthell · on Oct 12, 2014

I'm curious about the audience here -

What career goal is driving your interest in learning Data Science?

KrisAndrew · on Oct 12, 2014

It's more of a response to the market. Beginning about 5 years ago I was increasingly asked to do more numerically oriented things. Prior to that I was mostly writing applications that generated SQL and wrapped the results in some HTML. Pretty boring. Data science is more compelling.

Over the past 15-20 years there has been a massive amount of information piling up in databases and log files; not just from web applications but from desktop and mobile apps too. And there are companies who want to pan for gold in that data. So if you want to do something more interesting than fiddle with canvases, or CSS or MVC frameworks, then data science is fairly accessible if you're not afraid of math. Furthermore, most companies will have a need for it even if you don't really care to develop their software products directly.

NB: I doubled my salary by moving into data science. Nowadays gas station attendants can write a Rails app to search a database. The bar has been lowered. Understanding stochastic gradient descent (among other things) and knowing where/when to use it commands more earning power.

dasboth · on Oct 13, 2014

"I was mostly writing applications that generated SQL and wrapped the results in some HTML. Pretty boring." - This is basically why I'm interested in making the move (eventually). Earning power may turn out to be a perk but the main motivation is something that is perpetually stimulating.

shire · on Oct 12, 2014

how does someone who has a responsibility and dependents keep up with the time to learn all this stuff? time is of the essence.

clarecorthell · on Oct 12, 2014

I took six months off for the OSDSM. I am very aware that it was a luxury to do so. I had to take loans, move out of my apartment, and I studied 10 hours per day.

If you weren't working 10 hours per day, 6 days per week for six months (1440 hrs), and you took 6 hours per week to do the same, it would take you more than 4 years to finish a similar curriculum.

Managing your time is still the hardest part of self-study. It always will be. Most people pay institutions to structure their lives with a workload, deadlines, time off, expectations, and consequences (positive and negative). Having the time for self-study is a luxury few people have; most people won't have the time, money, and opportunity to give up for a "classic liberal education," and by extension they won't have time for self-study, either (one such conversation on the topic: http://www.newyorker.com/culture/culture-desk/loud-nathan-he...). That's part of why new forms of education like The OSDSM are so necessary. Such curriculums fit exceptional cases that are less and less the exception.

jeffreyrogers · on Oct 12, 2014

If you constrain a problem enough it eventually becomes impossible. There is no royal road to learning hard things, you just have to put in the time.

SpaceManNabs · on Oct 13, 2014

It is actually not too bad. I picked up all of these things in my undergraduate studies and some side classes.

afdfdfdfd · on Oct 12, 2014

Is there something similar but for a CS degree?

clarecorthell · on Oct 12, 2014

Try CodeAcademy and CS106a (how I learned programming at Stanford) https://www.udemy.com/cs-106a-programming-methodology

I've started cataloging some of the best beginning resources here. Main benefit is that people can battle out what the best resources are in PRs. https://github.com/datasciencemasters/go/blob/master/basic-p...

rkda · on Oct 12, 2014

You can check out Bottom Up CS

http://www.bottomupcs.com/

upquark · on Oct 12, 2014

Thank you for putting this together, looks like a great collection of resources.

misiti3780 · on Oct 12, 2014

no ggplot under visualization - what's up with that ?

clarecorthell · on Oct 12, 2014

You'll find R resources here: https://github.com/datasciencemasters/go/blob/master/r-resou...

The OSDSM maintains a focus on python resources.

rkda · on Oct 12, 2014

It's open source. Time for a pull request? ;-)

https://github.com/datasciencemasters/go