Hacker News
Getting started in data science (treycausey.com)
70 points by treycausey on June 8, 2014 | hide | past | favorite | 32 comments



While these are all reasonable points, his resume doesn't actually list any industry experience. There are certainly a lot of data analysis skills that can be learned at academic institutions.

I would argue, however, that a skill often missing from these lists of 'what you need to become a data scientist' is time spent at a real private-sector company. In my experience, there are quite a few differences between writing academic papers and producing the terse, actionable information that is useful for a profit-driven company.


Trey is a data scientist at Zulily. Think he's been there for a few years now.


This is actually a very timely article to pop up for me. I'm currently a Sales Analyst at a big Telco in Canada. I landed the job with no previous analyst experience, but years of programming behind me. I never had any formal programming education; everything I know comes from years of experience. Because of this, I'm fairly confident that I can get anything done, but I have no schooling to support me. Don't get me wrong, I don't consider myself a highly skilled or knowledgeable programmer, but I know how to break down complex tasks and produce results. I've been writing functional PHP my whole life, since I'm unable to focus on learning a new language long enough to actually do so (and because of this, I have lots of self-doubt about whether I'd actually be able to learn a new one if I wanted to).

I consider my 'Sales Analyst' role an intro to data science. Essentially, we're collecting all types of data from different sources and producing reports on them. How many products were sold? From what channels? From which salespeople on which teams? How many customers did they talk to? How many of their booked installations failed? How many were successes? What's our market penetration like? Where are we seeing trends in disconnections or connections? Why? We're dealing with millions of records, from half a dozen data sources, often with no unique key matching the data up.

It's a very interesting job, but without any previous experience in the field it's all been self-taught. I've looked into a few data science programs and books, but I've felt that everything I see and read is so far above me that I can't even get started on the books or courses. My math skills are pretty basic, but that's never stopped me before, as I've always been able to find the results that I'm looking for. I know that with the proper knowledge and experience I'd be able to land a data science job and get a 30-50K raise at the same time, but I just don't have anything on my resume to qualify me. Nor do I know what an -actual- data scientist does, so I lack the confidence to even pursue any openings I see around. I'm getting married next year, have a seven-year-old daughter, and have bills to pay like anybody else, so the job security I have in my current role definitely acts as a deterrent from taking a leap into the unknown as well.


>I consider my 'Sales Analyst' role an intro to data science. Essentially, we're collecting all types of data from different sources and producing reports on them.

>I know ... I'd be able to land a data science job ... but just don't have anything on my resume to qualify me.

Who uses the reports you produce? What decisions do they make as a result? How often do they make these decisions? What is the impact?

If you want to build on your current role as a way to become a data scientist, in your current company or elsewhere, perhaps you can look at ways to:

- Generate recommendations based on the reports you've designed

- Automate some decisions that are being made manually now

- Figure out a way to influence additional teams who could/should use the data you're analysing

In short: figure out a way to generate additional $$$ impact, or improve/automate an existing process that currently requires manual effort, so that those people can focus on work that cannot easily be automated.


Pretending to be a scientist without doing the academic work means you are essentially a trade worker. Autodidact? Fine, but you'd better have put yourself through the equivalent of at least a master's or PhD in statistics along the way.

We need vocational education to let semiskilled workers use statistical modeling tools in an informed way, but why do we then call them "scientists"? They are technicians.


In the US it's perfectly alright to call yourself an engineer or scientist. You're welcome in professional organizations like IEEE and ACM, and you may submit articles for journal publication and attend or even present at conferences. As long as you don't try to fashion your own PE or PhD out of nothing, all the rest is fine.


The problem here is, IMO, a semantic one.

"Data scientist" is giving people the wrong impression. What this demand really reflects is that companies now have a lot of data, because everything is digitised, and they need people to do something useful with it.

The actual demand for data-related work relates to data science the way the actual demand for computer-related work relates to computer science: statisticians, analysts, database engineers.


>And certainly not quickly enough to be qualified to get a job as a data scientist before the data scientist salary market comes crashing back down to earth.

Do you think it's worth it to pursue if you're legitimately interested (as opposed to primarily attracted to the $$$)?


Great resources; I was just about to buy a book on Linear Algebra & ML before I read this.



What are people using Linear Algebra for in Data Science? Aside from the stock example of representing words as N-dimensional vectors, I mean?

I ask because I do this kind of work as my J.O.B. and every skill he refers to I totally see the necessity of except this one.


In pretty much any statistical model with more than a handful of parameters (and almost all machine learning models do require more than a handful of parameters!), those parameters are represented as vectors or matrices. Linear algebra (and multivariate calculus) then become very important for reasoning about those parameters, fitting the models, making predictions and so on.

The multivariate Gaussian distribution is a great example of this. It's probably the most fundamental and important distribution in statistics, and working with these distributions is pretty much pure linear algebra -- quadratic forms over vectors of parameters, eigendecompositions of covariance matrices etc.
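To make that concrete, here is a minimal numpy sketch (the function name and values are illustrative) showing that evaluating a multivariate Gaussian log-density really is pure linear algebra: a Cholesky factorization, a triangular solve for the quadratic form, and a log-determinant read off the factor's diagonal.

```python
import numpy as np

def gaussian_logpdf(x, mu, cov):
    """Log-density of a multivariate Gaussian, computed with
    linear algebra only (no explicit matrix inverse)."""
    L = np.linalg.cholesky(cov)                # cov = L @ L.T, L lower triangular
    diff = x - mu
    z = np.linalg.solve(L, diff)               # solve L z = diff
    quad = z @ z                               # (x-mu)^T cov^{-1} (x-mu)
    logdet = 2.0 * np.log(np.diag(L)).sum()    # log|cov| from the Cholesky factor
    k = len(mu)
    return -0.5 * (quad + logdet + k * np.log(2 * np.pi))

mu = np.zeros(2)
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
print(gaussian_logpdf(np.array([0.1, -0.2]), mu, cov))
```

At the mean the quadratic form vanishes, so the density there is determined entirely by the determinant of the covariance matrix.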

Even for non-statistically-motivated data mining: any time you're optimising over a lot of parameters, it's likely that linear algebra (and, as before, multivariate calculus) will help. Linear algebra is as important to calculus over multiple variables as plain old high-school algebra is to plain old univariate calculus.


This has been the most common question I've gotten. I'd say it's partially my bias as someone who works on recommender systems a lot. Many, if not most, common statistical problems have a compact matrix representation (e.g., systems of linear equations). Finally, there are everyday matrix decomposition tasks like SVD.
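As an illustration of the recommender-system angle, a truncated SVD gives a low-rank approximation of a ratings matrix. The data here is made up purely for demonstration:

```python
import numpy as np

# Hypothetical 4-user x 3-item ratings matrix
R = np.array([[5., 4., 1.],
              [4., 5., 1.],
              [1., 1., 5.],
              [2., 1., 4.]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2  # keep the top-2 latent factors
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_hat, 2))
```

By the Eckart-Young theorem, this truncation is the best rank-2 approximation of R in the Frobenius norm, and the approximation error is exactly the discarded singular value.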


Thanks for the response. That gives me something new to learn/look into.


Linear algebra is necessary for understanding linear regression and most clustering/prediction models on a mathematical level. The truth is, though, you can do data science without understanding the underlying math insofar as you do understand your objectives and the meaning of the conclusions you draw.


https://news.ycombinator.com/item?id=4635274

http://aix1.uottawa.ca/~jkhoury/app.htm

The linear algebra text by Anton (9th and 10th eds) has a huge section on applications of LA. Also this book does same for ODE's http://www.amazon.com/Topics-Mathematical-Modeling-K-Tung/dp...


I am actually _very_ puzzled by this comment, because it's the polar opposite of the point of view I would have expected. In fact, I can think of very few data mining and machine learning algorithms where linear algebra does not play a role.

Representing features of a datapoint as a vector pervades and populates every pore of this field. Without an understanding of linear algebra there would be no support vector machines, no kernel methods, no neural networks, no perceptrons, no gradient descent methods, no Newton / Quasi-Newton methods, no multi-dimensional (or as they say in statistics, multivariate) Gaussian random variables, no matrix factorization, no PageRank, no Markov chains; this list can go on and on.

Take the simplest of data science problems: you have one variable x and another variable y and you want to predict the value of y given x. Usually x is not a single scalar but n scalars (called a feature vector). The simplest thing you can do here is least squares, and that is as linear algebraic as you can get. There are many fancy ways of dealing with this problem, but almost always it is reduced to solving a related linear system.
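That least-squares reduction is a one-liner in numpy; here is a small sketch on synthetic data (the coefficients and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                  # feature vectors, one row per datapoint
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)    # noisy linear response

# Least squares is pure linear algebra: it solves the normal equations
# (X^T X) w = X^T y, here via the numerically safer lstsq routine.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # close to true_w
```

The "fancy" methods mentioned above (ridge, iteratively reweighted least squares, etc.) all end up solving variants of this same linear system.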

The bottom line is this: we understand very few things. Thankfully linear algebra is one of the few things that we do understand, so almost every analytical problem is reduced to this case (if only locally) and then solved.

I would be very curious to know how you have been able to avoid linear algebra. It will give me a new and valuable perspective, because apart from "click button, didn't work? ok, click the next button" data analysis, I find it hard to see how one can do much data analysis without it. So please break my bubble; I will be thankful for it.

Canned packages often do not work out of the box. Knowledge of linear algebra comes in very handy when analyzing and debugging why the model is not working: "oh, I see this matrix is near singular, that's why my estimates are off the park", or "oh, these two variables are very correlated, that is why gradient descent is having so much trouble converging fast", or "ah, I see why I am getting NaN here", etc.
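The near-singularity diagnosis above is a one-line check once you know to look for it. A sketch with deliberately collinear synthetic data (the variables and noise scale are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 1e-6 * rng.normal(size=100)   # nearly identical to x1: strong collinearity
X = np.column_stack([x1, x2])

# A huge condition number of X^T X is the linear-algebraic signature of
# collinearity: wild coefficient estimates, slow gradient-descent convergence.
cond = np.linalg.cond(X.T @ X)
print(f"condition number: {cond:.2e}")
```

A well-conditioned design matrix has a condition number near 1; here it blows up by many orders of magnitude, which is exactly the "near singular" symptom described above.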

EDIT: darkxanthos, appreciate your comment. I would say it is a bit like driving. Knowing the internal mechanics is neither necessary nor sufficient, and hardly correlated with good driving skills when things are going well. But sometimes when things are not going as expected, it helps in debugging.

Let me try and pique your interest. Note that the decision boundary of naive Bayes is actually a linear function of the log conditional probabilities, considered all independent; with LA you can now also handle the case where they have dependence. Consider updating multi-armed bandit problems: the updates are variants of gradient descent, and their nature is indeed characterized by the eigenvalues of the Hessian of the thing you want to optimize. Consider K-means clustering: one way to get very close to its global optimum is to solve the same cost function using linear algebraic updates (called spectral graph partitioning). By trig I think you have the dot product of two vectors in mind; the related analysis actually does not rely much on trigonometric properties but heavily on linear algebraic properties. In fact, this is what allows one to escalate affairs from simple linear feature vectors to extremely non-linear ones, because even though they are nonlinear in the data space, in some other space they are linear, so people do the math in that space (called the kernel trick, although I find that term quite silly). This thing, linear algebra, lurks everywhere, I tell you :)
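The naive Bayes point is easy to verify numerically: for binary features, the log-odds of the class decomposes exactly into a bias plus a dot product of a weight vector with the feature vector. The probabilities below are hypothetical, chosen just to demonstrate the identity:

```python
import numpy as np

# Binary-feature naive Bayes: log P(y=1|x)/P(y=0|x) = b + w . x
p1 = np.array([0.8, 0.3, 0.6])   # P(x_j = 1 | y = 1), hypothetical
p0 = np.array([0.2, 0.5, 0.4])   # P(x_j = 1 | y = 0), hypothetical
prior1, prior0 = 0.5, 0.5

# Linear form: weights are differences of log-odds of the feature probabilities
w = np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))
b = np.log(prior1 / prior0) + np.log((1 - p1) / (1 - p0)).sum()

def nb_logodds_direct(x):
    """Naive Bayes log-odds computed the 'probabilistic' way."""
    l1 = np.log(prior1) + (x * np.log(p1) + (1 - x) * np.log(1 - p1)).sum()
    l0 = np.log(prior0) + (x * np.log(p0) + (1 - x) * np.log(1 - p0)).sum()
    return l1 - l0

x = np.array([1, 0, 1])
print(nb_logodds_direct(x), b + w @ x)  # the two agree: the boundary is linear
```

So even a "purely probabilistic" classifier is secretly a hyperplane in feature space, which is precisely the door through which kernel methods and their linear algebra walk in.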


Thanks for this comment. Just to be clear I am not saying it isn't necessary but am genuinely curious what I'm missing. I think a big influence on my comment is using ideas that touch on linear algebra and possibly even doing the computations one might end up doing via linear algebra but without knowing it.

For example: least squares regression. I totally use this. I even took a semester in college on just regression. The linear algebra underpinnings, though, have never been shown to me except for a quick blurb in my linear algebra textbook. I still understand the concepts of fitting a model and when it's a bad fit (such as a non-normal distribution of residuals, or collinearity), but the theoretical underpinnings are more fuzzy to me.

Representing features as vectors, sure. But that's also a pretty superficial use of linear algebra since from that point forward I'm using something on the trig side to compute results (at least in clustering).

I also tend to lean rather heavily on probability and Bayesian approaches in many areas. So Naive Bayes classification is a love of mine, and finding ideal parameter values given incoming data becomes an online-updating multi-armed bandit problem to me (which also doesn't require explicit linear algebra). A lot of my work is also in experiment design and analysis, and for this I use a mixture of Bayesian and frequentist statistical testing.

I use canned packages out of the box, with parameters to tweak that I can cross-validate to evaluate how well my model is working. If I happen to venture out to other models, I'm probably reading up on common pitfalls and how to test for them.

To me, it's entirely possible that the gap between you and me is due to experience, and even just differences in training/learning (including but not limited to the quantity of it). These discussions are important for me since they help to inform my future learning aspirations.


I love the driving analogy, it's quite fitting. Although I would argue that the analogy isn't complete. Knowing about mechanics would be very relevant if cars were not mass produced and you (or your small team) were building / modifying your own vehicle. Essentially, I think that most algos are not "off the shelf" standard yet, and they require a lot of building / fiddling.

In this scenario, only knowing about mechanics still wouldn't make you a great driver, but it would help you design a vehicle that works for you, and help identify vehicles that might blow up with you inside!

That having been said I use ML algos, and I suck at linear algebra. I often wish that my linear algebra classes hadn't been so dreadfully stale and boring.


"The reason I'm skeptical is because I believe in the science portion of our field's name. One of the primary things that separates a data scientist from someone just building models is the ability to think carefully about things like endogeneity, causal inference, and experimental and quasi-experimental design."

What exactly is a 'data scientist'? Shouldn't scientists be the ones analyzing their own data, instead of 'data scientists'?


My reply to this would be that everyone has to do some sort of statistics and modeling in their studies.

Data science tends to be more about having good software engineering skills, understanding how to interface with production systems to pull data out, and having enough modeling experience (almost specializing in it) to be able to make inferences about how different kinds of business activities affect revenue or other parts of the business.

You can get by, and even grow into a data science role, as a statistician or software engineer (though the latter probably leads toward data engineering more than data science).

Source: students of mine get hired by companies like Facebook[1].

So, to summarize: data scientists get hired at companies to focus on modeling, data quality, data advocacy, and assisting product roles.

Edit:

[1]: http://zipfianacademy.com/


The first chapter of "Practical Data Science with R" answers exactly this. [1] In summary, you're designing and running experiments to help a business, so you're generally working for a non-scientist.

[1] http://www.manning.com/zumel/PDSwR_CH08.pdf


You'll need a Mac, some thick rimmed glasses, and an unshakeable belief that what normal people have been doing for 20 years with 2 clicks in Excel can in fact only be done on a Hadoop cluster "in the clouds".


In practice I find the bigger problem is analysts/actuaries/statisticians who have a disdain for programming, which is sometimes viewed as a task for mere technicians.

Typically your Excel model/analysis has not even solved half the problem of a data science system. It needs to be repeatable, it needs to be open to change (source control!), and it needs to integrate with the wider system.

These things need to be considered upfront. There are plenty of reasonable software tools for this. Yes, Hadoop shouldn't be your first step, but taking 5 minutes to put something on a server in EC2 (omg, the cloud) is not unreasonable.

There is a swallowing abyss between Excel and production. That is where data science projects die; it's a shame.


I've never met a statistician who either uses Excel or has a "disdain for programming". R or Matlab are basic tools of the trade.


I talk to a lot of people who've had trouble with "data scientists" who are strong in statistics and know some Matlab or R or something like that, but know nothing about the craftsmanship of programming.

By that I mean skills like using version control, writing software that is maintainable, working with a team that uses project management software, things like that.

A common kind of workflow is that a data scientist develops an algorithm and makes tweaks to it, and that this gets baked into a production system.

If the data scientist throws something over the wall and it takes the developers a few weeks to get it ready for real use, the "real time" productivity of the team is going to be awful. The closer we come to the data scientist checking the changes in and that's that, the more valuable the data scientist is.


This is absolutely a fair comment, coders but not software engineers, and is the same problem that's permeated bioinformatics for the last decade or so. (As an aside, it's fun hearing grand claims about data science revolutionising medicine in 10 years [0], when the same claims were made about bioinformatics 10 years ago.)

[0] https://twitter.com/HanChenNZ/status/473825783874859008


R and Matlab are better, but those tools also have issues integrating into production, depending on what you are doing. It's not so much the exact tool you use, but just having a little forethought about how your creation is going to interact with a production system.

A lot of people feel programming is undervalued in academia. For instance, Hadley Wickham, creator of ggplot2, probably hasn't gotten the recognition he deserves. With a prevailing attitude like that, is it any wonder academic code has such a poor reputation?

Wickham notes that he thinks the tide is changing. I agree that it is, as part of the data science phenomenon. As part of the change you are going to see a few more MacBooks, some cloud servers, maybe a guy with glasses talking about version control and software design. It is not all garbage; I hope you keep an open mind.

Q:Do you feel that the academic culture has caught up with and supports non-traditional academic contributions (e.g. R packages instead of papers)?

A:It’s hard to tell. I think it’s getting better, but it’s still hard to get recognition that software development is an intellectual activity in the same way that developing a new mathematical theorem is.[1]

1. http://simplystatistics.org/2012/05/11/ha/


Integrating into production is a huge biggie. I hope to spend a lot of my time this summer sharing / educating folks about some tech I've built for putting interesting analytics into production.


I use Matlab.

Then I use Excel Link to send everything in Matlab to Excel.


You can use Excel for more than many people think. For some great examples, and with only a single chapter at the end devoted to R, see Data Smart, by MailChimp's Chief Data Scientist:

http://www.amazon.com/Data-Smart-Science-Transform-Informati...


I've just read through the sample chapter (ch1) on the book's page and found the style entertaining. Just Excel basics, but obviously laying foundations for later chapters. I found I could duplicate the results in LibreOffice ok as well (pivot tables close enough, solver works the same way, conditional formatting clunky and limited to three ranges).

I've ordered the book.



