Foundations of Data Science [pdf]

snippyhollow · on Oct 22, 2014

The second chapter, "High-Dimensional Space", talks about the problem of spikey spheres[0] (how most of the mass is near the surface). I made an IPython notebook to illustrate it[1].

[0] http://www.penzba.co.uk/cgi-bin/PvsNP.py?SpikeySpheres

[1] http://nbviewer.ipython.org/urls/gist.github.com/SnippyHollo...
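A minimal sketch of the "mass near the surface" effect, not taken from the linked notebook. It relies only on the fact that the volume of a d-dimensional ball scales as r^d, so the fraction of a unit ball's volume within eps of the surface is 1 - (1 - eps)^d. The function name `shell_fraction` and the default eps = 0.05 are assumptions chosen for illustration.

```python
def shell_fraction(d: int, eps: float = 0.05) -> float:
    """Fraction of a unit d-ball's volume lying within eps of its surface.

    Volume scales as r**d, so the inner ball of radius (1 - eps) holds
    (1 - eps)**d of the total; the outer shell holds the rest.
    """
    if d < 1:
        raise ValueError("dimension d must be at least 1")
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    return 1.0 - (1.0 - eps) ** d


if __name__ == "__main__":
    # Illustrative dimensions only; any positive integers would do.
    for d in (1, 2, 3, 10, 100, 1000):
        print(f"d={d:5d}  fraction within 5% of the surface: {shell_fraction(d):.4f}")
```

As d grows, the printed fraction approaches 1: almost all of the volume sits in a thin shell just under the surface.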

princehonest · on Oct 22, 2014

"Please do not put solutions to exercises online as it is important for students to work out solutions for themselves rather than copy them from the internet."

I find crowdsourced solutions for honest autodidacts very valuable.

niels_olson · on Oct 22, 2014

Thanks, so true. If you're not immersed, as in a traditional program, you're stuck with catch-as-catch-can, which can be very inefficient. Things like Learn Python the Hard Way are wonderful.

thomaskcr · on Oct 22, 2014

Anybody who doesn't read that first chapter to the end is going to be very confused.

> To make it easier to read we use E^2(1-x) for (E(1-x))^2 and E(1-x)^2 for E((1-x)^2).

Why change that notation? That seems to purposefully be introducing confusion.

On page 14 they don't use that notation (om^2(x+y) = om^2(x) + om^2(y) -- according to their notation note that should really be om^2(x+y) = (om (x+y))^2).

Not trying to knock what seems like a really neat introduction. I just don't understand the need for defining ridiculously unconventional notation and then not using it consistently, which introduces a lot of confusion.
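For readers unsure why the two quoted forms need distinguishing at all, here is a small sketch showing that (E(1-x))^2 and E((1-x)^2) are generally different numbers. The toy distribution (x is 0 or 1 with probability 1/2 each) and the helper name `expect` are assumptions for illustration, not anything from the book.

```python
from fractions import Fraction

# Hypothetical toy distribution: x is 0 or 1, each with probability 1/2.
dist = {Fraction(0): Fraction(1, 2), Fraction(1): Fraction(1, 2)}


def expect(f, dist):
    """Expected value of f(x) for a finite distribution {value: probability}."""
    return sum(p * f(x) for x, p in dist.items())


# Book's E^2(1-x): take the expectation first, then square it.
e_squared = expect(lambda x: 1 - x, dist) ** 2

# Book's E(1-x)^2: square inside, then take the expectation.
e_of_square = expect(lambda x: (1 - x) ** 2, dist)

print("(E(1-x))^2 =", e_squared)    # 1/4
print("E((1-x)^2) =", e_of_square)  # 1/2
```

The gap between the two values is exactly the variance of 1-x, which is why the placement of the exponent matters.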

PurplePanda · on Oct 22, 2014

I've seen this notation quite commonly.

I haven't looked at the link, but based on your quote, your comment about page 14 doesn't look right. The different notation doesn't change the number of times you need to write the operator.

Your new equation is just writing the same thing on each side, but using a different notation. It's like a=a. Whereas their equation is apparently giving an identity.

thomaskcr · on Oct 22, 2014

I've never seen it but I am trying to think back to all of those times I worked both directions on a proof and just shrugged in the middle =p.

Ah -- you're totally right on my last sentence. Thank you.

madcaptenor · on Oct 22, 2014

It's basically declaring operator precedence - saying "we're going to write things this way so we don't need so many parentheses". It's fairly analogous to writing sin^2 x for (sin x)^2, a notation which the intended audience is probably used to. (Although that notation creates its own trouble when you have sin^(-1) x ...)
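A quick numeric sketch of the sin analogy and its pitfall, using only Python's standard `math` module. The value x = 0.5 is an arbitrary choice for illustration.

```python
import math

x = 0.5  # arbitrary illustrative value in (0, 1)

# sin^2 x conventionally means (sin x)^2.
sin_squared = math.sin(x) ** 2

# sin^(-1) x conventionally means the inverse function arcsin(x) ...
arcsin = math.asin(x)

# ... and NOT the reciprocal 1 / sin(x), which the exponent pattern suggests.
reciprocal = 1 / math.sin(x)

print(f"(sin x)^2 = {sin_squared:.4f}")
print(f"arcsin x  = {arcsin:.4f}")
print(f"1 / sin x = {reciprocal:.4f}")
```

The last two printed values differ, which is the trouble with the superscript convention that the comment alludes to.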

asquidy · on Oct 22, 2014

This is cool, but how can you write a book about data science without mentioning causal inference or experimental design? Most people that do data science are not applying black box algorithms to clean data. They are actively manipulating and shaping the data, coming up with theories, and testing those theories. Inference is more important in theory and in practice for data scientists than theoretical models of graph formation and some of the other topics covered in this book.

kiyoto · on Oct 22, 2014

I find the title to be linkbaity and misleading (quite disappointing for decorated computer scientists like Hopcroft and Kannan).

Based on the table of contents, a more accurate title would be "Modern Foundations of Theoretical Computer Science with an Eye Towards Machine Learning", and even that is given a disproportionately large weight on machine learning.

coherentpony · on Oct 22, 2014

> I find the title to be linkbaity and misleading (quite disappointing for decorated computer scientists like Hopcroft and Kannan).
>
> Based on the table of contents, a more accurate title would be "Modern Foundations of Theoretical Computer Science with an Eye Towards Machine Learning", and even that is given a disproportionately large weight on machine learning.

What? That's the title of the book. And it's not linkbaity at all. Linkbaity would be something like, "Two decorated computer scientists just wrote a book about data science, and you'll never guess what's in it!". Or, "419 things you didn't know about data science."

Changing the title of the post doesn't change the title of the book.

cgio · on Oct 22, 2014

I believe that GP meant that the title of the book is linkbaity, not that of the post.

andrioni · on Oct 22, 2014

The book is actually quite far from theoretical computer science. It is much closer to an introductory book on numerical analysis and statistics for applications, which I find is quite in line with a book called "Foundations of Data Science". A bachelor's degree in applied mathematics usually covers around 70% of the book, with the rest being a bit of statistics (mostly around statistical learning and stochastic processes) and some sprinkles of discrete mathematics.

pjmorris · on Oct 22, 2014

Ullman is one of the authors of 'Foundations of Computer Science' [1], a well-respected (but out-of-print) introduction to CS theory. I suspect he's borrowing both the intent and the name.

[1] http://infolab.stanford.edu/~ullman/focs.html

devilsdounut · on Oct 22, 2014

Looks pretty academic. I see no mention of data cleaning or more practical considerations in the table of contents.

ehurrell · on Oct 22, 2014

"Foundations of" tends to be a common academic title start, so I would expect an academic approach. To not talk of cleaning at all seems like a serious error, from academic experience it does not take lone before your work requires data not covered by pre-cleaned datasets e.g. MovieLens or TREC.