Foundations of Data Science (2018) [pdf]

sfpotter · on Jan 30, 2023

None of the comments here discuss anything about the book. They're just knee-jerk responses, either saying the book is too mathematical, or that other books are better. I wish the quality of comments on HN was higher.

Here are some actual thoughts about the book:

- Having actually read a few chapters of this book, I find it to be poorly written on the whole. The chapters are extremely verbose and written in a bit of a circuitous and winding style. I find the explanations to be strained and not very pedagogical. They might be insightful if you already understand the material well, but since this aims at being a "foundations" book, the style of writing seems to miss the mark. I think these guys might have been the wrong people to write this book.

- I don't think the selection of topics makes any sense. It's a bit of a weird mish mash of topics which are arguably more or less fundamental, and have been popular at different times in recent years in the applied math, statistics, computational science, machine learning, and electrical engineering communities. If you are a student, it seems unlikely that you will benefit from taking a course taught out of this book, since you would be better served by taking more focused courses on the individual topics from this book that happen to be relevant to your research. If you are a researcher and need to learn one of the topics in this book, it's unlikely that this book is a better reference than any of the many existing books on these topics (or review papers). If you are self-studying, I think you are bound to fail with this book. You will benefit from something which is more thought through pedagogically.

Ultimately, I don't really see the point of this book.

mturmon · on Jan 31, 2023

I have to agree with both points you make, but especially the second. The selection of topics, and their ordering, is very weird. Like why is there a whole chapter (#4) on Markov Chains, sandwiched between a chapter on the SVD (#3) and an overview of Machine Learning (#5)? There has to be another way!

Most of the topics covered are interesting in their own way, but the pieces just do not add up -- it has the flavor of "every author contributes a chapter, ordered in round-robin style".

In particular: opening up with a whole chapter on the odd properties of high-dimensional spaces ("all the mass is at the edge", etc.) is interesting, but it doesn't work pedagogically. I've heard various talks, over the years, that present this family of results -- and it's nice to see so many of them collected here. But not as an introduction.
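For concreteness, the "all the mass is at the edge" result can be checked with a few lines of arithmetic. A minimal sketch in Python; the function name `shell_fraction` and the 1% shell width are illustrative choices, not taken from the book:

```python
def shell_fraction(dim, eps):
    """Fraction of a unit ball's volume lying within eps of its surface."""
    # Volume scales as r**dim, so the concentric ball of radius (1 - eps)
    # holds (1 - eps)**dim of the total volume; the rest is in the shell.
    return 1.0 - (1.0 - eps) ** dim


# Compare a familiar dimension with two high-dimensional cases.
for dim in (3, 100, 1000):
    print(dim, shell_fraction(dim, 0.01))
```

In 3 dimensions the outer 1% shell holds about 3% of the volume; by 1000 dimensions it holds essentially all of it.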

I'm doubtful that the subject of high-dimensional geometry (interesting though it is, our intuition in #dims = 3 does not scale correctly) provides frequent insight into why algorithms work or fail. (Happy to hear notable counterexamples on this.)

Name your favorite ML surprise: that weak learners can be converted into strong learners by boosting, that ANNs with 10^6 parameters can be trained with 10^4 data, that auto-encoder architectures work so well. If Chapter 1 is trying to shock the reader out of their complacency, that's probably where to start.

Also, this sentence from the last paragraph of the Introduction does not inspire confidence: "The term “almost surely” means with probability tending to one." Oops, that would be "equal to one."
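To spell out the distinction, using the standard definitions (the event names E and E_n are just notation for this note):

```latex
% Almost surely: a single event that holds with probability exactly one.
\Pr(E) = 1

% With high probability: a sequence of events, indexed by a size
% parameter n, whose probabilities tend to one.
\lim_{n \to \infty} \Pr(E_n) = 1
```

The quoted sentence describes the second notion, which is usually called "with high probability" or "asymptotically almost surely".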

sfpotter · on Jan 31, 2023

Couldn't agree more. Your comment about the weirdness of high-dimensional spaces is funny. I think this sort of stuff gets deployed for its high "Gee whiz!" factor. Not that it isn't interesting in its own right, but I fail to see what kind of a purchase it's going to give you on "data science"...

The one example I know of re: insight is L1 regularization. The "spikiness" of the L1 unit ball in R^n for large n can be thought of as encouraging sparsity when using the L1 norm to regularize optimization problems.
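A rough way to see the sparsity effect in code, as opposed to the geometric picture, is to compare the one-dimensional closed-form solutions of the L1- and L2-penalized problems. This is only a sketch under that simplification; the function names and the sample coefficients are made up for illustration:

```python
def soft_threshold(x, lam):
    """Minimizer of 0.5*(z - x)**2 + lam*abs(z): the L1 proximal step."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    # Anything with magnitude at most lam is set exactly to zero.
    return 0.0


def ridge_shrink(x, lam):
    """Minimizer of 0.5*(z - x)**2 + 0.5*lam*z**2: the L2 analogue."""
    # Pure rescaling: a nonzero input never becomes exactly zero.
    return x / (1.0 + lam)


coeffs = [3.0, -0.4, 0.1, -2.5, 0.05]  # hypothetical unregularized values
l1 = [soft_threshold(c, 0.5) for c in coeffs]
l2 = [ridge_shrink(c, 0.5) for c in coeffs]
print(l1)
print(l2)
```

The L1 step zeroes out the three small coefficients, while the L2 step only scales everything down.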

qsort · on Jan 30, 2023

> since this aims at being a "foundations" book, the style of writing seems to miss the mark

"Foundations" in the mathematical jargon doesn't mean "introduction to". It's an illustration of several foundational topics in data science, which is markedly different from an introduction to the field.

E.g. "foundations of computer science" is not "CS 101".

In this sense (and in this sense only) I don't think it misses the mark.

I disagree on the verbosity: statements have full proofs, but that's expected from a "foundations" book. CLRS is waaaay more verbose, for instance.

I agree that I don't see a strong connection or logical path in the selection of topics.

Edit: Also agree about the comments. Wow this thread is bad. The "fuck math" crowd must have awakened.

renonce · on Jan 30, 2023

I'm an undergraduate who took a course based on this book. While I agree that the book is not good at being a "foundations" book, I would say the ideas taught in the book are very helpful in building an intuition for data science. For example: n real numbers independently taken from a standard normal distribution, normalized to be length 1, form a random vector on an n-dimensional sphere. While these facts are not significant in the sense that I could apply them to a problem directly or use them to prove a theorem, these concepts appear frequently in data science problems and I know immediately that a graph is not a random graph by testing one of the properties of a random graph.
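The sphere example is easy to try out. A minimal sketch using only the standard library; the function name, the dimension of 1000, and the fixed seed are arbitrary choices for illustration:

```python
import math
import random


def random_unit_vector(n, rng):
    """Draw n independent standard normals and normalize to length 1."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in g))
    # The Gaussian density depends only on the norm, so the direction
    # of g is uniformly distributed on the unit sphere.
    return [x / norm for x in g]


rng = random.Random(0)  # fixed seed so the run is repeatable
u = random_unit_vector(1000, rng)
v = random_unit_vector(1000, rng)

# Two independent draws in high dimension are nearly orthogonal:
# their dot product has standard deviation 1/sqrt(n).
dot = sum(a * b for a, b in zip(u, v))
print(sum(x * x for x in u), dot)
```

The near-orthogonality of independent draws is one of the high-dimensional facts of the kind the thread is discussing.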

pessimizer · on Jan 30, 2023

> you would be better served by taking more focused courses on the individual topics from this book that happen to be relevant to your research.

Sounds like an overview. I don't think any "foundations" book is intended to be the one and only source of information. I think it's supposed to give you an idea of where you want to go next.

sfpotter · on Jan 30, 2023

The problem with this is that if you need to be told where to go next you aren't going to be able to get there.

If you're in the research community, you won't need to be told where to go next. You will be surrounded by people who will know, and you will pick it up by osmosis. If you don't, you won't last long.

If you're outside the research community, knowing where to go next is unlikely to be of much use.

If you work at a company as an engineer, it's unlikely any of the material in this book will be of much use to you.

college_physics · on Jan 30, 2023

Judging by the content, a better title for this book might be "Mathematical foundations of data science". As has been repeated ad nauseam online by practitioners of this dark art, a defining fraction of what has come to be termed "data science" concerns data processes, data models, data quality etc. This is not just about semantic clarity: the range of algorithms and mathematical models that are applicable heavily depends on these adjacent data concerns.

hacker_junky · on Jan 30, 2023

Do you think there could be other titles that might be better suited to this work?

usgroup · on Jan 30, 2023

"Yet another partial introduction to some things in data science from an impractically mathematical perspective"

schnitzelstoat · on Jan 30, 2023

Fundamentals of Data Engineering by Reis and Housley covers those things very well.

victor106 · on Jan 30, 2023

This is about Data Science not about Data Engineering.

How does a Data Engineering book cover topics on Data Science?

usgroup · on Jan 30, 2023

As a practitioner with a research background I wouldn't recommend this based on the contents over something like Elements of Statistical learning. Generally speaking I feel that data science is missing an authoritative methodology book, well separated from books on algorithms or maths/CS foundations.

college_physics · on Jan 30, 2023

I think that is because data science is a recent term that has been coined to mean an increase of interest in the quantitative analysis of "data" across a diverse number of domains. To abstract what is common across all these fields is not a minor task... But achieving such an authoritative methodology book that will not be a cut-and-paste compilation of different strands is in some sense also an acid test whether the term has long term meaning or was simply a hype term during a particular speculative bubble period.

usgroup · on Jan 30, 2023

yeah i think that's accurate. "what they do" rather constrains a book about "how they do it" :-)

bcd3169 · on Jan 30, 2023

Meanwhile 90% of data science is sql + linear regression at best

scarmig · on Jan 30, 2023

In practice, linear regression is highly effective and likely to be better founded than your million parameter neural net.

auggierose · on Jan 30, 2023

This is also an official book from Cambridge University Press: https://doi.org/10.1017/9781108755528 .

unixhero · on Jan 30, 2023

Unreadable book. Sorry. I am a data science practitioner.

Edit: In my view, difficult to gain much from this.

sfpotter · on Jan 30, 2023

Wow. Thoughtful and insightful comment right here.

phkx · on Jan 30, 2023

The parent page by one of the authors links to a version which has been typeset more recently (in 2019).

See https://www.cs.cornell.edu/jeh/

nicklaf · on Jan 30, 2023

And the second link on that page is to an even more recent version (2020): http://ttic.edu/blum/book.pdf

dayvid · on Jan 30, 2023

Anyone read the Data Science Design Manual by Skiena and find it useful? Curious if it’s a good intro source or something like Statistics in a Nutshell

Pinegulf · on Jan 30, 2023

Thank you. A bit out of date, but the fundamentals are there. This does not seem to be a practical guide, but goes into theory. A good catch for a physicist.

thoi423uyo4i32 · on Jan 30, 2023

(the book is from 2020)

scarmig · on Jan 30, 2023

Ancient.

frozencell · on Jan 30, 2023

What is outdated?

turing_complete · on Jan 30, 2023

nobody uses wavelets these days.

defrost · on Jan 30, 2023

They're still being used in a variety of geophysical imaging applications, eg:

https://www.frontiersin.org/articles/10.3389/feart.2022.1011...

but I guess the cool kids are using something new.

frozencell · on Jan 31, 2023

Should we sometimes learn about history of (data) sciences instead then?