Foundations of Data Science (2018) [pdf] (cornell.edu)
177 points by dilawar on Jan 30, 2023 | 30 comments



None of the comments here discuss anything about the book. They're just knee-jerk responses, either saying the book is too mathematical, or that other books are better. I wish the quality of comments on HN was higher.

Here are some actual thoughts about the book:

- Having actually read a few chapters of this book, I find it to be poorly written on the whole. The chapters are extremely verbose and written in a bit of a circuitous and winding style. I find the explanations to be strained and not very pedagogical. They might be insightful if you already understand the material well, but since this aims at being a "foundations" book, the style of writing seems to miss the mark. I think these guys might have been the wrong people to write this book.

- I don't think the selection of topics makes any sense. It's a bit of a weird mish mash of topics which are arguably more or less fundamental, and have been popular at different times in recent years in the applied math, statistics, computational science, machine learning, and electrical engineering communities. If you are a student, it seems unlikely that you will benefit from taking a course taught out of this book, since you would be better served by taking more focused courses on the individual topics from this book that happen to be relevant to your research. If you are a researcher and need to learn one of the topics in this book, it's unlikely that this book is a better reference than any of the many existing books on these topics (or review papers). If you are self-studying, I think you are bound to fail with this book. You will benefit from something which is more thought through pedagogically.

Ultimately, I don't really see the point of this book.


I have to agree with both points you make, but especially the second. The selection of topics, and their ordering, is very weird. Like why is there a whole chapter (#4) on Markov Chains, sandwiched between a chapter on the SVD (#3) and an overview of Machine Learning (#5)? There has to be another way!

Most of the topics covered are interesting in their own way, but the pieces just do not add up -- it has the flavor of "every author contributes a chapter, ordered in round-robin style".

In particular: opening up with a whole chapter on the odd properties of high-dimensional spaces ("all the mass is at the edge", etc.) is interesting, but it doesn't work pedagogically. I've heard various talks, over the years, that present this family of results -- and it's nice to see so many of them collected here. But not as an introduction.
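
A rough numerical illustration of the "all the mass is at the edge" claim, as a minimal Python sketch. It relies on the standard fact that the volume of a d-dimensional ball of radius r scales as r^d, so the fraction of a unit ball lying within distance eps of its surface is 1 - (1 - eps)^d. The function name `shell_fraction` and the chosen values of d and eps are illustrative assumptions, not taken from the book.

```python
def shell_fraction(d, eps):
    """Fraction of a d-dimensional unit ball's volume within eps of its surface.

    Uses volume(r) proportional to r**d, so the inner ball of radius
    (1 - eps) holds a fraction (1 - eps)**d of the total volume.
    """
    return 1.0 - (1.0 - eps) ** d


# Illustrative values only: a thin shell of relative thickness 1%.
eps = 0.01
for d in (2, 3, 100, 1000):
    print(d, shell_fraction(d, eps))
```

For small d the thin shell holds a small share of the volume, while for d in the hundreds or thousands it holds most or nearly all of it.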

I'm doubtful that the subject of high-dimensional geometry (interesting though it is, our intuition in #dims = 3 does not scale correctly) provides frequent insight into why algorithms work or fail. (Happy to hear notable counterexamples on this.)

Name your favorite ML surprise: that weak learners can be converted into strong learners by boosting, that ANNs with 10^6 parameters can be trained with 10^4 data, that auto-encoder architectures work so well. If Chapter 1 is trying to shock the reader out of their complacency, that's probably where to start.

Also, this sentence from the last paragraph of the Introduction does not inspire confidence: "The term “almost surely” means with probability tending to one." Oops, that would be "equal to one."
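
For reference, the two notions being contrasted can be written out as below. This is the standard usage, given as a sketch rather than as the book's own definitions; the event names $A$ and $A_n$ are placeholders.

```latex
\[
  A \text{ holds almost surely} \iff \Pr[A] = 1,
  \qquad
  (A_n)_{n \ge 1} \text{ holds with high probability} \iff \lim_{n \to \infty} \Pr[A_n] = 1 .
\]
```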


Couldn't agree more. Your comment about the weirdness of high-dimensional spaces is funny. I think this sort of stuff gets deployed for its high "Gee whiz!" factor. Not that it isn't interesting in its own right, but I fail to see what kind of a purchase it's going to give you on "data science"...

The one example I know of re: insight is L1 regularization. The "spikiness" of the L1 unit ball in R^n for large n can be thought of as encouraging sparsity when using the L1 norm to regularize optimization problems.
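
A minimal Python sketch of that sparsity effect, under a simplifying assumption that is not from the thread: the problem is separable, so each coordinate w_i minimizes 0.5*(w_i - z_i)^2 plus a penalty. With the L1 penalty lam*|w_i| the minimizer is the soft-thresholding rule, which sets small coordinates exactly to zero; with the squared L2 penalty 0.5*lam*w_i^2 the minimizer only rescales them. The names `soft_threshold` and `ridge_shrink` and the sample values are hypothetical.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w) over w (L1 penalty)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # anything with |z| <= lam is set exactly to zero


def ridge_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2 over w (squared L2 penalty)."""
    return z / (1.0 + lam)


# Hypothetical unregularized solution and penalty strength.
z = [3.0, -0.2, 0.05, -1.5, 0.4]
lam = 0.5

w_l1 = [soft_threshold(x, lam) for x in z]
w_l2 = [ridge_shrink(x, lam) for x in z]

print("L1:", w_l1)  # small coordinates become exactly 0.0
print("L2:", w_l2)  # every coordinate shrinks but stays nonzero
```

Geometrically, the exact zeros correspond to solutions landing on the corners and edges of the L1 ball, which is the "spikiness" described above.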


> since this aims at being a "foundations" book, the style of writing seems to miss the mark

"Foundations" in the mathematical jargon doesn't mean "introduction to". It's an illustration of several foundational topics in data science, which is markedly different from an introduction to the field.

E.g. "foundations of computer science" is not "CS 101".

In this sense (and in this sense only) I don't think it misses the mark.

I disagree on the verbosity: statements have full proofs, but that's expected from a "foundations" book. CLRS is waaaay more verbose, for instance.

I agree that I don't see a strong connection or logical path in the selection of topics.

Edit: Also agree about the comments. Wow this thread is bad. The "fuck math" crowd must have awakened.


I'm an undergraduate who took a course based on this book. While I agree that the book is not good at being a "foundations" book, I would say the ideas taught in the book are very helpful in building an intuition for data science. For example: n real numbers independently taken from a standard normal distribution, normalized to be length 1, form a random vector on an n-dimensional sphere. While these facts are not significant in the sense that I could apply them to a problem directly or use them to prove a theorem, these concepts appear frequently in data science problems and I know immediately that a graph is not a random graph by testing one of the properties of a random graph.
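
The Gaussian construction mentioned above can be sketched in a few lines of Python using only the standard library. The helper names `random_unit_vector` and `dot`, the seed, and the dimension n = 1000 are assumptions made for illustration. The result is uniform on the unit sphere because the joint density of independent standard normals depends only on the vector's length.

```python
import math
import random


def random_unit_vector(n, rng):
    """Draw n independent standard normals and rescale the vector to length 1."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000
u = random_unit_vector(n, rng)
v = random_unit_vector(n, rng)

print(math.sqrt(dot(u, u)))  # length of u, equal to 1 up to rounding
print(dot(u, v))             # inner product of two independent draws
```

In high dimensions the inner product of two independent draws is typically on the order of 1/sqrt(n), so the two vectors are nearly orthogonal.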


> you would be better served by taking more focused courses on the individual topics from this book that happen to be relevant to your research.

Sounds like an overview. I don't think any "foundations" book is intended to be the one and only source of information. I think it's supposed to give you an idea of where you want to go next.


The problem with this is that if you need to be told where to go next you aren't going to be able to get there.

If you're in the research community, you won't need to be told where to go next. You will be surrounded by people who will know, and you will pick it up by osmosis. If you don't, you won't last long.

If you're outside the research community, knowing where to go next is unlikely to be of much use.

If you work at a company as an engineer, it's unlikely any of the material in this book will be of much use to you.


Judging by the content, a better title for this book might be "Mathematical foundations of data science". As has been repeated ad nauseam online by practitioners of this dark art, a defining fraction of what has come to be termed "data science" concerns data processes, data models, data quality etc. This is not just about semantic clarity: the range of algorithms and mathematical models that are applicable heavily depends on these adjacent data concerns.


Do you think there could be other titles that might be better suited to this work?


"Yet another partial introduction to some things in data science from an impractically mathematical perspective"


Fundamentals of Data Engineering by Reis and Housley covers those things very well.


This is about Data Science not about Data Engineering.

How does a Data Engineering book cover topics on Data Science?


As a practitioner with a research background I wouldn't recommend this based on the contents over something like Elements of Statistical learning. Generally speaking I feel that data science is missing an authoritative methodology book, well separated from books on algorithms or maths/CS foundations.


I think that is because data science is a recent term that has been coined to mean an increase of interest in the quantitative analysis of "data" across a diverse number of domains. To abstract what is common across all these fields is not a minor task... But achieving such an authoritative methodology book that will not be a cut-and-paste compilation of different strands is in some sense also an acid test whether the term has long term meaning or was simply a hype term during a particular speculative bubble period.


yeah i think that's accurate. "what they do" rather constrains a book about "how they do it" :-)


Meanwhile 90% of data science is sql + linear regression at best


In practice, linear regression is highly effective and likely to be better founded than your million parameter neural net.


This is also an official book from Cambridge University Press: https://doi.org/10.1017/9781108755528 .


Unreadable book. Sorry. I am a data science practitioner.

Edit: In my view, difficult to gain much from this.


Wow. Thoughtful and insightful comment right here.


The parent page by one of the authors links to a version which has been typeset more recently (in 2019).

See https://www.cs.cornell.edu/jeh/


And the second link on that page is to an even more recent version (2020): http://ttic.edu/blum/book.pdf


Anyone read the Data Science Design Manual by Skiena and find it useful? Curious if it’s a good intro source or something like Statistics in a Nutshell


Thank you. A bit out of date, but the fundamentals are there. This does not seem to be a practical guide, but goes into theory. A good catch for a physicist.


(the book is from 2020)


Ancient.


What is outdated?


nobody uses wavelets these days.


They're still being used in a variety of geophysical imaging applications, eg:

https://www.frontiersin.org/articles/10.3389/feart.2022.1011...

but I guess the cool kids are using something new.


Should we sometimes learn about history of (data) sciences instead then?



