CS109a: Introduction to Data Science – Resources (harvard-iacs.github.io)
223 points by gtsnexp on July 31, 2022 | 25 comments



These notes might be a great source for what they cover, but as a whole I find this to be a good example of what is currently wrong with data science education. While the syllabus has bullet points that include "1. data collection", "2. data management", and "5. communication", the content and schedule have a 90%+ overlap with a standard machine learning course. They even use a statistical learning textbook (a good one, but still).

Statistics departments keep trying to latch on to the excitement (and money) around data science by changing superficial things like department names and course titles without actually adjusting what they teach. I would love to see a version of this that actually engages at a non-superficial level with topics such as database design, theory(ies) of data visualization, methods for storytelling with data, and interactive design.


>> would love to see a version of this that actually engages at a non-superficial level with topics such as database design, theory(ies) of data visualization, methods for storytelling with data, and interactive design.

I love these discussions and taxonomies in data science. So I have a few genuine/honest questions:

1) Isn't what you described more "analytics" or "analytics engineering" oriented (which is itself a subfield of data science)?

2) I think that more and more people are trying to define what "data science" is, especially for marketing purposes, and then put it in a box like any other science (e.g., chemistry: take any undergrad chemistry textbook and it will cover the same topics). But since it isn't well defined yet, different courses cover different algorithms/aspects of data science, so I think it ends up looking superficial and it's hard to please everyone. Would you agree with that? For example, I'm trying to find a good, in-depth course that applies data science/machine learning to big data problems, but I just can't find any serious course covering it.


I completely agree that it's an open question what exactly constitutes data science and what should (or at least could) be covered in a standard introduction. For me, a fairly reasonable (though certainly not definitive) set of topics is the five items listed on this course's syllabus. And that's what makes this so frustrating, personally. The instructors actually have a good proposal for what should be taught, but then just turn around and teach a classical course in statistical learning.


>> the content and schedule have a 90%+ overlap with a standard machine learning course

Note that neural networks are not even mentioned in the content. This is not a good course to learn modern ML.


At the time the comment was made the link was https://harvard-iacs.github.io/2019-CS109A/pages/materials.h... where neural networks were mentioned.

See https://news.ycombinator.com/item?id=32295656


The other topics you mentioned aren’t exactly classified as “data science” so you likely won’t see them in most university data science courses. Database design has its own course usually but I’ve seen more of the rest as part of college/certificate programs.


The data scientists I've worked with definitely do data visualization and storytelling with data. (Schema design, not so much...)


You're thinking too narrowly about what "schema design" could mean. No, data scientists do not typically design back-end, production database systems. But defining and organizing a multi-sheet spreadsheet for manual data collection is what many data scientists spend much of their time doing (e.g., in the biomedical space). Doing that well definitely requires some understanding of concepts such as functional dependency, normal forms, and data types.
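To make "functional dependency" concrete, here is a minimal Python sketch. The sheet layout and column names (patient_id, birth_year, visit, weight_kg) are made up for illustration, and this is only one simple way to run such a check. It tests whether one set of columns determines another in a collected sheet; when it does, the dependent columns arguably belong on their own sheet, which is the basic idea behind normalization.

```python
def holds_fd(rows, determinant, dependent):
    """Return True if the functional dependency determinant -> dependent
    holds in rows (a list of dicts, one per spreadsheet row)."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in determinant)
        value = tuple(row[col] for col in dependent)
        # The first time a key appears, remember its dependent values;
        # any later row with the same key must agree with them.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical rows from a manually collected "visits" sheet.
rows = [
    {"patient_id": "P1", "birth_year": 1980, "visit": 1, "weight_kg": 70.5},
    {"patient_id": "P1", "birth_year": 1980, "visit": 2, "weight_kg": 71.0},
    {"patient_id": "P2", "birth_year": 1975, "visit": 1, "weight_kg": 82.3},
]

# birth_year depends only on patient_id, so it could live on a "patients" sheet.
fd_birth_year = holds_fd(rows, ["patient_id"], ["birth_year"])
# weight_kg changes between visits, so patient_id alone does not determine it.
fd_weight = holds_fd(rows, ["patient_id"], ["weight_kg"])
```

In this toy data the first check succeeds and the second fails, which suggests keeping birth_year in a per-patient sheet and weight_kg in a per-visit sheet.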


This seems to be a tremendous amount of material to cover, with associated programming exercises to boot, for a course that requires only intro courses in CS and Statistics as prerequisites. So one does wonder how superficial it might be and how closely students heed the warning about Google usage. Or perhaps Harvard students truly are so smart and hard-working that they can go deep into all this material while managing the rest of a full course load! https://harvard-iacs.github.io/2021-CS109A/pages/syllabus.ht...


This course, like many others in certain Harvard departments (CS, Math, Physics), is designed to feel like a boot holding you under the water while you struggle to breathe, but when the semester is over you feel like you learned something.



Do you have a link to the videos that the syllabus says are posted on the site? I can't get much out of the introduction slides. The format seems geared more towards the spoken lecture than the content of the slides, which is fine, but it would be better with the speaker.



Ok, we've changed to that from https://harvard-iacs.github.io/2019-CS109A/pages/materials.h..., since it looks like the most recent version of the same page. The home page of the course seems relevant too.

Thanks to you both!


Why are none of the newer versions available? I can't get this to load.


Video recordings of the lectures seem to require access to Harvard's Canvas platform? Is it possible for outsiders to watch them?


It appears so.

https://harvard-iacs.github.io/2021-CS109A/pages/syllabus.ht...

"If you would like to audit the class, please send an email to the Helpline indicating who you are and why you want to audit the class. You need a HUID to be included to Canvas. Please note that auditors may not submit assignments for grading or make use of other limited student resources such as office hours."


I'm new to this field, and one thing I have a hard time understanding is how to apply all these ML algorithms, Python libraries, etc. to very large data (e.g., how to deal with memory limits). If someone could point me to links and/or hands-on courses I would really appreciate it.


You use versions of the algorithms that have been rewritten to be parallelized. For example:

https://spark.apache.org/mllib/

There are a lot of techniques where this won't be possible due to the nature of the algorithm.
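To illustrate what "rewritten to be parallelized" can mean, here is a minimal pure-Python sketch. It is not how MLlib is implemented, and the function names are made up for this example. The idea: each chunk of data is reduced to a small summary (count, mean, sum of squared deviations), and summaries are merged pairwise, so no step ever needs the full dataset in memory and the per-chunk work could run on separate machines.

```python
from functools import reduce


def partial_stats(chunk):
    """Summarize one chunk as (count, mean, M2), where M2 is the sum of
    squared deviations from the chunk's own mean."""
    n = len(chunk)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return (n, mean, m2)


def merge_stats(a, b):
    """Combine two summaries into the summary of their union
    (the pairwise update usually attributed to Chan et al.)."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return (n, mean, m2)


def chunked(iterable, size):
    """Yield lists of up to `size` items so only one chunk is in memory."""
    chunk = []
    for x in iterable:
        chunk.append(x)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def streaming_mean_variance(values, chunk_size=1000):
    """Return (count, mean, population variance) computed chunk by chunk."""
    summaries = (partial_stats(c) for c in chunked(values, chunk_size))
    n, mean, m2 = reduce(merge_stats, summaries, (0, 0.0, 0.0))
    variance = m2 / n if n else float("nan")
    return n, mean, variance


# Usage: the input can be any iterable, e.g. numbers read lazily from a file.
count, mean, variance = streaming_mean_variance(range(1, 11), chunk_size=3)
```

Because merge_stats is associative, the summaries can be combined in any grouping, which is what lets this kind of computation be spread across workers. Algorithms whose updates depend on seeing all the data at once do not decompose this way.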


Thank you.


Is there a similar class online with PyTorch?


fast.ai's Practical Deep Learning for Coders (https://course.fast.ai). It's amazing.


link: https://course.fast.ai/

discussion from 9 days back about the 2022 version: https://news.ycombinator.com/item?id=32186647


Couldn’t find the videos. Has anybody had any luck with that?


Keras... Yikes



