The intro rings true and brings back a lot of frustrating memories:
>We may have all heard the saying “use it or lose it”. We experience it when we feel rusty in a foreign language or sports that we have not practised in a while. Practice is important to maintain skills but it is also key when learning new ones. This is a reason why many textbooks and courses feature exercises. However, the solutions to the exercises feel often overly brief, or are sometimes not available at all. Rather than an opportunity to practice the new skills, the exercises then become a source of frustration and are ignored.
Typical exercise solution: How to draw an owl. 1. Draw some circles. 2. Draw the rest of the $@#% owl.
> We may have all heard the saying “use it or lose it”. We experience it when we feel rusty in a foreign language or sports that we have not practised in a while
Another sad thing is that we lose our memory in a step function. We remember something for a long time without using it, and then all of a sudden it's gone from our memory. That may explain why many lifers at Google interviewed so badly. They joined Google when there was no leetcode and Google still had amazingly high standards, asking math and algorithm puzzles and novel systems-design questions. So they are really good. They are also confident, because they achieved a lot and led amazing projects at Google. Yet when they started to answer interview questions, they struggled with basic facts.
> Another sad thing is that we lose our memory in a step function. We remember something for a long time without using it, and then all of a sudden it's gone from our memory.
Not sure memory loss is quite that binary. Heading for mid-40s here, and often things are still there, but it's higher latency to fully recall it. A bit like it's stored off in Amazon Glacier. Or often I can get the first byte quickly ("I know that person's surname starts with a P") but retrieving the full answer takes longer and some brute force iterations. It's as if I've hit a hash-table collision and need to binary-compare the results, or I've reached a node in a tree-structure that has many branches ("Is it Pfeffle? Piper? No, Pfeiffer!")
That is, drawing things on paper, no formulae. I used to do similar exercises with the printed Iris dataset, giving people a transparency to draw the classifier on, and then another sheet of paper with the validation dataset. The exercise was loved by everyone from high-school students to managers.
Some good stuff here! I'm surprised they didn't show many analytical results for neural networks. For example, I like having candidates for deep learning research positions derive backpropagation. You can show a wide variety of interesting results in single-neuron models as well.
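For concreteness, here is the kind of single-neuron derivation I have in mind (a sketch in my own notation, not an exercise from the book): with input x, weight vector w, bias b, activation σ, and a loss ℓ comparing the output a to the target y,

```latex
z = w^\top x + b, \qquad a = \sigma(z), \qquad L = \ell(a, y),
\qquad
\frac{\partial L}{\partial w} = \frac{\partial \ell}{\partial a}\,\sigma'(z)\,x, \qquad
\frac{\partial L}{\partial b} = \frac{\partial \ell}{\partial a}\,\sigma'(z).
```

With a logistic σ and cross-entropy ℓ this collapses to the familiar (a − y) x, and backprop through a deeper network is just this chain rule applied layer by layer.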
In a lot of academic fields it's assumed that a researcher really understands their stuff from first principles. I think being able to derive backprop is a really straightforward exercise, and you should definitely expect a researcher in the area to be able to do it off the cuff. I think it's akin to fizz buzz: it'll weed out people who really don't know what they're doing, but it won't tell you too much about those who do it without trouble.
I thought so too, until my friend told me that a CIT PhD working as an applied research scientist at a FAANG company derived an incremental Gaussian mixture model without using the properties of GMMs at all, and another CIT PhD on the same team defended the algorithm by saying something like "but the intuition is correct". I couldn't believe my ears.
But most researchers don't use first principles day to day. Like others in this thread have said, if you don't use it you lose it. Researchers don't use the details of backprop in their day-to-day work, so expecting an off-the-cuff derivation isn't a fair assessment of what makes a good researcher.
Researchers specialize, and as such even in a "single field" they work at very different levels of abstraction: one subfield will care about building up some novel construction from first principles, while another subfield will want to use that construction as a basic axiomatic building block and abstract away from the details. E.g. optimizing the execution performance of a known formula is orthogonal to developing a better formula; we want people working on both aspects, and these are going to be different people who each build on their own subfield's first principles, which don't overlap much with the first principles of the other researcher.
In my experience analytical results have very little impact on practical deep learning. Examples:
- Everyone “knows” that the Adam optimizer’s proof is incorrect, but we still use Adam because we don’t want to redo hyperparameter search with a different optimizer that’s proven to converge but probably performs worse.
- Everyone “knows” that the Wasserstein loss for GANs has a better convergence proof, but nobody uses it because the generated images look like crap compared to what you get from stylegan* with their default config.
It’d be nice if ML proofs led to better performance, but that’s not often the case. I see far more progress from better data preprocessing and from bringing in knowledge from other fields like signal processing.
That seems like a very basic thing that would be on a quals exam. A good researcher is a combination of smart/creative with expert knowledge of their field down to the fundamentals.
As an EE PhD, can you derive basic control theory or electromagnetic relationships?
I still have yet to use graphical models (in the traditional sense, not including the new age variational inference style neural networks as graphical models) in real life. Am I just completely missing something? Where do people generally find compelling uses for graphical models?
I use graphical models all over the place, typically for problems that have more structure than simple statistics calculations, but don't need the huge capacity of machine learning models. For example I work with bio folks measuring "EC50" values, basically parameters of titration curves in a wet lab. It seems like a simple curve fitting problem with say 5 parameters. But then these wet lab scientists measure hundreds of curves at once, so we want to put hierarchical structure among the curves -- that is all the curves should look pretty similar with only a few degrees of freedom. Graphical models are a great framework for expressing prior knowledge about the dependencies between these curve parameters. But yeah I then do inference in graphical models using variational inference in PyTorch.
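For readers wondering what that looks like in code, here is a minimal sketch of a hierarchical EC50 model, assuming Pyro (a PyTorch-based probabilistic programming library the commenter did not name); the curve parameterization, priors, data shapes, and names are all illustrative:

```python
# Minimal sketch of a hierarchical dose-response (EC50) model, assuming Pyro.
# Shapes, priors, and names are illustrative, not the commenter's actual setup.
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal

def hill_curve(log_dose, log_ec50, slope=1.0, bottom=0.0, top=1.0):
    # Standard sigmoidal titration curve parameterized by its midpoint (EC50).
    return bottom + (top - bottom) * torch.sigmoid(slope * (log_dose - log_ec50))

def model(log_dose, response):
    # log_dose, response: tensors of shape (n_curves, n_points)
    n_curves = log_dose.shape[0]
    # Population-level parameters shared by all curves.
    mu = pyro.sample("mu_log_ec50", dist.Normal(0.0, 2.0))
    tau = pyro.sample("tau_log_ec50", dist.HalfNormal(1.0))
    sigma = pyro.sample("sigma_obs", dist.HalfNormal(0.5))
    with pyro.plate("curves", n_curves):
        # Per-curve EC50s are drawn from the shared prior: this is the
        # "all curves should look pretty similar" hierarchical structure.
        log_ec50 = pyro.sample("log_ec50", dist.Normal(mu, tau))
        mean = hill_curve(log_dose, log_ec50.unsqueeze(-1))
        pyro.sample("obs", dist.Normal(mean, sigma).to_event(1), obs=response)

# Variational inference, as in the comment ("variational inference in PyTorch").
guide = AutoNormal(model)
svi = SVI(model, guide, pyro.optim.Adam({"lr": 0.01}), loss=Trace_ELBO())
# for step in range(2000):
#     svi.step(log_dose, response)
```

The shared prior on log_ec50 is what encodes "the curves should only have a few degrees of freedom": each curve borrows strength from the others instead of being fit independently.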
Is there a resource you would recommend for learning about applying GMs and related techniques to data like those you described (similar structures with N DoF)? As a hobbyist I've been toying with symbolic regression, and this feels like a wall I've been running up against.
They are used quite a bit in computational genomics. Indeed, genomics is full of latent variable problems where one has a good model for the underlying phenomena but often not a lot of labeled data.
Though, there are other ways to derive it. My personal opinion is that the vector and matrix calculus derivations in the book are too verbose, though this style may be more comfortable for some readers. The semidefinite and cone optimization communities have more concise ways of deriving these kinds of derivatives and relationships; this can be seen, for example, in Boyd and Vandenberghe's Convex Optimization or Ben-Tal and Nemirovski's Lectures on Modern Convex Optimization.
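As an example of the more concise, coordinate-free style I mean (a generic illustration of mine, not taken from either book): for f(x) = xᵀAx, working with differentials gives

```latex
df = dx^\top A x + x^\top A\, dx = x^\top (A + A^\top)\, dx
\quad\Longrightarrow\quad
\nabla f(x) = (A + A^\top)\, x .
```

Working with differentials like this avoids tracking individual indices, which is where much of the verbosity in elementwise derivations comes from.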
I almost bounced off this link because none of the tabs have any content in them. I clicked through to the GitHub and it also doesn't tell you where to get started. Finally I came back and found the PDF. Maybe the title could be "Pen and paper exercises in machine learning (PDF)" to help you know what you're looking for?
Direct links to arXiv PDFs are discouraged: if the link pins a specific version and the author uploads a v2 to correct an error, your link will still point to the old version.
Even if it was a link for a specific version, I'd still post it as arxiv's landing page isn't designed for those who need to jump to the content right away. It's extremely cluttered and hard to navigate. I couldn't care less if the PDF got some arbitrary fixes in an unknown future. That'd be arxiv's problem.
I would say it depends a lot on your background. The whole thing is very detailed, but ideas can be lost in detail-oriented proofs.
Reading the first few sections, it seems that the ideas are there, especially in the proofs: plenty of motivating ideas, and the kind of "raw index crunching" that the paper begins with gives way to bigger ideas later on. Doubters might read section 1.6 about the power method for finding the largest eigenvalue. It convinced me that the ideas were worth reading.
It's so cool to see why this works (as an engineer I learned about the power method with the hand-waving explanation "it works in the limit", but I never knew why it works).
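The standard argument is short (sketched here in generic notation, not quoted from the book, assuming A is diagonalizable with |λ_1| > |λ_2| ≥ … and a start vector with a nonzero u_1 component):

```latex
x_0 = \sum_i c_i u_i
\quad\Longrightarrow\quad
A^k x_0 = \sum_i c_i \lambda_i^k u_i
        = \lambda_1^k \Big( c_1 u_1 + \sum_{i \ge 2} c_i \big(\tfrac{\lambda_i}{\lambda_1}\big)^k u_i \Big).
```

Since |λ_i/λ_1| < 1 for i ≥ 2, the trailing terms decay geometrically, so the normalized iterates line up with u_1.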
So what do we do if we want u_2, the eigenvector that corresponds to lambda_2? MathOverflow says we can just subtract the u_1 subspace from A [1] and then repeat, but would that be numerically stable? (i.e. will that work with floats?)
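For what it's worth, here's a quick NumPy sketch of that deflation idea for a symmetric A (the matrix, names, and the re-orthogonalization remark are my own illustration, not from the linked answer):

```python
# Power iteration plus Hotelling deflation for a symmetric matrix.
import numpy as np

def power_iteration(A, num_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = A @ v
        v /= np.linalg.norm(v)
    lam = v @ A @ v  # Rayleigh quotient of the converged vector
    return lam, v

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

lam1, u1 = power_iteration(A)

# Deflation: subtract the u_1 component, then run power iteration again.
A_deflated = A - lam1 * np.outer(u1, u1)
lam2, u2 = power_iteration(A_deflated)

print(lam1, lam2)
print(np.sort(np.linalg.eigvalsh(A)))  # compare against a direct solver
```

Deflation works in exact arithmetic, but in floating point errors in u_1 leak back in, so a common safeguard is to also re-orthogonalize the iterate against the eigenvectors already found at each step (v -= (u1 @ v) * u1).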
I am at a loss to understand how these constitute _machine_ learning. The preface says "the exercises are ideally paired with computer exercises...", but I struggle to imagine what such computer exercises would look like. Somebody ELI(at least 10)?
These exercises are writing mathematical proofs that basic machine learning algorithms behave correctly. They are "pen and paper" not because you are manually solving a large equation that a machine would normally solve, but because we don't have automated theorem provers capable of proving interesting machine learning theorems. I would expect a typical 1st year grad student to be using a resource like this.
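As for what the paired computer exercises might look like: a typical pattern (my guess, not something the book specifies) is to derive a gradient or estimator on paper and then verify it numerically, e.g. a finite-difference check of a hand-derived logistic-regression gradient:

```python
# Hypothetical "computer exercise" paired with a pen-and-paper derivation:
# derive the logistic-regression gradient by hand, then check it numerically.
import numpy as np

def loss(w, X, y):
    # Mean negative log-likelihood of logistic regression.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # The hand-derived gradient: X^T (p - y) / n.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.standard_normal(3)

# Central finite differences as an independent check of the derivation.
eps = 1e-6
num_grad = np.array([
    (loss(w + eps * e, X, y) - loss(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(num_grad - grad(w, X, y))))  # should be tiny (~1e-8 or less)
```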
If you don't understand the purpose of proofs, then this resource is not aimed at you.
Your question is a fair one and does not deserve to be downvoted. I'm just curious: from what sources did you learn about machine learning? Even in this world of using libraries to get things done, I have yet to see machine learning books or courses that don't touch math at all.