Machine Learning Guides (developers.google.com)
708 points by _ntka on July 23, 2018 | 54 comments



Unlike most guides I've seen about ML, this one does a good job of focusing on developing and deploying a simple model first, then iterating. There are also a lot of practical tips here, especially around feature engineering.

> the second phase of machine learning involves pulling in as many features as possible and combining them in intuitive ways. During this phase, all of the metrics should still be rising

As Google points out, after you build an initial model, the next step to increase accuracy is to perform feature engineering. They explain that this can be done manually or automatically using something like deep learning. Another option that people here might consider is using a library like Featuretools (https://github.com/featuretools/featuretools) for "automated feature engineering". Note: I am one of the developers.

Our goal is to help you increase the performance of your models without sacrificing the interpretability of your features. We have a post up about how our algorithm works here: https://www.featurelabs.com/blog/deep-feature-synthesis/. There are also plenty of real-world demos on our website: https://www.featuretools.com/demos
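
For a concrete picture of what "automated feature engineering" builds on, here is a minimal pure-Python sketch of the aggregation idea: rows from a child table (transactions) are summarized into features on a parent table (customers). The data, the column names, and the `aggregate_features` helper are all made up for illustration. This is not the Featuretools API, which automates and stacks steps like this across many related tables.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child table: one row per transaction.
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 5.0},
    {"customer_id": 2, "amount": 15.0},
    {"customer_id": 2, "amount": 40.0},
]

# Aggregation functions applied to each customer's list of amounts.
AGGREGATIONS = {"count": len, "mean": mean, "max": max, "sum": sum}


def aggregate_features(rows, key, value):
    """Group rows by `key` and apply every aggregation to the `value` column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[value])
    return {
        group: {f"{name}_{value}": fn(values) for name, fn in AGGREGATIONS.items()}
        for group, values in grouped.items()
    }


features = aggregate_features(transactions, key="customer_id", value="amount")
for customer_id, feats in sorted(features.items()):
    print(customer_id, feats)
```

Each customer ends up with named columns such as `mean_amount` and `max_amount`, so the generated features stay readable, which is the property the comment above cares about.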


Your article is cool. I have some things about information of machine learning. just check it out http://tricks4321.blogspot.com/2018/07/machine-language-for-...


One of my random questions is: what does Google gain by spending resources on developing a course like this? Do they want more people to do machine learning because there is a shortage of developers with this skill in the market, or is there something else involved in the mix?

Secondly, for some reason data science just doesn't excite me as much as typical software development does. Like, why am I not excited enough to go down the path of specializing in data science in the field of machine learning? Even if there is more money in it, I'm still not extremely motivated to learn it.

What I do particularly enjoy is good ol' back end web development. I don't have a degree in computer science, but I'm working on an information systems degree with a focus on "programming". I dream of (and am working my ass off toward) joining the cult of "software engineer" type II: a sophisticated software developer/programmer. I love building layers, optimizing code, learning new tools, algorithms and data structures (without knowing math), creating unit tests, and following programming paradigms. It excites me so much. And the core skill I want to dive into is blockchain. I love studying that topic too, and all the algorithms that come with it.

But when I see data science, no excitement. All I imagine is image manipulation and fancy charts. I know I sound a bit ignorant but, that's how it is.


> One of my random questions is: what does Google gain by spending resources on developing a course like this?

Mindshare or more generally PR. Also to "collect" the talent on their platforms (Tensorflow, Google Cloud, ...). Also these guides were repurposed from existing (internal) guides and are a few years old by now, so the cost is low.

You further describe the role of a data engineer or ML engineer. If you approach data science with a focus on engineering and tool use, you could be one of the few dangerous data scientists who are able to go end-to-end (which should be safe for at least 5 years, until such pipelines evolve without much human intervention).

> But when I see data science, no excitement. All I imagine is image manipulation and fancy charts.

This is because, while there is legit substance to the hype, the hype is real and it is focused on deep learning on ImageNet (and later GANs, Atari games, Go). Being able to show deepdreamed images and cat neurons is like catnip to journalists. Computer vision is but a very small part of ML and lots of data-driven companies have no need for such skills. Charts are made by analysts.

Everything (including blockchain) will move closer to the ML paradigm of learning software. Data infra engineers will see their infra increasingly used for ML. It all remains software (very advanced, but accessible to anyone) and hardware (still an asymmetry here between industry lab and practitioner). Don't get left out: Do machine learning like the great engineer you are, not like the great machine learning expert you aren’t.


Great and honest points.

> Secondly, for some reason data science just doesn't excite me as much as typical software development does

Fair enough. Part of the reason is that "data science" has been so jam-packed with nonsense and with people who don't do the actual work of building things, as you describe below.

> What I do particularly enjoy is good ol' back end web development. I don't have a degree in computer science, but I'm working on an information systems degree with a focus on "programming". I dream of (and am working my ass off toward) joining the cult of "software engineer" type II: a sophisticated software developer/programmer. I love building layers, optimizing code, learning new tools, algorithms and data structures (without knowing math), creating unit tests, and following programming paradigms. It excites me so much. And the core skill I want to dive into is blockchain.

OK, this makes sense. But I'd be worried about 5 years from now. When all the little gears and things that go on in the backend become a commodity (or are abstracted away in the "cloud"), what are you going to do?

> I love studying that topic too, and all the algorithms that come with it.

That spark of interest in the algorithms (which is just about logic, which is basically what math is about in the end) is the essence of what makes "data science" so attractive.


"But I'd be worried about 5 years from now. When all the little gears and things that go on in backend becomes a commodity (or abstracted away in the "cloud"), what are you going to do?"

Well, over the last 8 years or so I started out in a similar kind of place, and have gotten quite good at building CRUD and business logic and glue, and fixing crap on the front end, and configuring servers.

Maybe I can stand in for the OP a few years down the line?

Over the last quarter, I've been splitting my time between things like linux admin automation and a set of pre-calculus core classes.

To answer your question on my personal scale, my whole ability to do this kind of work with my mediocre CS education (my BA is in Philosophy, and my PhD work is in Lit) is premised on leveraging the points in the systems where "all the little gears and things that go on in backend have [become] a commodity"... hence I just integrate ERP systems with WordPress, or try to clean up some business's AWS Drupal hosting setup, or some crap like that. That's been a fun and rewarding conjunction of my love for systems and the commodification of parts of IT/programming work.

My hope is that by the time all the little bits of these data science topics become "abstracted away" over the next couple of years, I will understand the general underlying things well enough to use them. But who knows if that is a good bet or not... certainly not me.

However, it feels perfectly fine to learn things like math... I'm way, way better at it than I was as an undergrad 20 years ago and so it's quite a lot more fun for me. It's not like knowing some math has no application outside of this narrow field.

I dunno if my personal answer (keep learning, and enjoy fixing crap) matches the OP or helps extend your points/ question, but I've been getting a lot of fun (and some money) out of following my answer.


I think I have an answer to that first question. Altruistically, I'd like to think it's to help produce more ML engineers and scientists. Realistically, among the other reasons noted by everyone else, it's a way to attract enterprise users to their technology and invariably their cloud.

Consider a larger organization (1000+ people perhaps): if groups within that org can train their people with these materials, or even send them to Google to be trained in this subject matter, they can come back with a nice shiny credential. Whether that ultimately becomes useful to that individual or the group is up to them, but really it helps Google foster that relationship with the main organization to eventually snag higher contract values.

That probably made no sense, but I thought I'd give my two cents (however crummy they might look).


> One of my random questions is: what does Google gain by spending resources on developing a course like this?

s/Google/someone at Google/

20% time leaves discretionary time for people who're motivated to get something like this started. Official approval may come along the way.


Everyone I know at Google says 20% time comes on top of 100% time these days


> Do they want more people to do machine learning because there is a shortage of developers with this skill in the market, or is there something else involved in the mix?

They want to sell TPUs, this is part of generating the demand.


> What I do particularly enjoy is good ol' back end web development.

By all means, keep at it! Better to be an exceptional backend dev than an average ML engineer. No one can predict the future anyway. It's certainly possible that the ML job surge is gonna stop abruptly when most of the advances have been captured by APIs.


Regarding your latter part, do read the “define: CTO OpenAI” (don’t have link I’m on mobile) - author has fascinating insights on just how important engineering of the specifics you describe is, for ML work to progress and show results.


An ML/data engineer tasked with productizing a data pipeline still does all of those: building layers, optimizing code, learning new tools, algorithms and data structures, creating unit tests, following programming paradigms.


Perhaps it's somewhat off-topic, but I've built a spam detector similar to the article's withOUT using "direct" AI, but rather via a key-word or key-phrase "ranker". A simplified example is given below.

The advantage over other techniques is that one can easily trace the exact math of a conclusion, and tune it as needed. The disadvantage is that one probably has to manually tune it all rather than let the machine "learn". However, a hybrid approach could be used whereby "pure" AI suggests words and phrases to encode.

     rule.addList("nigerian, prince", rank=7);
     rule.addPhrase("great opportunity", rank=5);
     rule.addPhrase("lisa smith", rank=-4); // probably good

Here a "list" means that the word order doesn't matter, but with a "phrase" it does matter. A negative value means it's less likely to be spam, usually because it's specific to your business or task. Actually I had multiple categories rather than just "spam" versus "non-spam", but that would complicate the example. I also used a database. One could perhaps call it a "weighted" version of MS-Outlook's rule engine. Somebody had a similar idea: http://dergipark.gov.tr/download/article-file/45302
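
A runnable sketch of the same idea in Python might look like the following. The `RuleSet` class, its method names, and the tokenizer are assumptions made for illustration; the original system used a database and multiple categories, neither of which is modeled here.

```python
import re


class RuleSet:
    """Hand-tuned keyword/phrase scorer (an illustrative reimplementation)."""

    def __init__(self):
        self.rules = []  # each entry is (kind, words, rank)

    def add_list(self, words, rank):
        # "List": every word must appear somewhere, in any order.
        self.rules.append(("list", [w.strip().lower() for w in words.split(",")], rank))

    def add_phrase(self, phrase, rank):
        # "Phrase": the words must appear consecutively and in order.
        self.rules.append(("phrase", phrase.lower().split(), rank))

    def score(self, message):
        tokens = re.findall(r"[a-z0-9']+", message.lower())
        total = 0
        for kind, words, rank in self.rules:
            if kind == "list":
                hit = all(w in tokens for w in words)
            else:
                n = len(words)
                hit = any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))
            if hit:
                total += rank
        return total


rule = RuleSet()
rule.add_list("nigerian, prince", rank=7)
rule.add_phrase("great opportunity", rank=5)
rule.add_phrase("lisa smith", rank=-4)  # probably good

print(rule.score("A Nigerian prince has a great opportunity for you"))  # 7 + 5
print(rule.score("Lisa Smith found a great opportunity"))               # 5 - 4
```

Because the total is just a sum of the ranks of the rules that fired, every conclusion can be traced back to specific rules, which is the advantage described above.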


You're essentially doing a rough manual version of Bayesian classification on n-grams (which is still very explicable): http://www.paulgraham.com/spam.html
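
For comparison, here is a minimal sketch of a learned version: a textbook multinomial naive Bayes classifier with add-one smoothing. It uses single words rather than longer n-grams to stay short, the training messages are made up, and it is not the exact formula from the linked essay.

```python
import math
from collections import Counter

# Tiny made-up training set: (text, label).
TRAIN = [
    ("nigerian prince great opportunity", "spam"),
    ("great opportunity cheap pills", "spam"),
    ("meeting notes from lisa smith", "ham"),
    ("lisa smith sent the quarterly report", "ham"),
]


def train(examples):
    """Count words per class and documents per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def log_posteriors(text, model):
    """Return an unnormalized log posterior for each class."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in word_counts:
        # Log prior plus a sum of per-word log likelihoods (add-one smoothing).
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return scores


def classify(text, model):
    scores = log_posteriors(text, model)
    return max(scores, key=scores.get)


model = train(TRAIN)
print(classify("great opportunity from a prince", model))
print(classify("report from lisa", model))
```

The per-word log likelihood terms play the same role as the hand-assigned ranks, except that they are estimated from labeled messages instead of being typed in.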


The idea of my approach was that a "power user" could add the rules and scores without having to understand something that may take a while to explain. A scoring sheet can be displayed for a given message that would make sense to just about anybody with an associate degree. Example scoring sheet for a given message:

     Category: Spam
       Rule-ID    Score
       ----------------
       NgrPrnc1       7
       bPills         5
       knownPeople   -3      
         Total:       9 Threshold Exceeded!

     Category: Tech Support
       knownWidgets   3
       offer1        -2
         Total:       1 Insufficient total

     Category: Etc...
One could click on the rule-ID as a hyperlink to see specifics of a given rule (if details don't fit on screen).
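
A sheet like this is cheap to generate once the rule hits for a message are known. The sketch below assumes the hits arrive as a mapping from rule ID to score and that a single threshold applies to every category; the value 5 is a placeholder, since the real cutoff is not stated here.

```python
THRESHOLD = 5  # assumed cutoff; the actual value is not given in the thread


def scoring_sheet(category, hits, threshold=THRESHOLD):
    """Render rule hits (rule ID -> score) as a plain-text scoring sheet."""
    total = sum(hits.values())
    verdict = "Threshold Exceeded!" if total >= threshold else "Insufficient total"
    lines = [f"Category: {category}"]
    for rule_id, score in hits.items():
        lines.append(f"  {rule_id:<12}{score:>4}")
    lines.append(f"  {'Total:':<12}{total:>4} {verdict}")
    return total, "\n".join(lines)


spam_total, spam_sheet = scoring_sheet(
    "Spam", {"NgrPrnc1": 7, "bPills": 5, "knownPeople": -3}
)
tech_total, tech_sheet = scoring_sheet(
    "Tech Support", {"knownWidgets": 3, "offer1": -2}
)
print(spam_sheet)
print(tech_sheet)
```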


This is how people did things back in the day: "expert systems" with hand-crafted rules, built by "experts".

From the past, we learn that these systems are brittle and break continuously. For example, what happens when spammers start using different words, or send legitimate looking emails that are actually spam? Do you think you can build rules to catch 70%, 80%, 90% or 99.99% of spam?

If your goal is simply showing the rules being applied, you can still learn the rules with ML but display them in this way (for example GP suggested looking at Naive Bayes which was the most common method used to fight spam; I'd also point you to decision trees which are easy to visualize).
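
As a rough illustration of "learn the rules, then display them", here is a one-level decision tree (a decision stump) in plain Python. It picks the single word whose presence best separates spam from non-spam in a made-up training set and prints the result as a readable rule. A real decision tree would recurse on each branch and use a purity measure such as Gini impurity instead of a raw error count.

```python
# Tiny made-up labeled set: (text, is_spam).
MESSAGES = [
    ("nigerian prince needs your help", True),
    ("great opportunity cheap pills", True),
    ("prince offers great opportunity", True),
    ("lunch with lisa smith tomorrow", False),
    ("great notes from the meeting", False),
    ("quarterly report attached", False),
]


def learn_stump(examples):
    """Pick the single word whose presence best separates spam from non-spam."""
    vocab = sorted({word for text, _ in examples for word in text.split()})
    best_word, best_errors = None, len(examples) + 1
    for word in vocab:
        # Rule under test: "message contains `word`" means spam.
        errors = sum((word in text.split()) != is_spam for text, is_spam in examples)
        # Strict comparison: ties go to the alphabetically first word.
        if errors < best_errors:
            best_word, best_errors = word, errors
    return best_word, best_errors


word, errors = learn_stump(MESSAGES)
print(f"IF message contains '{word}' THEN spam  ({errors} training errors)")
```

Note that "great" looks spammy but also appears in a legitimate message, so the learner passes over it, which is the kind of judgment a hand-written rule list has to get right manually.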


As stated, it wasn't intended for an "expert", but a power user. Somebody has to make the decision anyhow of spam versus non-spam in order to make a training set for "learning" based AI. These days you can purchase spam detection systems/services such that training such systems in-house is usually not worth it. They can use rejected messages from thousands of orgs to train their system.

But what I described had additional purposes such as sub-routing to various departments. It was a multi-purpose email categorizer in the early days of spam. Each approach has trade-offs. I'm not sure how you'd apply a "decision tree" using weights in a way that makes sense to a power user. A non-weighted decision tree seems too blunt an instrument. One generally needs multiple "clues" (factors) voting in tandem.


These guides also give good heuristics on how to look at data before throwing a model at it, and deciding what's the most logical model approach/architecture.

A good example is the text preprocessing flowchart (also shared by fchollet on Twitter): https://developers.google.com/machine-learning/guides/text-c...
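
The first decision in that flowchart can be sketched in a few lines: divide the number of samples by the median number of words per sample, then pick a model family based on the ratio. The cutoff of 1500 is the value usually quoted from that flowchart; treat it as an assumption and check the guide itself. The function name and the corpus are made up.

```python
from statistics import median


def choose_model_family(texts, threshold=1500):
    """Apply the samples / words-per-sample heuristic to a list of texts.

    `threshold` is an assumed value (see the note above the code).
    """
    words_per_sample = median(len(text.split()) for text in texts)
    ratio = len(texts) / words_per_sample
    family = "n-gram features + MLP" if ratio < threshold else "sequence model"
    return ratio, family


# Made-up corpus: 3,000 short reviews of 10 words each.
reviews = ["good " * 10] * 3000
print(choose_model_family(reviews))
```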


This is something that's almost always glossed over and I'm glad they included it. It's easy to apply ML algorithms to perfect data that doesn't need cleaning and get great results. Finding a productive model when presented with a nuanced, messy problem is a much more difficult task, however, and something most ML crash courses don't focus enough time on.

I think there's a tendency on Hacker News and other tech websites to diminish the importance of having a PhD in ML fields. The problem solving and communication skills you learn during the course of a PhD program are precisely the skills companies value when they're trying to solve hard problems. It's important to know not just how to apply ML algorithms, but when they're appropriate.


Back in 2006, in high school, I was investigating multilayer feed-forward NNs. I found them magical. I implemented the XOR problem, etc.

What always confounded me was the choice of the number and width of hidden layers. This is even more confusing now with the advent of deep and recursive networks. We need empirical work on this that can be taught in much the same way that gravity is taught as an apple falling from a tree.

We need a determination of the entropy of a network, and of how to route that entropy and exploit it. Specific scenarios are not adequate.
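
For readers who have not seen the XOR problem: a single threshold unit cannot compute XOR because the two classes are not linearly separable, but one hidden layer of two units can. The sketch below uses hand-picked weights rather than learned ones, purely to show why the hidden layer matters. It says nothing about how to choose layer sizes in general.

```python
def step(x):
    """Threshold activation: fire when the weighted input is positive."""
    return 1 if x > 0 else 0


def xor_net(x1, x2):
    # Hidden layer (weights chosen by hand, not learned):
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # Output unit: "OR but not AND", which is XOR.
    return step(h_or - h_and - 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```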


> gravity is taught as an apple falls from a tree.

Is this more advocating for a theory of neural networks rather than empirical evidence?


These guides pop up left and right, lately. I can't comment on their quality (I assume it's somewhat decent) but it's kinda ridiculous to try compressing a college degree's worth of knowledge into a bunch of sleek online tutorials.


> it's kinda ridiculous to try compressing a college degree's worth of knowledge into a bunch of sleek online tutorials.

Honest question, why?

We used to give degrees (albeit hundreds of years ago) for material that now is covered, at a high level, in a single course (e.g. physical sciences). The amount of material to cover, and to master, increases dramatically over time. It makes sense to compress the knowledge to be delivered to a compendium so as to simply keep up with progress.


Not OP but I suspect they will say something about the math behind it. It's very true you can get quite adept at plug-and-play machine learning models (and indeed be quite successful) but the theoretical statistics, linear algebra and overall mathematical maturity take a long time to develop in my opinion.


Yep, the great thing about TensorFlow is the canned algorithms. You can focus on your data sets and problem rather than deep theory.


Good guides, just finished the text classification one. The approach is very much to grow a good dataset and then find and tune a model that works well for your needs.


Rule #0: choose first principles over machine learning.

ML is a last-resort for problems you don’t understand. There are lots of these, but understanding the problem is better.


Yes and no. When I first started getting into this stuff I was amazed at the difference good feature selection and feature engineering made and I was meticulous about it. But in many real world cases now, an expert human can’t do it as well as quickly as a DNN running on multiple K80s. It all boils down to economics in the end.


Deep learning and blockchain noise seems to be dying out slowly. Are we going downhill on the hype curve, is it time to short those buzzwords?


Deep learning is far from slowing down. The pace of research has increased a lot in the last 5 years. It amazes me how fast researchers are able to come up with new things. From one year to the next there is something shiny and new.

I don't follow blockchain very closely, but it seems like it is where it was 6 years ago.


I think most people just don't understand either very well. They feel the hype but are waiting for others to show them examples of what to do.


Yep. The plebs are onto us. Time for a crash ;)


Why does every ML guide use cats as an example?


I know right! It should be about hotdogs instead.


Could use Trump tweets too.


Daily reminder for data scientists and machine learning types: fill your pockets while you can, because machine learning bootcamps are on the horizon!


There have been data science/machine learning bootcamps around for a while (Galvanize/Metis being common examples in San Francisco), but apparently job placement is not in a good place (as with normal bootcamps).

Indeed Machine Learning/Deep Learning has become much more accessible thanks to the number of free guides such as this. But that means data science job placement will become more difficult as competition increases, with more gatekeeping/requirements (e.g. Masters/Ph.Ds)


The issues I've heard from a few people in hiring is that there is a surplus of junior data scientists from these camps and a shortage of senior data scientists to manage them. Problems not dissimilar to tech hiring in general, but companies need a lot more SWEs than data scientists.


Depends what the company is doing.

Most companies are going to utilise ML to some extent. Once technology and tooling improve, they'll need boots-on-the-ground engineers and not labs with R&D teams.


Masters/PhDs can be boots-on-the-ground engineers too.


Honestly, unless these ML bootcamps are extensive courses on calculus, linear algebra, and statistics and not just "Here's k-means. Memorize it" I doubt they'll harm the market for grad school educated data scientists.


I'd like to offer a counterpoint. I attended one of the machine learning bootcamps mentioned above, and it was transformative for me. I got hired within a month, doubled my salary to over 100k, and landed a job that I enjoy and find intellectually stimulating. All this while having little to no technical experience (the only math I took in college was intro to stats, and my pre-bootcamp career was in a non-technical capacity).

I completely understand why there is such a stigma around bootcamps. Nobody can deny that they don't afford the same depth that you'd get at a "real" program. But they can be amazing for career switchers like me, who had no real direction in college. Don't look down your nose at them.


Agreed


Yes, deep learning is what web-development was twenty years ago (and now everybody and their mother can build a website).


Too bad ML as a service is already largely cornered by $FANG


> Too bad ML as a service is already largely cornered by $FANG

Wat?

Neither Facebook nor Netflix offer outsiders access to their ML platform, and you completely forgot Azure, which IMHO has the most mature offering of the big 3 in this space.


You cannot learn machine learning or deep learning in a few months. You can learn to copy what these guides do, but if you want to do something slightly different you will feel you know nothing (because you probably don't actually know anything about the maths behind why these things work, so when you want to change them you don't know how).


I don't deny that knowing the math / theory is useful, but wonder if we sometimes overestimate the degree to which it is essential. For example, backprop with SGD is a good foundation for many, many, many applications of NN's, and pre-built implementations exist that let you use the technique without understanding the details of the math. And with those tools, you can experiment with many different combinations of features, different architectures, etc.

Of course understanding the theory will be helpful in knowing which architectures are most likely to be productive and what-not, but this whole field is very empirical anyway. So if your experimenting is a little less guided by intuition rooted in theory, that's not exactly the end of the world.
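
To show how little code the core of SGD needs, here is a minimal sketch that fits a straight line by per-sample gradient steps on squared error. It is a linear model with no hidden layers, so there is nothing to backpropagate through; the data, learning rate, and epoch count are arbitrary choices for illustration, and samples are visited in a fixed order (real SGD shuffles them) so the run is reproducible.

```python
# Made-up noiseless data drawn from y = 2x + 1.
DATA = [(x / 4, 2 * (x / 4) + 1) for x in range(5)]


def sgd_fit(data, lr=0.1, epochs=2000):
    """Fit y ~ w*x + b by per-sample gradient steps on squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            error = (w * x + b) - y
            # Gradients of 0.5 * error**2 with respect to w and b.
            w -= lr * error * x
            b -= lr * error
    return w, b


w, b = sgd_fit(DATA)
print(round(w, 3), round(b, 3))  # approaches the true slope 2 and intercept 1
```

Pre-built frameworks wrap this same loop and compute the gradients automatically, which is what makes it possible to experiment with features and architectures without deriving the math by hand.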


I wasn't talking about back propagation. But sometimes you need to change the loss function, or the shape of the network. Or combine two models. Back propagation is the same for those examples, but not other math stuff.

The only thing that makes you think it is easy is because you are just copying what others have been doing and you don't change anything. Try to go beyond that and you will change your mind quite quickly.


The reason you need an education in this theory is twofold. How do you fix something that is broken in limited time? How do you assure that this model is reliable? Being able to answer these confidently, from a place of reason derived from theory, is going to be the real value.


Sure, but it's a continuum, not a binary dichotomy. Just like you can do more with your car if you have degrees in mechanical engineering and fluid dynamics, but a person with nothing but a high-school diploma can upgrade a camshaft.

The point is, you can do a lot of very useful things with ML, without needing the entirety of the theoretical underpinnings. Of course you can't do everything but not everybody needs to be able to do everything.


So as I said, you can copy what others do. That is fine, but you don't know deep learning; you know how to apply it based on examples, which is fine for a lot of things.


Have boot camps noticeably suppressed wages for software engineering in general?



