AutoMLPipeline – Create and evaluate machine learning pipeline architectures

sgt101 · on March 1, 2020

This is a lovely bit of programming, and it showcases how amazing Julia is. But, standard warning: pouring data into a machine and looking for the best results according to your test of choice is highly unlikely to yield a good real-world outcome. You might get something that appears to be useful for a while, but there is every reason to believe that it will blow up in your face further down the line.

You are literally playing with things that you don't understand! Don't do this, kids!

willj · on March 1, 2020

Can you expand on this? If you’re monitoring for data drift and retraining every so often after deployment (not just “set it and forget it”), what are the problems that can happen?

wrkronmiller · on March 1, 2020

WARNING: NOT AN ML EXPERT.

I believe this falls under the category of "data snooping" wherein you are effectively creating a model-of-models and increasing your degrees of freedom. That increased level of complexity/number of DoF means you are far more likely to over-train.

You are more likely to pick a model with no predictive power that "happened to be right" about past data.
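A minimal sketch of that effect in Python, under assumptions that are mine and not from the project: the labels are pure coin flips, each "model" is just a random guesser, and the function name and parameters are hypothetical. It picks the guesser that scores best on a small selection set and then checks it on fresh data.

```python
import random


def snooping_demo(n_models=200, n_select=50, n_fresh=5000, seed=0):
    """Pick the best of many random guessers, then score it on fresh data."""
    rng = random.Random(seed)

    # Labels are fair coin flips, so no model can truly beat 50% accuracy.
    select_labels = [rng.randint(0, 1) for _ in range(n_select)]
    fresh_labels = [rng.randint(0, 1) for _ in range(n_fresh)]

    def accuracy(preds, labels):
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)

    best_score, best_fresh = -1.0, None
    for _ in range(n_models):
        # Each hypothetical "model" guesses at random on both data sets.
        select_preds = [rng.randint(0, 1) for _ in range(n_select)]
        fresh_preds = [rng.randint(0, 1) for _ in range(n_fresh)]

        score = accuracy(select_preds, select_labels)
        if score > best_score:
            # Keep the winner on the selection set and remember how the
            # same model does on data that played no part in choosing it.
            best_score = score
            best_fresh = accuracy(fresh_preds, fresh_labels)

    return best_score, best_fresh


if __name__ == "__main__":
    picked, fresh = snooping_demo()
    print(f"accuracy on the data used to pick the winner: {picked:.2f}")
    print(f"accuracy of that winner on fresh data:        {fresh:.2f}")
```

The winner looks clearly better than chance on the data used to choose it and falls back to roughly 50% on fresh data. The more candidates you search over, the bigger that gap tends to get, which is why a final holdout that is never used for selection matters.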

ldoughty · on March 1, 2020

If you're not careful, you're training your algorithm to continue promoting a bias... Perhaps one that you're not even aware of... Or perhaps someone else's bias imparts a bias on your algorithm (like turning your chat bot into a Nazi sympathizer -- an extreme example, but it could also be someone else's refusal of loans to minority groups, and your algorithm might look at debt ratio as a factor)

You can't keep assuming your inputs are always good... and while it may be okay to do some automated retraining, you really need a skilled person to revisit the inputs/outputs...

pilooch · on March 1, 2020

Agreed. AutoML is useful, but modeling and ensuring correct hypotheses are taken into account comes first.

sgt101 · on March 1, 2020

Totally. Asking the right question is the hardest step of all, and in my experience normally the second to last one (the last one being answering it, which often takes about 20 minutes).

kmax12 · on March 1, 2020

I agree. My experience has been that most machine learning tools help answer a question, but don't provide APIs and methods to help you define the specific question you want to solve.

This is actually why I work on an open source library called Compose[0]. The goal of Compose is to help people define prediction problems and extract training examples. By using a standardized API, the tool can then automatically search the data for the training examples you can use for machine learning.

We're still in the early days of developing this tool, but we hope to structure and standardize this earlier step of the machine learning process that is normally ignored.

[0] https://github.com/FeatureLabs/compose

scottlocklin · on March 1, 2020

I'm confused, possibly because I never invested the time in Julia: is this some improvement over the standard Julia data pipeline? It looks like a Hadleyverse-flavored R pipeline.

ddragon · on March 1, 2020

It seems like a new competing approach in the ML space. The standard Julia data pipeline right now usually involves either directly using an independent package for each step (or rolling your own, since Julia has strong native support for fast data/number manipulation) or using a package that combines all those functionalities, like MLJ.jl [1] for ML and Queryverse.jl [2] for data manipulation and visualization. The advantage is that since Julia focuses on composability, those frameworks can easily share core functionality, so library creators don't need to reinvent the wheel and all frameworks can benefit from each other's work.

[1] https://github.com/alan-turing-institute/MLJ.jl

[2] https://www.queryverse.org/

isusmelj · on March 1, 2020

Haven't seen the |> operator before. Is this Julia specific?

e.g. used here: pdec = @pipeline (numf |> pca) + (numf |> ica)

truculent · on March 1, 2020

I believe that it appears in a few other functional languages. Off the top of my head:

* F# and Elm have the same operator

* Haskell has an equivalent (I think) in `&` from `Data.Function`, which shows up a lot in code using the lens library. In the standard library there is also `$`, which functions as a “reverse” pipe

* R’s magrittr (part of the tidyverse which almost functions as an enhanced standard library) has `%>%` which works similarly

Renesat · on March 1, 2020

In Julia this operator works like this:

  (x |> f) == f(x)

pizza · on March 1, 2020

That's a nice readme