
Why isn't supervised machine learning more automated?

Why doesn't ML provide a big red "Go" button that automatically applies all the commonly effective techniques (SVM, random forests, ANN, ...) with automatic feature selection and parameter tuning and validation and overfitting-avoidance and whatever else, and return a predictive function that's a weighted average of the most effective models?

Is there some reason such a button wouldn't be useful? (or does the button already exist in a tool somewhere?)
Alex Clemmer
Alex Clemmer, ∀x: ML+x
In order to address the stated question ("Why isn't supervised machine learning more automated?"), we have to begin to understand how complicated supervised learning really is. (We'll get to the other questions in the "details" section in a minute.)

Here's how complicated it is.

Say you work at a lab somewhere. Your boss, Professor George O'Jungle, hands you a box marked "data". He mumbles, "use machine learning to classify this data", and then walks away. Here is a nifty picture:

fig 1: George pictured here in skirt due to it being International Subvert Patriarchy Day


You're pretty busy, and groan out loud:

Q: Is there some way to automate this learning task?
A: No, at least not at this point in time. You don't even know what's in the box. It could be hamsters or something equally ridiculous. Before we do machine learning, we must know what our observable phenomenon is.

Ok, so you decide to find out. You open the box. Inside are papers with numbers on them.
Great, the empirical phenomena are something sane. If the data were, like, bird migration patterns, we'd have to turn real observations about that into useful learnable data. But here we have just a bunch of numbers. Computers are good at doing stuff with numbers, so it looks like we're getting something for free.

So now you ask:

Q: Now can I automate this learning task?
A: Still no. Your data are just numbers on paper. In order to do machine learning, there must be a meaningful interpretation of your data. Interpreting the data is often called "cleaning" the data. Sometimes this step is trivial. For example, if your data are JPEG images, then you should probably just "interpret" each file as a JPEG image. In other cases, it is less straightforward. For example, if your data are emails, then you'll want to remove the headers, HTML tags, images, and other vestigial data that could needlessly harm your algorithm. In other words, you are "interpreting" the data as a set of emails, where "email" is really just "text in the body" or something. Not doing this correctly can literally ruin your ability to do machine learning, so don't ignore it!
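As a concrete sketch of that email-cleaning step (the message, the regex, and the helper name are all made up for illustration; a real pipeline would also walk multipart messages and handle encodings), here is one way to "interpret" a raw email as body text using only the standard library:

```python
import re
from email import message_from_string

# A made-up raw email for illustration.
RAW = """From: george@example.edu
Subject: moo
Content-Type: text/html

<html><body><p>Cows are <b>great</b>.</p></body></html>
"""

def clean_email(raw):
    """Drop the headers, strip HTML tags, and normalize whitespace."""
    msg = message_from_string(raw)
    body = msg.get_payload()              # headers are discarded here
    text = re.sub(r"<[^>]+>", "", body)   # crude tag removal
    return " ".join(text.split())

print(clean_email(RAW))  # -> Cows are great.
```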

Anyway, you now set off to find out how all these pages of numbers should be interpreted. You ask Professor George. He explains that they are images. He shows you how to use his expensive image-interpreting machine. You feed all the papers into this image-interpreting machine:
[Cow image sources: File:Cow female black white.jpg, File:Cow-IMG 2050.JPG]

Now you have a nice stack of images. So once again, you come back to your original question:

Q: Ok, now can I automate this learning task?
A: Sorry, still no. You may have a useful interpretation of your data, but you have not specified your hypothesis set. A hypothesis basically maps things to some set of classes (i.e., it "classifies" things), and a hypothesis set is the set of possible hypotheses. Examples of hypothesis sets are the SVM and the perceptron. Examples of a hypothesis are weight vectors that "classify" your data.  Note that hypothesis sets make assumptions about the data (for example, independence assumptions), so choosing the right model is often a balance between hypotheses that are (1) tractable to learn, and (2) expressive. In order to do machine learning, you need to know which hypothesis class you're using, what your learning algorithm is (e.g., gradient descent), and what classes you're mapping to.
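To make the distinction concrete, here is a minimal perceptron sketch (the toy points and labels are made up): the hypothesis set is "all weight vectors w", each particular w is a hypothesis, and the mistake-driven update rule is the learning algorithm that picks one.

```python
def perceptron(X, y, epochs=20):
    """Learn a weight vector (last entry is the bias) by mistake-driven updates."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if classify(w, xi) != yi:          # mistake: nudge w toward xi
                for j, xj in enumerate(xi + [1.0]):
                    w[j] += yi * xj
    return w

def classify(w, xi):
    """Apply one hypothesis w: which side of the hyperplane is xi on?"""
    return 1 if sum(wj * xj for wj, xj in zip(w, xi + [1.0])) > 0 else -1

# Made-up, linearly separable toy data.
X = [[2.0, 1.0], [3.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w = perceptron(X, y)
print([classify(w, xi) for xi in X])  # -> [1, 1, -1, -1]
```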

So once more you consult Professor O'Jungle. He explains that your task is to separate pictures into piles of those that contain cows and those that do not. You can use whatever hypothesis set and learning algorithm you like. "Great!" you think. "I choose the SVM, with whatever learning algorithm is fastest, and the rest will be easy." You sit back and grin.

Q: ... Because now I can completely automate this learning task, right?
A: Wrong. You may have cleaned data, and you may have a task, but you did not specify which features of the images are relevant to learning this task. The reason is that your hypothesis set is making assumptions about your data. In particular, most hypothesis sets designed for classification assume that your data is a [math]d[/math]-dimensional real vector, which can be interpreted as points in [math]d[/math]-dimensional space that need to be "separated." Why and how is a technical question for another time. The question for now is: how do we take images and turn them into vectors?
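The crudest possible answer, as a sketch (the 3x3 "image" is made up, and real feature extraction does far more than this): treat a grayscale image as a grid of pixel intensities and flatten it row by row into a [math]d[/math]-dimensional vector.

```python
# A made-up 3x3 grayscale "image" of pixel intensities in [0, 255].
image = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]

# Flatten row by row: a point in R^9 that a classifier can consume.
vector = [pixel for row in image for pixel in row]
print(vector)  # -> [0, 255, 0, 255, 255, 255, 0, 255, 0]
```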

You head back yet again to Prof. O'Jungle. Your consolation is that you must be most of the way to automating the learning by now. Or so you grimly hope. You ask what features he wants to use. He slaps his forehead and exclaims that he is an idiot for forgetting to show you his feature-extract-o-tron. It's not really automatic, because someone had to make it especially for this problem. And in general, it is not clear how to automatically select these features. One good attempt is convolutional deep belief nets, but they're not quite "there" yet. And anyway, while this solution isn't really automatic, it is automatic for you. And that's good enough.

You slump back at your desk. This had better work. You feed all the images in and get back some nice vectors of real numbers:


Finally, you think:

Q: Ok, surely now I can automate this learning task, right?
A: Sorry, but no. Now you have to do the parameter tuning.

The good news is that this is pretty much automatic. First you randomly split your data into "training" and "testing" sets. Then pick a set of values that seems reasonable for your parameters. Then you train your model on the "training" data. Then you test the model on the "test" data. Finally, you select whichever model had the best results.

There probably isn't a way to completely automate this step away in general, but it's already pretty close.
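The recipe above, as a sketch (the 1-D task and the threshold "model" are made up; in practice you would select on a separate validation split so the test set stays untouched):

```python
import random

random.seed(0)

# Made-up 1-D task: the true rule is "label 1 iff x > 5".
data = [(x, 1 if x > 5 else -1) for x in range(20)]

# 1. Randomly split into "training" and "testing" sets.
random.shuffle(data)
train, test = data[:14], data[14:]

def fit(threshold, points):
    """'Training' is trivial for this toy model; a real model would run
    gradient descent or similar over `points`."""
    return lambda x: 1 if x > threshold else -1

def accuracy(model, points):
    return sum(model(x) == y for x, y in points) / len(points)

# 2. Pick a reasonable set of parameter values, 3. train on the training
# data, 4. score on the held-out data, 5. keep the best-scoring setting.
best_t = max(range(20), key=lambda t: accuracy(fit(t, train), test))
print(best_t, accuracy(fit(best_t, train), test))
```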

So you end up with a well-tuned model and a nice result. You look back at the process. None of the steps were really automatic, and in fact, it seems like it would be pretty hard to automate any of them.

You grumble to your colleagues about this:

Why do machine learning at all if it can't be completely automatic?

And the answer is simple: because it actually saved you a lot of time already. Imagine if you had tried to program your cow detector by hand, without using ML. Sure, it's not completely automatic, but no free lunch, right? (Actually it is provable that there is no free lunch, but that's for a different post.)

You might complain that this is how you know machines aren't "really intelligent", but the bottom line is that if we didn't have ML, we would not be able to quickly build things like search engines or cow detectors. That is, we could build them, but we'd build them much slower.

So: we use ML because it helps us to quickly deal with large volumes of arbitrary data. It isn't magic, but it works well enough.

Misc: the "details" section of the question:

The question details ask why we can't completely automate (1) parameter tuning and (2) feature selection. In the case of (1), the short answer is that parameter tuning is already the most automatic thing about supervised ML -- you get a working classifier, then optimize it by running it over some range of parameters, and then select the "best" model. Easy. (2) is much harder, and the short answer is that we're still working on automatic feature selection. Convolutional deep belief nets have great promise, but it will be many years before this is viable, if it ever turns out to be.

If you thought this was useful and you enjoy ML, you might also enjoy my Quora blog, which is primarily concerned with ML-ish problems.
Franck Dernoncourt
Franck Dernoncourt, PhD student in AI @ MIT
I think it's just a matter of time before more end-to-end solutions appear and mature.

Some ideas: Open-source MLaaS (Machine Learning as a Service):

There's a project called MLbase under development at UC Berkeley.  It's designed with distributed computing in mind, and another goal is to automatically (and somewhat efficiently) try many different algorithms and hyperparameters.  The second thing (which they call ML Optimizer) isn't ready yet, as far as I know.  For now, you might find their Scala/Spark implementations of distributed algorithms useful if you decide to roll your own model search scheme.

More details from their website:
  • MLlib: A distributed low-level ML library written against the Spark runtime that can be called from Scala and Java. The library includes common algorithms for classification, regression, clustering and collaborative filtering.
  • MLI: An API / platform for feature extraction and algorithm development that introduces high-level ML programming abstractions. MLI is currently implemented against Spark, leveraging the kernels in MLlib when possible, though code written against MLI can be executed on any runtime engine supporting these abstractions. MLI includes more extensive functionality and has a faster development cycle than MLlib.
  • ML Optimizer: This layer aims to simplify ML problems for End Users by automating the task of model selection. The optimizer solves a search problem over feature extractors and ML algorithms included in MLI.

The source code of MLI and MLlib has been released, but the ML Optimizer is under active development and I don't see anywhere to download it (which is annoying, as the ML Optimizer is the layer that would match the OP's requirements). The whole MLbase framework relies on Apache Spark (free, open-source). MLlib can be considered one module on top of it amongst a few other useful ones.


MLlib fits into Spark's APIs and interoperates with NumPy in Python (starting in Spark 0.9), and Spark can be run standalone, on EC2, Mesos or Hadoop.


MLlib's user guide lists all supported machine learning models.


There have been a bunch of publications on MLbase:
  • E. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, J. Gonzalez, M. Franklin, M. I. Jordan, T. Kraska. MLI: An API for Distributed Machine Learning. In International Conference on Data Mining, 2013. (pdf)
  • T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. Franklin, M. I. Jordan. MLbase: A Distributed Machine Learning System. In Conference on Innovative Data Systems Research, 2013. (pdf)
  • A. Talwalkar, T. Kraska, R. Griffith, J. Duchi, J. Gonzalez, D. Britz, X. Pan, V. Smith, E. Sparks, A. Wibisono, M. J. Franklin, M. I. Jordan. MLbase: A Distributed Machine Learning Wrapper. In Big Learning Workshop at NIPS, 2012. (pdf)
Arun Iyer
Arun Iyer, I worked in the area of reinforcement learning as part of my master's thesis.
There is a whole sub-field in Machine Learning called meta-learning dedicated to just this question. There are two aspects to this - Data and Machine Learning Algorithms (Duh!). The challenge is to understand both of them. What do I mean by this?

A machine learning practitioner with several years of experience would ask himself these questions:
1] How big is the data?
2] Is data sparse?
3] Are the features boolean-valued or categorical or bounded-continuous or unbounded continuous or mixed?
4] Are there any missing attribute values? If yes, should I fill in the missing values? If yes, how? Pearson correlation? Matrix completion?
5] Should I normalize the data?
6] Given answer to 3 and 5, how do I normalize the data? Which normalization scheme should I use?
7] Is there bias in the data? Is there Selection bias in how the labeled data was obtained? If there is bias, how do I offset this bias? Importance weighting? Active Learning?
8] Do I need to reduce features? Are there noisy features? Is feature selection necessary? Which feature selection scheme (subset evaluation measure) to use?
One has to answer all these questions every time just to gain an understanding of the data. Now, the answers to the data questions have a large impact on the algorithms used. Ideally, that should not be the case, but unfortunately it is. In many algorithms you can get improved performance just by normalization, but that is not necessarily true for all algorithms. The size of the data dictates how complex your hypothesis set can be. This implies that we need to be able to understand algorithms and rank them not just by their performance but also by their complexities. Almost all the time, the answers and the decisions taken are driven by domain knowledge or domain experience. What's more, what works for one domain may not work for another, which implies that certain observations we make, as users, are domain dependent.
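As a sketch of two of those questions on a hypothetical feature column (the values are made up): mean-impute the missing entry (question 4), then z-score normalize (questions 5 and 6).

```python
# Hypothetical feature column with one missing value (None).
values = [2.0, 4.0, None, 6.0, 8.0]

# Question 4: fill the missing value with the mean of the observed entries.
observed = [v for v in values if v is not None]
fill = sum(observed) / len(observed)
filled = [fill if v is None else v for v in values]

# Questions 5 and 6: z-score normalization (subtract mean, divide by std).
mean = sum(filled) / len(filled)
std = (sum((v - mean) ** 2 for v in filled) / len(filled)) ** 0.5
normalized = [(v - mean) / std for v in filled]
print(normalized)  # -> [-1.5, -0.5, 0.0, 0.5, 1.5]
```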

Hopefully, in the years to come we will see developments in the meta-learning community that make this whole process automated, and alongside, the supervised learning community will gain a better understanding of the datasets and algorithms.

Active Learning:
An alternative viewpoint could be: why try to understand the data and the algorithms at all? Most algorithms make an i.i.d. assumption about the data, and as long as the training data is sufficiently large and well represented, algorithms like SVM are guaranteed to give reasonable generalization performance. Why not just sample enough points and label them so that any underlying bias is eliminated and generalization improves? Active Learning, in some sense, is a step towards this. This is not a particularly bad approach, given that most organizations with data-critical applications can afford a few editorial resources. Combine this with the rise of crowd-sourcing applications like MTurk and CrowdFlower and you can get a very potent classifier at little additional cost.

We will have to wait and see how this space pans out.
Andrej Karpathy
Andrej Karpathy, Machine Learning PhD student at Stanford
The picture is more complicated because
1. data comes in many different forms, but more importantly
2. there are usually many other surrounding objectives you may or may not care about, other than accuracy.

Many of these approaches (SVMs, Neural Nets, Random Forests, PGMs, etc.) have their pros and cons that depend on many variables, for example:
- How much data do you have w.r.t. dimensionality?
- How "easy" do you suspect your problem to be? Is it likely linearly separable? Equivalently, how good are your features?
- Do you have missing data? Sparse data? Categorical/binary data mixed in?
- Do you need training to be very fast?
- Do you need testing to be very fast on new out of sample data?
- Do you need a space-efficient implementation?
- Would you prefer a fixed-size (parametric) model, or is non-parametric ok?
- Do you want to train the algorithm online as the data "streams" in?
- Do you want confidences or probabilities about your final predictions?
- How interpretable do you want your final model to be?
etc. etc. etc.

I do agree that for smaller datasets, where one does not care much about efficiency, one can in principle automate a large portion of this process. I think Weka is the closest to what you have in mind.
Satvik Beri
Satvik Beri, Mad (Data) Scientist
One issue is that "best practices" aren't quite standard enough yet. The trick is to come up with something that's "off-the-shelf" enough to be easy but allows enough customization to be useful. Decision Trees and Linear Regression can work with pretty much any dataset - it's easy to just throw these algorithms at whatever you have. You could certainly make a program that lets you easily create an ensemble of these, and that's pretty much available off-the-shelf with scikit-learn (scikit-learn: machine learning in Python).
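A hedged sketch of that scikit-learn route (assumes scikit-learn is installed; logistic regression stands in for linear regression, since VotingClassifier expects classifiers):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "Throw the off-the-shelf algorithms at whatever you have" as an ensemble.
ensemble = VotingClassifier(estimators=[
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
ensemble.fit(X_tr, y_tr)
print(ensemble.score(X_te, y_te))  # typically well above 0.9 on this toy task
```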

Neural Networks, SVMs, and Deep Learning take significantly more preprocessing. For SVMs I think the preprocessing is fairly standard at this point, so someone could just write a tweakable program to do so. For Neural Networks it's harder, and for Deep Learning it's currently very difficult.