Amazon Machine Learning gives data science newbies easy-to-use solutions for the most common problems
By Martin Heller as written on infoworld.com
As a physicist, I was originally trained to describe the world in terms of exact equations. Later, as an experimental high-energy particle physicist, I learned to deal with vast amounts of data subject to measurement error, and to evaluate competing models for describing it. Business data, taken in bulk, is often messier and harder to model than the physics data on which I cut my teeth. Simply put, human behavior is complicated, inconsistent, and not well understood, and it's affected by many variables.
If your intention is to predict which previous customers are most likely to subscribe to a new offer, based on historical patterns, you may discover there are non-obvious correlations in addition to obvious ones, as well as quite a bit of randomness. When graphing the data and doing exploratory statistical analyses don’t point you at a model that explains what’s happening, it might be time for machine learning.
Amazon’s approach to a machine learning service is intended to work for analysts who understand the business problem being solved, whether or not they understand data science and machine learning algorithms. As we’ll see, that intention gives rise to different offerings and interfaces than you’ll find in Microsoft Azure Machine Learning (click for my review), although the results are similar.
With both services, you start with historical data, identify a target for prediction from observables, extract relevant features, feed them into a model, and allow the system to optimize the coefficients of the model. Then you evaluate the model, and if it’s acceptable, you use it to make predictions. For example, a bank may want to build a model to predict whether a new credit card charge is legitimate or fraudulent, and a manufacturer may want to build a model to predict how much a potential customer is likely to spend on its products.
In general, you approach Amazon Machine Learning by first uploading and cleaning up your data; then creating, training, and evaluating an ML model; and finally by creating batch or real-time predictions. Each step is iterative, as is the whole process. Machine learning is not a simple, static, magic bullet, even with the algorithm selection left to Amazon.
Amazon Machine Learning can read data -- in plain-text CSV format -- that you have stored in Amazon S3. The data can also come to S3 automatically from Amazon Redshift and Amazon RDS for MySQL. If your data comes from a different database or another cloud, you’ll need to get it into S3 yourself.
When you create a data source, Amazon Machine Learning reads your input data; computes descriptive statistics on its attributes; and stores the statistics, the correlations with the target, a schema, and other information as part of the data source object. The data is not copied. You can view the statistics, invalid value information, and more on the data source’s Data Insights page.
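To make those descriptive statistics concrete, here is a minimal sketch in Python of the kind of per-attribute summary and target correlation a data source captures. The attribute names (age, balance, subscribed) are hypothetical, and Pearson correlation is just one plausible stand-in for whatever correlation measure the service actually computes.

```python
# Sketch of per-attribute statistics and correlation with the target.
# Attribute names are hypothetical; Amazon ML's exact measures may differ.

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rows = [
    {"age": 25, "balance": 1200.0, "subscribed": 0},
    {"age": 34, "balance": 2300.0, "subscribed": 0},
    {"age": 47, "balance": 5100.0, "subscribed": 1},
    {"age": 52, "balance": 6800.0, "subscribed": 1},
]
target = [row["subscribed"] for row in rows]

for attr in ("age", "balance"):
    values = [row[attr] for row in rows]
    print(attr, "mean:", mean(values),
          "correlation with target:", round(pearson(values, target), 3))
```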
The schema stores the name and data type of each field; Amazon Machine Learning can read the name from the header row of the CSV file and infer the data type from the values. You can override these in the console.
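Amazon ML schemas distinguish BINARY, CATEGORICAL, NUMERIC, and TEXT attributes. The sketch below shows one plausible way to infer a type from sample values; these heuristics are purely illustrative, not the service’s actual inference rules.

```python
def infer_type(values):
    """Guess a schema attribute type from sample string values.

    Illustrative heuristics only -- Amazon ML's real inference may differ.
    """
    distinct = set(v.strip().lower() for v in values)
    if distinct <= {"0", "1", "true", "false", "y", "n", "yes", "no"}:
        return "BINARY"
    try:
        for v in values:
            float(v)                      # every value parses as a number
        return "NUMERIC"
    except ValueError:
        pass
    if len(distinct) <= max(2, len(values) // 2):
        return "CATEGORICAL"              # a few repeated values
    return "TEXT"                         # mostly unique free-form strings

print(infer_type(["0", "1", "1", "0"]))                                          # BINARY
print(infer_type(["3.2", "14", "7.5"]))                                          # NUMERIC
print(infer_type(["red", "blue", "red", "red"]))                                 # CATEGORICAL
print(infer_type(["free money now", "hi there", "meeting at noon", "see you"]))  # TEXT
```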
You actually need two data sources for Amazon Machine Learning: one for training the model (usually 70 percent of the data) and one for evaluating the model (usually 30 percent of the data). You can presplit your data yourself into two S3 buckets or ask Amazon Machine Learning to split your data either sequentially or randomly when you create the two data sources from a single bucket.
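If you’d rather presplit the data yourself, both strategies are simple. Here is a minimal sketch of the sequential and random 70/30 splits (the seed is my own addition, for reproducibility):

```python
import random

def split_rows(rows, train_fraction=0.7, strategy="random", seed=42):
    """Split rows into a training set and an evaluation set.

    strategy="sequential" takes the first 70 percent for training;
    strategy="random" shuffles first (seeded here for reproducibility).
    """
    rows = list(rows)
    if strategy == "random":
        random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

train, evaluation = split_rows(range(100))
print(len(train), "training rows,", len(evaluation), "evaluation rows")
```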
As I discussed earlier, all of the steps in the Amazon Machine Learning process are iterative, including this one. Over time, data drifts, for a variety of reasons. When that happens, you have to replace your data source with newer data and retrain your model.
Training machine learning models
Amazon Machine Learning supports three kinds of models -- binary classification, multiclass classification, and regression -- and one algorithm for each type. For optimization, Amazon Machine Learning uses Stochastic Gradient Descent (SGD), which makes multiple sequential passes over the training data and updates feature weights for each mini-batch of samples to try to minimize the loss function. Loss functions reflect the difference between the actual value and the predicted value. Gradient descent optimization only works well for continuous, differentiable loss functions, such as the logistic and squared loss functions.
For binary classification, Amazon Machine Learning uses logistic regression (logistic loss function plus SGD). For multiclass classification, Amazon Machine Learning uses multinomial logistic regression (multinomial logistic loss plus SGD). For regression, it uses linear regression (squared loss function plus SGD). It determines the type of machine learning task being solved from the type of the target data.
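To show the binary case end to end, here is a minimal mini-batch SGD implementation of logistic regression (logistic loss) on a tiny synthetic problem. This is a from-scratch sketch of the general technique, not Amazon’s implementation, whose details aren’t public.

```python
import math
import random

def sgd_logistic(data, epochs=50, lr=0.5, batch_size=2):
    """Mini-batch SGD for logistic regression (logistic loss).

    data: list of (features, label) pairs with label in {0, 1}.
    """
    rng = random.Random(0)
    n_features = len(data[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):                       # sequential passes over the data
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            gw = [0.0] * n_features
            gb = 0.0
            for x, y in batch:
                z = sum(wi * xi for wi, xi in zip(w, x)) + b
                p = 1.0 / (1.0 + math.exp(-z))    # predicted probability
                for i, xi in enumerate(x):
                    gw[i] += (p - y) * xi         # gradient of the logistic loss
                gb += p - y
            w = [wi - lr * gi / len(batch) for wi, gi in zip(w, gw)]
            b -= lr * gb / len(batch)
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Tiny synthetic problem: the label is 1 when both features are large.
train = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([0.9, 1.0], 1), ([1.0, 0.8], 1)]
w, b = sgd_logistic(train)
```

Swapping in the squared loss and dropping the sigmoid gives the linear regression case; the multinomial case generalizes the same update to one weight vector per class.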
While Amazon Machine Learning does not offer as many choices of model as you’ll find in Microsoft’s Azure Machine Learning, it does give you robust, relatively easy-to-use solutions for the three major kinds of problems. If you need other kinds of machine learning models, such as unsupervised cluster analysis, you’ll need to work outside of Amazon Machine Learning -- perhaps in an RStudio or Jupyter Notebook instance running in an Ubuntu VM on Amazon EC2, so it can pull data from your Redshift data warehouse in the same availability zone.
Recipes for machine learning
Often, the observable data do not correlate with the goal for the prediction as well as you’d like. Before you run out to collect other data, you usually want to extract features from the observed data that correlate better with your target. In some cases this is simple, in other cases not so much.
To draw on a physical example, some chemical reactions are surface-controlled, and others are volume-controlled. If your observations were of X, Y, and Z dimensions, then you might want to try to multiply these numbers to derive surface and volume features.
For an example involving people, you may have recorded unified date time markers, when in fact the behavior you are predicting varies with time of day (say, morning versus evening rush hours) and day of week (specifically workdays versus weekends and holidays). If you have textual data, you might discover that the goal correlates better with bigrams (two words taken together) than unigrams (single words), or the input data is in random cases and should be converted to lowercase for consistency.
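The derived features above can all be sketched in a few lines of Python. Everything here -- the dimensions, the rush-hour windows, the sample text -- is hypothetical and purely illustrative:

```python
from datetime import datetime

def surface_and_volume(x, y, z):
    """Derive surface area and volume from X, Y, Z dimensions of a box."""
    return {"surface": 2 * (x * y + y * z + x * z), "volume": x * y * z}

def time_features(timestamp):
    """Derive behavioral features from a unified ISO-format timestamp."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "hour": dt.hour,
        "is_weekend": dt.weekday() >= 5,
        "is_rush_hour": dt.weekday() < 5 and (7 <= dt.hour < 10 or 16 <= dt.hour < 19),
    }

def bigrams(text):
    """Lowercase the text for consistency and emit bigrams (adjacent word pairs)."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(surface_and_volume(1, 2, 3))
print(time_features("2016-03-07T08:30:00"))   # a Monday morning
print(bigrams("The Quick Brown Fox"))
```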
Choices of features in Amazon Machine Learning are held in recipes. Once the descriptive statistics have been calculated for a data source, Amazon will create a default recipe, which you can either use or override in your machine learning models on that data. While Amazon Machine Learning doesn’t give you a sexy diagrammatic option to define your feature selection the way that Microsoft’s Azure Machine Learning does, it gives you what you need in a no-nonsense manner.
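Recipes are JSON documents. For a sense of the shape, here is a hypothetical recipe using transformation names from the Amazon ML recipe language of the time (lowercase, no_punct, ngram, quantile_bin); the attribute names age, balance, and review_text are invented for illustration:

```json
{
  "groups": {
    "NUMERIC_VARS": "group(age, balance)"
  },
  "assignments": {
    "clean_text": "lowercase(no_punct(review_text))"
  },
  "outputs": [
    "ALL_CATEGORICAL",
    "quantile_bin(NUMERIC_VARS, 10)",
    "ngram(clean_text, 2)"
  ]
}
```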
Evaluating machine learning models
I mentioned earlier that you typically reserve 30 percent of the data for evaluating the model. Evaluation boils down to using the optimized coefficients to calculate predictions for every point in the reserved data source and tallying the loss function for each. From those results, Amazon Machine Learning calculates summary statistics, including an overall prediction accuracy metric, and generates visualizations that help you explore the accuracy of your model beyond that single number.
For a regression model, you’ll want to look at the distribution of the residuals in addition to the root mean square error. For binary classification models, you’ll want to look at the area under the Receiver Operating Characteristic curve, as well as the prediction histograms. After training and evaluating a binary classification model, you can choose your own score threshold to achieve your desired error rates.
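Choosing a score threshold is a trade-off between error rates. This sketch, using made-up model scores on a hypothetical held-out set, shows how the true-positive and false-positive rates (the quantities behind the ROC curve) move as the threshold changes:

```python
def rates(scores, labels, threshold):
    """True-positive and false-positive rates at a given score threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

# Hypothetical model scores on a held-out evaluation set.
scores = [0.95, 0.85, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,    1,    1,   0,   1,   0,   0,   0]

for threshold in (0.25, 0.5, 0.75):
    tpr, fpr = rates(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Raising the threshold here trades missed positives for fewer false alarms, which is exactly the dial Amazon Machine Learning lets you turn after evaluation.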
For multiclass models the macro-average F1 score reflects the overall predictive accuracy, and the confusion matrix shows you where the model has trouble distinguishing classes. Once again, Amazon Machine Learning gives you the tools you need to do the evaluation in parsimonious form: just enough to do the job.
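Both multiclass metrics are straightforward to compute by hand. This sketch builds a confusion matrix (rows are actual classes, columns are predicted) and the macro-average F1 score -- the unweighted mean of each class’s F1 -- on invented labels:

```python
from collections import Counter

def confusion_and_macro_f1(actual, predicted, classes):
    """Confusion matrix (rows: actual, columns: predicted) and macro-average F1."""
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in classes] for a in classes]
    f1_scores = []
    for c in classes:
        tp = counts[(c, c)]
        fp = sum(counts[(a, c)] for a in classes if a != c)   # predicted c, wasn't
        fn = sum(counts[(c, p)] for p in classes if p != c)   # was c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return matrix, sum(f1_scores) / len(f1_scores)

actual    = ["cat", "cat", "dog", "dog", "bird", "bird"]
predicted = ["cat", "dog", "dog", "dog", "bird", "cat"]
matrix, macro_f1 = confusion_and_macro_f1(actual, predicted, ["bird", "cat", "dog"])
print(matrix)
print(round(macro_f1, 3))
```

An off-diagonal hot spot in the matrix points at the pair of classes the model has trouble telling apart.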
Once you have a model that meets your evaluation requirements, you can use it to set up a real-time Web service or to generate a batch of predictions. Bear in mind, however, that unlike physical constants, people’s behavior varies over time. You’ll need to check the prediction accuracy metrics coming out of your models periodically and retrain them as needed.
As I worked with Amazon Machine Learning and compared it with Azure Machine Learning, I constantly noticed that Amazon lacks most of the bells and whistles in its Azure counterpart, in favor of giving you merely what you need. If you’re a business analyst doing machine learning predictions for one of the three supported models, Amazon Machine Learning could be exactly what the doctor ordered. If you’re a sophisticated data analyst, it might not do quite enough for you, but you’ll probably have your own preferred development environment for the more complex cases.