Over at the All Things Distributed blog, Werner Vogels writes that the new Amazon SageMaker is designed for building machine learning algorithms that can handle “an infinite amount of data.”
In machine learning, more is usually more. For example, training on more data usually yields more accurate models. To help its customers train on ever-larger datasets, AWS launched Amazon SageMaker at its recent re:Invent conference.
For many customers, the amount of data that they have is indistinguishable from infinite. Bill Simmons, CTO of Dataxu, states, “We process 3 million ad requests a second – 100,000 features per request. That’s 250 trillion ad requests per day. Not your run-of-the-mill data science problem!” For these customers and many more, the notion of “the data” does not exist. The data is not static; new records keep arriving. Their answer to the question “how much data do you have?” is “how much can you handle?”
To make things even more challenging, a system that can handle a single large training job is not nearly good enough if training jobs are slow or expensive. Machine learning models are usually trained tens or hundreds of times. During development, many different versions of the eventual training job are run. Then, to choose the best hyperparameters, many training jobs are run simultaneously with slightly different configurations. Finally, re-training is performed every few minutes, hours, or days to keep the models updated with new data. In fraud or abuse prevention applications, models often need to react to new patterns in minutes or even seconds!
To that end, Amazon SageMaker offers algorithms that train on indistinguishable-from-infinite amounts of data both quickly and cheaply. That sounds like a pipe dream, yet it is exactly what we set out to do.
To handle unbounded amounts of data, our algorithms adopt a streaming computational model. In the streaming model, the algorithm makes a single pass over the dataset and works within a fixed memory footprint. These restrictions preclude basic operations like storing the data in memory, random access to individual records, shuffling the data, reading through the data several times, etc.
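To make the streaming model concrete, here is a minimal sketch of a one-pass learner in Python. It is an illustration under stated assumptions, not SageMaker's implementation: the `synthetic_stream` generator, the least-squares objective, and the learning rate are all hypothetical choices made for the example. The point it shows is that each record is read once and then dropped, and the only state kept is a weight vector whose size does not depend on how many records arrive.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[List[float], float]  # (features, label)


def synthetic_stream(n: int, seed: int = 0) -> Iterator[Record]:
    """Hypothetical data source: yields records one at a time, never a full list."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        # Assumed ground truth for the demo: y = 2*x0 - 3*x1 + 0.5 plus small noise.
        y = 2.0 * x[0] - 3.0 * x[1] + 0.5 + rng.gauss(0.0, 0.01)
        yield x, y


def train_one_pass(stream: Iterable[Record], dim: int, lr: float = 0.1):
    """Single-pass SGD for least-squares regression.

    State is `dim` weights, one bias, and a counter, whatever the stream length.
    """
    w = [0.0] * dim
    b = 0.0
    seen = 0
    for x, y in stream:  # each record is read once, then discarded
        pred = sum(wi * xi for wi, xi in zip(w, x)) + b
        err = pred - y
        for i in range(dim):
            w[i] -= lr * err * x[i]  # gradient step on the squared error
        b -= lr * err
        seen += 1
    return w, b, seen


weights, bias, n_seen = train_one_pass(synthetic_stream(20_000), dim=2)
print(n_seen, [round(v, 2) for v in weights], round(bias, 2))
```

Because the learner never shuffles, indexes, or revisits records, the same loop would work unchanged on a generator that reads from a socket or an object store, which is what makes the fixed-memory assumption practical.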
Amazon SageMaker offers production-ready, infinitely scalable algorithms such as:
- Linear Learner
- Factorization Machines
- Neural Topic Modeling
- Principal Component Analysis (PCA)
- K-Means clustering
- DeepAR forecasting
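As a rough illustration of how one algorithm on this list can fit the streaming model, the sketch below implements sequential (online) k-means in Python. This is a textbook variant chosen for brevity, not a description of how SageMaker's K-Means works; the `two_blob_stream` generator and the choice to seed centers from the first `k` points are assumptions made for the example.

```python
import math
import random
from typing import Iterable, Iterator, List, Tuple

Point = Tuple[float, float]


def two_blob_stream(n: int, seed: int = 0) -> Iterator[Point]:
    """Hypothetical source: alternates points near (0, 0) and near (10, 10)."""
    rng = random.Random(seed)
    for i in range(n):
        cx = 0.0 if i % 2 == 0 else 10.0
        yield (cx + rng.gauss(0.0, 0.5), cx + rng.gauss(0.0, 0.5))


def online_kmeans(stream: Iterable[Point], k: int) -> List[List[float]]:
    """Sequential k-means: one pass, state is k centers plus k counters."""
    centers: List[List[float]] = []
    counts: List[int] = []
    for p in stream:
        if len(centers) < k:  # seed the centers with the first k points
            centers.append(list(p))
            counts.append(1)
            continue
        # Assign the point to its nearest center.
        j = min(range(k), key=lambda c: math.dist(centers[c], p))
        counts[j] += 1
        # Move that center toward the point by a 1/count step (running mean).
        step = 1.0 / counts[j]
        centers[j][0] += step * (p[0] - centers[j][0])
        centers[j][1] += step * (p[1] - centers[j][1])
    return centers


centers = sorted(online_kmeans(two_blob_stream(2_000), k=2))
print([[round(c, 1) for c in center] for center in centers])
```

Memory use is `k` centers and `k` counters no matter how many points pass through. The trade-off is that a single pass with naive seeding is sensitive to the order of arrival, which is one reason production systems typically use more careful initialization.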
I think the time is here for using large-scale machine learning in large-scale production systems. Companies with truly massive and ever-growing datasets need not fear the overhead of operating large ML systems or developing the associated ML know-how. AWS is delighted to innovate on our customers’ behalf and to be a thought leader, especially in exciting areas like machine learning. I hope and believe that Amazon SageMaker and its growing set of algorithms will change the way companies do machine learning.