How Amazon wants to bridge the data science gap by bringing machine learning to the cloud

This article, How Amazon wants to bridge the data science gap by bringing machine learning to the cloud, originally appeared on TechRepublic.com.

Businesses are increasingly looking for ways to boost their bottom line by mining the data they collect.

But it's difficult for firms to extract meaningful information when data scientists are in short supply.

In response to this skills shortfall, major cloud providers have set up on-demand services to give businesses a chance to get started with machine learning.

Machine learning is a technique that allows computers to look for patterns in data, and it powers the online recommendation engines that suggest books or films you might enjoy. Firms can use machine learning models to make useful predictions, such as 'Is this email spam?' or 'How many items are expected to sell in this region?'

AWS general manager for data science Matt Wood

 Image: Amazon

Amazon, Microsoft and Google provide on-demand machine learning services via their respective cloud platforms, each with differing levels of accessibility to developers without a background in statistics.

Amazon has been using machine learning since its early days as an online bookseller, when it needed a way to help its human editors choose recommendations from its one-million-strong library.

"We decided very early on as an organisation that machine learning was going to be important as our business grew," said Matt Wood, general manager for data science at Amazon Web Services (AWS).

"We had a decision to make. Do we want to go off and hire a whole load of machine learning experts and specialists when those guys and girls are very rare? That's a very rarefied mix of skills of statistics, of cross-validation, of algorithm design."

Instead, Amazon decided to hire a relatively small group of machine learning experts to build an internal service that all of its developers could use.

"We saw this flair of innovation because developers didn't have to spend a whole bunch of time genning up on machine learning to be able to put its benefits to use," he said, citing the service's role in areas such as fulfilment, capacity planning, supply chain management and identifying counterfeit goods.

The public machine learning service offered by AWS today is based on the same algorithms the firm makes available to its staff internally. Customers can train machine learning models on data stored in Amazon's Relational Database Service (RDS) with a MySQL backend, the S3 object store, or the Redshift data warehousing service.

These models can be used to make various types of predictions. Binary classification is used to predict one of two possible outcomes - 'Is this email spam or not?'. Multiclass classification is used to forecast one of three or more possible outcomes and the likelihood of each one - 'Is this product a book, a movie, or an article of clothing?'. Regression is used to predict a number - 'What is the temperature likely to be tomorrow?'.
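To make the three prediction types concrete, here is a minimal Python sketch of what each kind of output looks like. It is purely illustrative and is not Amazon's API: the function names, the 0.5 threshold, and the toy scores are all assumptions made for this example.

```python
from typing import Dict, Tuple


def binary_prediction(spam_score: float, threshold: float = 0.5) -> bool:
    """Binary classification: turn a score into one of two outcomes.

    The threshold is a hypothetical cut-off chosen for this example.
    """
    return spam_score >= threshold


def multiclass_prediction(class_scores: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Multiclass classification: pick the likeliest of three or more labels.

    Scores are normalised so they sum to 1 and can be read as likelihoods.
    """
    total = sum(class_scores.values())
    likelihoods = {label: score / total for label, score in class_scores.items()}
    best_label = max(likelihoods, key=likelihoods.get)
    return best_label, likelihoods


def regression_prediction(slope: float, intercept: float, x: float) -> float:
    """Regression: predict a number, here with a simple straight-line model."""
    return slope * x + intercept


if __name__ == "__main__":
    print(binary_prediction(0.8))  # is this email spam?
    print(multiclass_prediction({"book": 6.0, "movie": 3.0, "clothing": 1.0}))
    print(regression_prediction(slope=2.0, intercept=1.0, x=3.0))
```

In a real service the scores would come from a trained model rather than being passed in by hand; the sketch only shows how the shape of the answer differs between the three types.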

The service will also attempt to automatically validate the data and, where possible, transform it into a more useful form, for example extracting the ZIP code or postcode from an address.
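As a rough illustration of this kind of transformation, the Python sketch below pulls a US ZIP code out of a free-text address. It is not how Amazon's service works internally; the `extract_zip` helper and the regular expression are assumptions made for this example, and the pattern only covers US-style five-digit and ZIP+4 codes.

```python
import re
from typing import Optional

# Five digits, optionally followed by a four-digit ZIP+4 extension.
ZIP_PATTERN = re.compile(r"\b(\d{5})(?:-\d{4})?\b")


def extract_zip(address: str) -> Optional[str]:
    """Return the five-digit ZIP code found in an address, or None.

    The last match is used, on the assumption that the ZIP code comes at
    the end of the address and a five-digit street number comes first.
    """
    matches = ZIP_PATTERN.findall(address)
    return matches[-1] if matches else None


if __name__ == "__main__":
    print(extract_zip("410 Terry Ave N, Seattle, WA 98109"))
    print(extract_zip("12345 Main St, Springfield, IL 62704-1234"))
    print(extract_zip("No postcode here"))
```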

Once the model is built, developers can access it via the AWS console or API calls, allowing the predictions to be fed to an app or online service. Models can be fine-tuned using sliders in the console.

"The developer needs to know very, very little about machine learning. The machine learning chops are managed by the service," Wood said.

Cloud-based services like this make it easier to experiment with machine learning, cutting the time and money needed to learn the skills to get started, he said.

Amazon tested how much easier the service made it for developers to get started with machine learning, tasking two developers without a machine learning background with building a model for predicting a person's gender from their first name.

It took the developers one month to build their model, which was trained using census data and predicted gender with 92 percent accuracy. In contrast, it took a developer with no machine learning knowledge 20 minutes to build the same model with the same prediction accuracy using Amazon's service.

That's not to say these cloud services are suited to everyone's machine learning needs.

For one, while they may reduce the cost of getting started, they can be expensive to use in the long term. Amazon's service can cost about $100 per one million predictions.
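To see how that rate adds up, here is a back-of-the-envelope calculation in Python. It assumes a flat price of $100 per million predictions, as quoted above, and ignores any other charges; the `prediction_cost` helper and the example volumes are hypothetical.

```python
def prediction_cost(num_predictions: int, price_per_million: float = 100.0) -> float:
    """Estimate the cost in dollars of a given number of predictions.

    Assumes a flat rate; real pricing may have other components.
    """
    return num_predictions / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Hypothetical volumes: 250,000, one million and 50 million predictions.
    for volume in (250_000, 1_000_000, 50_000_000):
        print(f"{volume:>12,} predictions: ${prediction_cost(volume):,.2f}")
```

At this rate, an app making 50 million predictions would run up a bill of about $5,000 for predictions alone.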

As one entrepreneur said: "This would be really nice to use at my startup, but it's cost prohibitive even on a very large budget."

The service has also drawn criticism for locking users in, because it does not allow them to export or import models.

"I don't see how any company with a lick of sense would lock down their prediction model into AWS," one user on the developer forum Hacker News said.

In spite of these criticisms of the still-fledgling service, Wood believes it will lead to more experimentation with machine learning at companies that would previously not have known where to start.

"The key for me is productivity and making sure developers have access to this stuff."