Machine Learning: The future of HPC?

By Ramon Lafleur | June 14, 2015

Take large amounts of data processed through Big Data platforms. Throw in a dash of Business Intelligence to add visualization and analysis capabilities. Have machines apply the whole thing, and you get the basics of a Machine Learning platform.

Machine Learning is the latest hype in modern computing. One of the most commonly accepted definitions refers to computer systems that improve as they gain experience from the data they analyze. To reach this goal, researchers, statisticians and data scientists define software models capable of processing and analyzing large amounts of data on dedicated machines. The final objective: to establish predictive models based on correlations in the data and on observed trends.

The reason is clear: analytical work done by humans is already showing its limits and cannot cope with the flood of data that the Internet of Things promises to bring. We must be able to process and analyze massive amounts of data to identify emerging trends and apply the necessary measures in time, to identify discriminating factors, or to detect weak signals that could influence a decision. In an increasingly digital-dependent world, the stakes are high!

Tomorrow, the Internet of Things
Since the Internet of Things is expected to generate 90% of tomorrow’s data, an appropriate response is needed in the form of algorithms that enable machines to analyze this new kind of data in unprecedented quantities. This is where Machine Learning makes sense. But the Internet of Things is not the only reason for Machine Learning. Big Data is equally important. The world already holds 8 zettabytes of data. Some of this data is static, but the dynamic share will grow in the coming years, to reach 40 zettabytes in 2020 according to IDC.

A software + hardware equation for an efficient Machine Learning service
Developments in Machine Learning platforms rest on two pillars: on the one hand, creating more efficient algorithms that can handle, filter and analyze the data; on the other hand, choosing the best equipment for storing, reading, processing, analyzing and writing that data. Together, algorithms, processors and storage devices provide the foundation of a high-performance Machine Learning system.

Three aspects of solving a Machine Learning problem
As Dennis DeCoste, Machine Learning Director at eBay, has stated, a Machine Learning problem should be approached from three angles: randomized algorithms that are able to scale, hardware chosen to meet the requirements of efficient processing, and a model broken down into small, efficient modules that are tested and refined individually, rather than one complex algorithm to optimize.

Some concrete examples of Machine Learning
How useful can Machine Learning be in everyday life? Here are some answers.

Social Networks: given the volume of daily data, this is a rich source of information that is far too complex for human teams to process. Facebook and Twitter form a network of human sensors from which trends can be extrapolated, as IBM’s research teams have demonstrated. So much so that they prefer to call them “social sensors” rather than social networks. On several occasions, the influence of these social sensors has been demonstrated in scholarship and public health (flu). To illustrate, Twitter alone accounts for 250 million tweets per day. For each tweet, IBM’s Cognos Consumer Insights engine considers its virality (the number of retweets of a given tweet), its age (based on the date and time of publication), and the number of tweets with related content (based on a lexical analysis of its content).

Likes: a network such as TripAdvisor, used to rank restaurants and hotels worldwide, generates an enormous amount of data. Machine Learning makes it possible to handle such volumes by applying semantic processing to separate positive feedback from negative feedback. The results can then be widened or narrowed to a given geographic area to identify the top-ranked places. Apply this treatment to another geographic area (city, region or country) and you have a basis for multiple comparisons, for example to list the most popular destinations.

Stock Exchange and Finance: IBM demonstrated how potentially fraudulent financial transactions can be identified by examining the history of raw financial data on all global monetary exchanges, correlated with the financial information provided by the media (Bloomberg). The two types of data were analyzed in depth by IBM’s Watson Automated Integration engine, which helped determine whether the unusual financial transactions observed corresponded to a real event or not.

Sales predictions: this type of analysis correlates data on purchases. Type of purchase (food, clothing, entertainment), products purchased, value of the purchased product, and purchase frequency or seasonality are all elements that can be cross-referenced endlessly to identify rising or falling trends and consumer buying habits, and therefore to make inventory forecasts more accurate.
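
The trend detection described for sales predictions can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the monthly figures are invented, the function names are hypothetical, and a plain least-squares slope stands in for whatever model a real platform would use.

```python
from statistics import mean


def trend_slope(values):
    """Least-squares slope of a series against its index (units per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two periods to estimate a trend")
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


def classify_trends(sales_by_category, threshold=1.0):
    """Label each category 'surging', 'dropping' or 'flat' from its slope."""
    labels = {}
    for category, series in sales_by_category.items():
        slope = trend_slope(series)
        if slope > threshold:
            labels[category] = "surging"
        elif slope < -threshold:
            labels[category] = "dropping"
        else:
            labels[category] = "flat"
    return labels


if __name__ == "__main__":
    # Invented monthly unit sales, for illustration only.
    monthly_sales = {
        "food": [100, 102, 101, 103, 102, 104],
        "clothing": [80, 90, 100, 110, 120, 130],
        "entertainment": [60, 55, 50, 45, 40, 35],
    }
    print(classify_trends(monthly_sales))
```

The threshold is arbitrary; in practice it would be tuned per category, and seasonality would need to be removed before reading anything into the slope.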

Microsoft’s Azure ML initiative

Microsoft Azure Machine Learning is an online service that developers can use to build predictive analytic models (using data sets from multiple data sources), and then easily deploy these models as Web services. Azure ML Studio provides features applicable to many workflow scenarios for building predictive models, gives easy access to data sources, and includes data exploration functions, visualization tools and ready-to-use ML algorithms. The Azure ML platform includes around a hundred ready-to-use scenarios to solve a wide range of scientific problems. Moreover, this scenario library can be extended through R language modules, in order to address situations not covered by the standard functions.

An extensible ML platform
Once an R language module is finalized, it can be shared with colleagues via GitHub. These modules can be used for non-standard data processing, such as handling data formats specific to a given domain, flexible data transformation, or data extraction. Better yet, an R module of your own may incorporate any of the hundreds of scripts pre-installed on Azure ML.

Here is an implementation example: creating an R module that takes data in JSON format and adapts it to the Azure ML dataset format. Such a module is composed of three parts:

An R code file that defines what the module must perform;
Optional accompanying files (configuration files or R scripts);
An XML file that determines the inputs, outputs and module parameters.

One could say that the XML file is the module’s skeleton, and the R code is its muscle. The module accepts input data in JSON format and transforms it to generate a flat dataset. It accepts one parameter, a string that specifies the replacement value for nulls. The corresponding R script is:

 parse_json.R:

 parse_json <- function(data_in, nullvalue="NA") {
   library(RJSONIO)
   library(plyr)
   data_out <- ldply(fromJSON(as.character(data_in[1,1]), nullValue=nullvalue, simplify=TRUE))
   return(data_out)
 }
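
To make the transformation concrete, here is a rough Python analogue of what `parse_json` does: it parses a JSON string, replaces nulls with a chosen value and returns flat rows. It is not part of the Azure ML module (custom modules are written in R), it assumes the input is a JSON array of flat objects, and the name `parse_json_rows` is hypothetical. The exact shape produced by RJSONIO and plyr may differ.

```python
import json


def parse_json_rows(json_text, nullvalue="NA"):
    """Parse a JSON array of flat objects into rows, replacing null values."""
    records = json.loads(json_text)
    if isinstance(records, dict):
        # A single object is treated as a one-row dataset.
        records = [records]
    return [
        {key: (nullvalue if value is None else value) for key, value in record.items()}
        for record in records
    ]


if __name__ == "__main__":
    sample = '[{"item": "aquarium", "price": 49.9}, {"item": "pump", "price": null}]'
    for row in parse_json_rows(sample):
        print(row)
```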

The XML description file defines the name of the module, the R function called to start the execution of the module, the input and output datasets, and any associated parameters.

 parse_json.xml:

 <Module name="Parse JSON Strings">
   <Owner>AzureML User</Owner>
   <Description>This is my module description. </Description>
   <Language name="R" sourceFile="parse_json.R" entryPoint="parse_json"/>
   <Ports>
     <Output id="data_out" name="Parsed dataset" type="DataTable">
       <Description>Combined Data</Description>
     </Output>
     <Input id="data_in" name="JSON formatted dataset" type="DataTable">
       <Description>Input dataset</Description>
     </Input>
   </Ports>
   <Arguments>
     <Arg id="nullvalue" name="Null replacement value" type="string" isOptional="true">
       <Description>Value used to replace JSON null value</Description>
     </Arg>
   </Arguments>
 </Module>

To add this module to your Azure ML account, simply create a zip file containing all the files and upload it by clicking +NEW in your Azure ML Studio workspace. Once uploaded, your module will appear in the “custom” category of the module list, along with the pre-installed modules.
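
The packaging step can be scripted. The following is a minimal sketch using Python's standard `zipfile` module; the function name `package_module` and the archive name are hypothetical, and placing the files at the root of the archive is an assumption to be checked against the Azure ML documentation. The upload itself still happens in Azure ML Studio.

```python
import tempfile
import zipfile
from pathlib import Path


def package_module(files, archive_path):
    """Write the module files into a zip archive, flat at the archive root."""
    files = [Path(f) for f in files]
    missing = [f for f in files if not f.is_file()]
    if missing:
        raise FileNotFoundError(f"missing module files: {missing}")
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for f in files:
            # arcname drops the directory part so files sit at the root.
            archive.write(f, arcname=f.name)
    return archive_path


if __name__ == "__main__":
    # Demo with placeholder files in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "parse_json.R").write_text("# R code goes here\n")
        (tmp / "parse_json.xml").write_text("<Module/>\n")
        out = package_module([tmp / "parse_json.R", tmp / "parse_json.xml"],
                             tmp / "parse_json.zip")
        print(zipfile.ZipFile(out).namelist())
```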

The Google Prediction API
Google offers developers and researchers its Google Cloud platform, which includes the Prediction API. According to Google, this API already makes it possible to:

Predict whether a user will enjoy a movie based on the movies they like
Categorize emails as spam or not
Analyze feedback to determine if your products are popular or not
Predict the expenses of users based on their consumption habits

Google offers users of this API two sample sites that use it, one for spam detection and the other for predicting film preferences. The Prediction API is accessible via a RESTful interface, and Google provides several libraries and scripts to access it. The API provides pattern-matching and Machine Learning capabilities based on example data that you have to supply. Fed with this data, the model evolves into a trained model capable of responding to requests.

The number of methods that can be called directly in version 1.6 of the Prediction API is rather small, and is limited to:

prediction.hostedmodels.predict
	Submits input data and requests output from a hosted model
prediction.trainedmodels.analyze
	Gets the analysis of the model and of the data on which it was trained
prediction.trainedmodels.delete
	Deletes a trained model
prediction.trainedmodels.get
	Checks the training status of your model
prediction.trainedmodels.insert
	Trains a Prediction API model
prediction.trainedmodels.list
	Lists the available models
prediction.trainedmodels.predict
	Submits the id of a model and requests a prediction
prediction.trainedmodels.update
	Adds new data to a trained model
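
As an illustration of the RESTful interface, the sketch below assembles (but does not send) a `prediction.trainedmodels.predict` call. The URL layout and the `csvInstance` request body are assumptions based on the v1.6 REST reference and should be verified against Google's documentation; the project and model names are hypothetical, and OAuth 2.0 authentication is left out entirely.

```python
import json
from urllib.parse import quote

# Assumed base URL for version 1.6 of the API.
BASE_URL = "https://www.googleapis.com/prediction/v1.6/projects"


def build_predict_request(project, model_id, features):
    """Return (method, url, headers, body) for a trainedmodels.predict call."""
    url = "{}/{}/trainedmodels/{}/predict".format(
        BASE_URL, quote(project, safe=""), quote(model_id, safe="")
    )
    # Assumed body shape: the feature values of one instance, in column order.
    body = json.dumps({"input": {"csvInstance": list(features)}})
    headers = {"Content-Type": "application/json"}
    return "POST", url, headers, body


if __name__ == "__main__":
    method, url, headers, body = build_predict_request(
        "my-project", "spam-model", ["Win a free prize now"]
    )
    print(method, url)
    print(body)
```

Keeping the request construction separate from the network call makes it easy to inspect what would be sent before any credentials are involved.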

The main steps in the use of the prediction API
A prediction model is only as good as the quality of the data it is fed with, which calibrates the model so that it returns relevant information in any application.

1 / Create your training data. You must create data to feed your model, suited to the questions you need answered. This is the most critical and complex step. Alternatively, you can use a pre-trained model from the hosted model gallery and go straight to Step 4: Send a prediction query.

2 / Send your training data to Google Cloud Storage using standard Google Cloud Storage tools.

3 / Train the model with your data. The Prediction API will train your model with the data whose location you specify. You must query the Prediction API to check whether the training has finished.

4 / Send a prediction query. Once the training phase is over, you can send a request to your model via the Prediction API. You will get back a response as a numerical value or as text, depending on the data you have fed your model.

5 / (optional) Send additional data to your model. If you have a steady stream of new data consistent with the data used to initially feed your model, adding it will improve the relevance of the results.

The two most important aspects of using the Prediction API are creating and structuring the data supplied to the model, and formulating a relevant question that the API is able to answer.
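
Structuring the training data (step 1) might look like the sketch below, which writes labelled examples as CSV text. The layout used here, with the target value in the first column, the features after it and no header row, is an assumption about the Prediction API's training format and should be checked against Google's documentation; the example rows and the name `training_csv` are invented.

```python
import csv
import io


def training_csv(examples):
    """Serialize (target, features) pairs as CSV: target first, no header."""
    buffer = io.StringIO()
    # Quote text fields so commas inside them do not break the columns.
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for target, features in examples:
        writer.writerow([target, *features])
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented spam/ham examples for a categorization model.
    examples = [
        ("spam", ["Win a free prize now"]),
        ("ham", ["Lunch at noon?"]),
    ]
    print(training_csv(examples), end="")
```

The resulting text is what would be uploaded to Google Cloud Storage in step 2.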

Relevant predictions?
The term “prediction” may sound misleading because the Prediction API is only able to accomplish two specific tasks:

Given a new item, predict a numerical value for it, based on similar values in the data used to train the model.
Given a new item, select the category that best describes it, based on a set of similar categorized items in the data used for training.

Both tasks can seem limited. However, if you formulate your request carefully and choose your data accordingly, the Google Prediction API will let you infer a user’s preferences and project future values consistent with the data used to train your model.
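
The two tasks correspond to regression (predict a number) and classification (pick a category). The toy sketch below illustrates both with a nearest-neighbour rule. The article does not say which algorithms the Prediction API uses, so this technique is chosen purely for illustration, and all names and data are hypothetical.

```python
from collections import Counter
from math import dist


def nearest_targets(training, item, k):
    """Targets of the k training examples closest to item (Euclidean distance)."""
    if not training:
        raise ValueError("training data is empty")
    ranked = sorted(training, key=lambda example: dist(example[0], item))
    return [target for _, target in ranked[:k]]


def predict_value(training, item, k=3):
    """Regression: average the numeric targets of the nearest examples."""
    neighbours = nearest_targets(training, item, k)
    return sum(neighbours) / len(neighbours)


def predict_category(training, item, k=3):
    """Classification: majority vote among the nearest examples."""
    votes = Counter(nearest_targets(training, item, k))
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Each example is (feature vector, target).
    prices = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
    print(predict_value(prices, (2.0,)))

    groups = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"), ((8, 8), "b"), ((9, 8), "b")]
    print(predict_category(groups, (1.5, 1.5)))
```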

eBay: 100 million queries per day
The data volume of a global platform such as eBay is enough to make your head spin: 100 million users (buyers and sellers) and 10 million items per day are processed and analyzed to determine usage habits: searches, clicks, wish lists, auctions and purchases. As for the item data, it is the prices, titles, descriptions and images that are processed, with a history of several billion items.

Machine Learning to establish basic properties
These analyses establish the intrinsic properties of a given family of items. An aquarium is thus defined by characteristics such as length, height and width, but also by its volume, weight, accessories, etc. This makes it possible to establish a complete taxonomy, and also to determine how items are distributed: 87% of the categories contain 17% of the items on eBay and, conversely, 1% of the categories contain 52% of them.
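
A distribution figure of this kind (the share of items held by the largest categories) is simple to compute once item counts per category are known. The sketch below uses invented counts and a hypothetical function name; it only shows the arithmetic, not eBay's actual data or tooling.

```python
def share_in_top_categories(item_counts, top_fraction):
    """Fraction of all items held by the largest top_fraction of categories."""
    if not item_counts or sum(item_counts) == 0:
        raise ValueError("need at least one non-empty category")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    ordered = sorted(item_counts, reverse=True)
    # Always keep at least one category, even for very small fractions.
    top_n = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:top_n]) / sum(ordered)


if __name__ == "__main__":
    # Invented item counts for five categories.
    counts = [50, 30, 10, 5, 5]
    print(share_in_top_categories(counts, 0.2))
    print(share_in_top_categories(counts, 0.4))
```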

The search engine, eBay’s flagship application
The application on eBay that attracts most of the Machine Learning developments is the search feature. Full-text search requires a semantic decomposition, in addition to correlated criteria such as the cheapest or closest item. Behind the scenes, this result is achieved by applying a temporal data mining algorithm based on Hadoop, called Mobius. This massively parallel mechanism can handle the 100 million daily queries with a very good response time of around one second.

Scalability? eBay is working on it …
eBay’s Machine Learning team does not intend to stop there, and is already considering kernel, hashing and random projection algorithms in order to cover more users and items. This is not the only element that needs adjusting to meet future needs. The ML teams have determined that GPU acceleration improves performance by a factor of 20, by replacing 12 CPU cores at 3.5 GHz (i.e. 168 gigaflops) with an NVIDIA GTX 690 solution capable of delivering 5.5 teraflops.
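
The peak figures quoted above can be checked with a little arithmetic. The 4 floating-point operations per cycle per core used below is an assumption, inferred only because 12 cores × 3.5 GHz × 4 gives the 168 gigaflops stated in the text; it is not given by the source.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak throughput in gigaflops."""
    return cores * clock_ghz * flops_per_cycle


# Assumed: 4 floating-point operations per cycle per core (see note above).
cpu_peak = peak_gflops(cores=12, clock_ghz=3.5, flops_per_cycle=4)

# GPU peak as quoted in the text: 5.5 teraflops, expressed in gigaflops.
gpu_peak = 5.5 * 1000

peak_ratio = gpu_peak / cpu_peak

if __name__ == "__main__":
    print(f"CPU peak: {cpu_peak:.0f} GFLOPS")
    print(f"GPU peak: {gpu_peak:.0f} GFLOPS")
    print(f"Peak ratio: {peak_ratio:.1f}x")
```

The ratio of theoretical peaks comes out at roughly 33, so the reported factor of 20 sits below that ceiling, which is consistent with a measured speed-up rather than a comparison of peak numbers.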

As we have seen, Machine Learning is a discipline that depends as much on algorithms as on the quality of the analyzed data. And scaling to data volumes of a greater magnitude can only be achieved with hardware platforms that use the most powerful CPUs and GPUs available to meet the growing requirements of this new discipline.