This post introduces machine learning, provides context for the Apache Mahout project, and offers some specifics about recommender systems. Then, using Amazon Elastic MapReduce (EMR), we’ll tour the workflows for building a simple movie recommender and for writing and running a simple web service to provide results to client applications. Finally, we’ll list some ways to learn more and engage with the Mahout community.
Machine LearningMachine learning has its roots in artificial intelligence. The term implies that machine learning tools bring cognition and automated decision-making to data problems, but currently machine learning methods do not include computer thought. Even so, machine learning tools usually do employ some type of automated decision making, often iteratively working toward minimizing or maximizing a specific measurement about the performance of a model.
The field of machine learning encompasses many topics and approaches, usually falling into the categories of classification, clustering, and recommenders.
Classification is the process of predicting an unknown (dependent) variable based on a combination of other known (independent) variables, such as predicting a bank's customer attrition or predicting subscribers to a music service. In both of those cases, we use known variables about customers to predict their tendency to cancel their accounts. The following table lists several possible known variables:
|Demographic||City, state, age, gender|
|Behavioral||Bank customers spending habits
How often music service subscribers play certain artists
|Fees assessed to bank customers
How often music service customers experienced buffered music stream
Clustering, is the process of finding clumps, or groupings, of things in space. Geometrically, we usually talk about clustering vectors in some n-dimensional space. For example, the figure below shows four pairs of people represented by vectors in two-dimensional space where each dimension is some type of spending category, in this case entertainment or groceries.
In the top-left corner are two people who spend a similar amount but whose spending habits are in very different directions, and who therefore are very dissimilar. By the same reasoning, the two pairs of people-vectors in the bottom row have similar spending habits since their spending is in a similar direction to one another. The clustering we do with machine learning tools usually involves more than two dimensions, often in the range of hundreds to thousands, which the math supports by generalizing to arbitrary finite dimensional spaces.
Key to the topic of clustering is the concept of distance metric or measure that we use to define similarity. Some common measures are Euclidean, which is distance "as the crow flies”; cosine similarity, which maps a small between-vector angle to a high (close to one) similarity, and maps a high between-vector angle to a low (close to zero) similarity, and Tanimoto, which intuitively measures how much two vectors have in common versus how much they could have had in common.
Recommenders are systems that take some input, usually behavior-based, and predict which items users would tend to prefer. The popularity of recommenders over the past decade was spurred largely by publicity about the Netflix Prize, which ran from 2006 through 2009 and rewarded recommendations that beat the existing Netflix systems by a stated threshold.
Recommender performance is measured by comparing predictions with actual preferences, and is often enhanced by A/B testing in production.
Apache MahoutApache Mahout ships with most Hadoop distributions including Amazon Elastic MapReduce (EMR). Mahout is a machine-learning library with tools for clustering, classification, and several types of recommenders, including tools to calculate most-similar items or build item recommendations for users. Mahout employs the Hadoop framework to distribute calculations across a cluster, and now includes additional work distribution methods, including Spark, a computing framework originating in UC Berkeley’s AMPLab that is now an Apache project.
Mahout saw its first bug filed in January 2008, and from there has seen 1,555 Jira tickets filed, with 32 open at this writing. Refining and improving the code and documentation requires a community of contributors and users; the project is currently working toward a 1.0 release.
RecommendersMost people encounter the effect of a recommender system on a website where the recommender's predictions appear in a section of a web page. These recommendations help the site visitor find products to buy, new music to listen to, movies or television shows to watch, colleagues to hire, or potential mates to pursue.
The techniques for building recommenders have evolved since the early 1990s when the GroupLens research team built the USENET article recommender. The raw material has also evolved: in addition to news articles, there is a larger volume of online activity including clickstreams of users who follow links on websites; explicit “likes” and profile views; time spent perusing items or people; listening to music; and watching videos.
These advancements provide hooks into user behavior that help us determine how to recommend things to people. In the USENET example, users scan lists of articles for author and subject, click through, read, and close articles. In an online retail web site, shoppers search for products, browse product pages, click on photos to enlarge them, read reviews, and add products to a shopping cart. On a streaming music site, music consumers search for an artist or album, play tracks, fast-forward through tracks, and add the artist to their favorites, Streaming video sites work in a similar way. On a social networking site for professional or personal contacts, users search for and interact with other people. Each example includes a user interacting with some type of item in several ways.
Building a RecommenderTo demonstrate how to build an analytic job with Mahout on Amazon EMR, we’ll build a movie recommender. We will start with ratings given to movie titles by users in the MovieLens data set, which was compiled by the GroupLens team, and will use the “recommenditembased” example to find most-recommended movies for each user.
- Sign up for an AWS account.
- Configure the elastic-mapreduce ruby client.
Start up an Amazon EMR cluster (note the pricing and make sure to shut the cluster down afterward).
--create --alive --name mahout-tutorial --num-instances 4 --master-instance-
m2.2xlarge --ami-version 3.1 --
Get the MovieLens data
-f1-3 -d, > ratings.csv
hadoop fs -put ratings.csv
Run the recommender job:
mahout recommenditembased --input
.csv --output recommendations --numRecommendations 10 --outputPathForSimilarityMatrix similarity-matrix --similarityClassname SIMILARITY_COSINE
Look for the results in the part-files containing the recommendations:
hadoop fs -
hadoop fs -
You should see a lookup file that looks something like this (your recommendations will be different since they are all 5.0-valued and we are only picking ten):
User ID (Movie ID : Recommendation Strength) Tuples 35 [ 2067:5.0, 17:5.0, 1041:5.0, 2068:5.0, 2087:5.0, 1036:5.0, 900:5.0, 1:5.0, 2081:5.0, 3135:5.0 ] 70 [ 1682:5.0, 551:5.0, 1676:5.0, 1678:5.0, 2797:5.0, 17:5.0, 1:5.0, 1673:5.0, 2791:5.0, 2804:5.0 ] 105 [ 21:5.0, 3147:5.0, 6:5.0, 1019:5.0, 2100:5.0, 2105:5.0, 50:5.0, 1:5.0, 10:5.0, 32:5.0 ] 140 [ 3134:5.0, 1066:5.0, 2080:5.0, 1028:5.0, 21:5.0, 2100:5.0, 318:5.0, 1:5.0, 1035:5.0, 28:5.0 ] 175 [ 1916:5.0, 1921:5.0, 1912:5.0, 1914:5.0, 10:5.0, 11:5.0, 1200:5.0, 2:5.0, 6:5.0, 16:5.0 ] 210 [ 19:5.0, 22:5.0, 2:5.0, 16:5.0, 20:5.0, 21:5.0, 50:5.0, 1:5.0, 6:5.0, 25:5.0 ] 245 [ 2797:5.0, 3359:5.0, 1674:5.0, 2791:5.0, 1127:5.0, 1129:5.0, 356:5.0, 1:5.0, 1676:5.0, 3361:5.0 ] 280 [ 562:5.0, 1127:5.0, 1673:5.0, 1663:5.0, 551:5.0, 2797:5.0, 223:5.0, 1:5.0, 1674:5.0, 2243:5.0 ]
The recommendation strengths are at a hundred percent, or 5.0 in this case, and should work to finesse the results. This probably indicates that there are many more than ten “perfect five” recommendations for most people, so you might calculate more than the top ten or pull from deeper in the ranking to surface less-popular items.
Building a ServiceNext, we'll use this lookup file in a simple web service that returns movie recommendations for any given user.
Get Twisted, and Klein and Redis modules for Python.
Install Redis and start up the server.
Build a web service that pulls the recommendations into Redis and responds to queries.
Put the following into a file, e.g., “hello.py”12345678910111213141516171819202122232425262728293031323334353637383940
# Start up a Redis instance
# Pull out all the recommendations from HDFS
"hadoop fs -cat recommendations/part*"
# Load the recommendations into Redis
# Split recommendations into key of user id
# and value of recommendations
# E.g., 35^I[2067:5.0,17:5.0,1041:5.0,2068:5.0,2087:5.0,
# Put key, value into Redis
# Establish an endpoint that takes in user id in the path
# Get recommendations for this user
'The recommendations for user '
' are '
# Make a default endpoint
'Please add a user id to the URL, e.g. http://localhost:8080/1234\n'
# Start up a listener on port 8080
Start the web service.
twistd -noy hello.py &
Test the web service with user id “37”:
You should see a response like this (again, your recommendations will differ):
The recommendations for user 37 are [7:5.0,2088:5.0,2080:5.0,1043:5.0,3107:5.0,2087:5.0,2078:5.0,3108:5.0,1042:5.0,1028:5.0]
When you’re finished, don’t forget to shut down the cluster:
j-UNIQUEJOBID WAITING ec2-AA-BB-CC-DD.compute-1.amazonaws.com mahout-tutorial
./elastic-mapreduce --terminate j-UNIQUEJOBID
j-UNIQUEJOBID SHUTTING_DOWN ec2-AA-BB-CC-DD.compute-1.amazonaws.com mahout-tutorialAfter a few minutes:
j-UNIQUEJOBID TERMINATED ec2-AA-BB-CC-DD.compute-1.amazonaws.com mahout-tutorial
SummaryCongratulations! You’ve built a simple recommender that includes some of the conceptual functions and moving parts required for more sophisticated recommenders:
- Sourcing raw data
- Performing preparatory transformations
- Running an analytic job
- Interpreting the results
- Deploying results in a web service
Mahout includes other recommender methods, including one related to the Netflix Prize, the Alternating Least Squares(ALS) method, which conceptually fills in an almost empty matrix with predictions, along with others worth investigating. The project also includes other job types and algorithms for clustering and classification, as well as utilities, accessible at the command line.
The Mahout community actively engages with users and developers. You can get involved with Mahout by trying it on Amazon EMR, or by downloading your own copy and running it locally or on your own cluster. You can submit questions on how to use the tools and raise possible bugs to the user mailing list, and you can contribute to development by discussing issues with the dev mailing list, both here. If you encounter bugs or features you would like to see added to the project, you can file tickets and communicate about specific issues on the project Jira page. Be sure to create an account if you’d like to see the Create Issue button at the top of the page.