If you shop online, use a streaming service like Netflix or Spotify, or are part of a social networking platform, chances are you’re bombarded with recommendations. Sometimes the recommendations are good, like Spotify-generated playlists that know you better than you know yourself. Other times they’re downright creepy, like LinkedIn suggesting you connect with an ex.
Powered by user data that inform data science models, these tailored recommendations make it easier and more efficient to find the things that are relevant to us, thus changing behavior around what we purchase and consume, who we connect with, which jobs we apply for, and so much more. Organizations and companies benefit as well: recommendation engines help drive more traffic to websites, deliver more relevant content, and boost engagement among their user base, all of which increase revenue and customer satisfaction.
Types of recommendation engines
So, how do recommendation engines work, anyway? At their core, they are fancy filters. A filter takes a dataset (e.g. an inventory of products, a list of movies, a network of people) and selects items based on one or more parameters (e.g. only show me household products, action movies, or people who work at U.Group).
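To make the idea of a filter concrete, here is a minimal sketch in Python. The catalog, its field names, and the `filter_items` helper are all made up for illustration.

```python
def filter_items(items, **criteria):
    """Keep only the items whose fields match every given criterion."""
    return [
        item for item in items
        if all(item.get(field) == value for field, value in criteria.items())
    ]

# Hypothetical inventory; titles and fields are placeholders.
catalog = [
    {"title": "movie_1", "genre": "action", "year": 2018},
    {"title": "movie_2", "genre": "comedy", "year": 2018},
    {"title": "movie_3", "genre": "action", "year": 2020},
]

action_movies = filter_items(catalog, genre="action")
print([movie["title"] for movie in action_movies])  # ['movie_1', 'movie_3']
```

Here the user has to supply the criteria by hand.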
Recommendation engines go a step further by applying automatic filters specific to the unique user or product. The kinds of data science models used to generate recommendations depend on the data we have access to, user behavior on the platform, and the kinds of actions we want to promote. There are four common approaches to recommendation engines:
The popularity filtering model and Twitter:
The simplest of the four, popularity filtering models rank recommendations by aggregate signals such as the highest ratings or the most likes, purchases, or views. To the left are Washington, DC’s Twitter trends at the time of writing, sorted by number of tweets (minus the enticing Applebee’s ad). These recommendations are not user-specific, although some filters, such as location, may be applied.
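As a rough sketch of popularity filtering, the snippet below counts interactions per item and returns the most frequent ones, optionally restricted to one location. The event format, the `most_popular` name, and the sample data are assumptions for illustration; this is not how Twitter computes its trends.

```python
from collections import Counter

def most_popular(events, top_n=3, location=None):
    """Rank items by raw interaction count, optionally within one location."""
    counts = Counter(
        event["item"] for event in events
        if location is None or event["location"] == location
    )
    return [item for item, _ in counts.most_common(top_n)]

# Hypothetical interaction log: one entry per tweet mentioning a topic.
events = [
    {"item": "#topicA", "location": "DC"},
    {"item": "#topicB", "location": "DC"},
    {"item": "#topicA", "location": "DC"},
    {"item": "#topicC", "location": "NYC"},
    {"item": "#topicC", "location": "NYC"},
    {"item": "#topicC", "location": "NYC"},
]

print(most_popular(events))                 # ['#topicC', '#topicA', '#topicB']
print(most_popular(events, location="DC"))  # ['#topicA', '#topicB']
```

Every user in the same location sees the same list, which is what makes this approach simple and also what keeps it from being personal.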
The collaborative filtering model and Amazon:
Collaborative filtering models provide recommendations by identifying users with similar characteristics. For instance, Amazon made the above recommendations to me based on customers who searched for the same product I did. Collaborative filtering uses data on users and user behavior to group them into buckets. Users in the same bucket receive the same recommendations.
Of course, if the model doesn’t have enough user data or if the user base is too extensive, grouping users becomes difficult and the resulting recommendations may be irrelevant. The example to the right shows the Amazon recommendations I was given upon viewing a cat-themed tea set with only 58 reviews. A customer who purchased the cat tea set also purchased a portable desk fan. Unless I’m planning to use a fan to cool my tea, this is not a good recommendation based on my interest.
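The sketch below illustrates one simple flavor of collaborative filtering. Instead of fixed buckets, it scores each other user by how much their purchase history overlaps with the target user's (Jaccard similarity), then suggests what those similar users bought. The user names, purchase histories, and function names are all hypothetical.

```python
def jaccard(a, b):
    """Overlap between two sets of items, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_from_similar_users(target, histories, top_n=3):
    """Suggest unseen items bought by similar users, weighted by similarity."""
    seen = histories[target]
    scores = {}
    for user, items in histories.items():
        if user == target:
            continue
        similarity = jaccard(seen, items)
        if similarity == 0:
            continue  # users with nothing in common contribute nothing
        for item in items - seen:
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]

# Hypothetical purchase histories.
histories = {
    "ana": {"teapot", "tea cups", "loose-leaf tea"},
    "ben": {"teapot", "tea cups", "tea strainer"},
    "cal": {"teapot", "desk fan"},
    "dee": {"phone case", "desk fan"},
}

print(recommend_from_similar_users("ana", histories))
# ['tea strainer', 'desk fan']
```

The desk fan still shows up, only with a lower score: with so few users, a single weakly similar shopper is enough to surface an odd suggestion, which mirrors the sparse-data problem described above.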
The content filtering model and Netflix:
One way to avoid irrelevant recommendations due to insufficient user data is to use content filtering models. These models rely on item-to-item similarity rather than user-to-user similarity, identifying and recommending items similar to what the user has purchased, liked, or rated highly in the past. They are useful when there is known data on the item, but not on the user. In the Netflix example above, I am receiving recommendations similar to “Tidying Up with Marie Kondo,” a home improvement show I have watched and rated with a thumbs up.
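A minimal sketch of item-to-item similarity might look like the following, assuming each item carries a set of descriptive tags. The show names, tags, and helper functions are invented for illustration and say nothing about how Netflix actually describes its catalog.

```python
def tag_similarity(tags_a, tags_b):
    """Share of tags two items have in common (Jaccard similarity)."""
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0

def similar_items(liked, catalog, top_n=2):
    """Return the items whose tags best match one item the user liked."""
    liked_tags = catalog[liked]
    scored = [
        (tag_similarity(liked_tags, tags), title)
        for title, tags in catalog.items()
        if title != liked
    ]
    # Drop items with no tags in common, then rank by similarity.
    scored = [(score, title) for score, title in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:top_n]]

# Hypothetical catalog: item name -> descriptive tags.
catalog = {
    "show_home_organizing": {"reality", "home", "organizing"},
    "show_home_renovation": {"reality", "home", "renovation"},
    "show_cooking_contest": {"reality", "cooking", "competition"},
    "show_space_drama": {"drama", "sci-fi"},
}

print(similar_items("show_home_organizing", catalog))
# ['show_home_renovation', 'show_cooking_contest']
```

Note that no other user's behavior is involved: a single liked item and the item metadata are enough to produce a recommendation.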
The hybrid filtering model and Spotify:
While content filtering is great when we have limited user data, it restricts recommendations to items that resemble what the user has already consumed. Our preferences and behavior change depending on the preferences and behavior of those close to us. Imagine all the great people, products, and content you would have missed out on if your friends hadn’t recommended them to you! The best recommendation engines utilize both item data and user data to generate relevant suggestions.
Hybrid filtering models do just that. They combine two or more filtering techniques, using the advantages of one model to offset the disadvantages of the others. For example, content filtering cannot make recommendations to new users, as we don’t have data on their preferences and past behaviors. However, we can use collaborative filtering to cluster the user with similar users (e.g., based on gender, age, or location) and make recommendations based on the user’s groupings.
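One common way to combine models is a weighted blend of their scores. The sketch below assumes each model has already produced a score per item, and shifts weight from collaborative to content scores as the user accumulates history. The `full_trust_at` threshold, the score values, and the function names are illustrative assumptions, not a standard recipe.

```python
def blend_scores(content_scores, collaborative_scores, history_size, full_trust_at=20):
    """Weighted blend: lean on collaborative scores until the user has history."""
    w_content = min(history_size / full_trust_at, 1.0)
    w_collab = 1.0 - w_content
    items = set(content_scores) | set(collaborative_scores)
    return {
        item: w_content * content_scores.get(item, 0.0)
        + w_collab * collaborative_scores.get(item, 0.0)
        for item in items
    }

def top_items(scores, top_n=2):
    """Highest-scoring items first; ties broken alphabetically."""
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

# Hypothetical scores from the two underlying models.
collaborative = {"item_b": 0.8, "item_c": 0.6}

# New user: no history, so content filtering has nothing to score.
new_user = blend_scores({}, collaborative, history_size=0)
print(top_items(new_user))  # ['item_b', 'item_c']

# Established user: content scores now dominate.
content = {"item_a": 0.9, "item_b": 0.1}
regular = blend_scores(content, collaborative, history_size=20)
print(top_items(regular))  # ['item_a', 'item_b']
```

A linear ramp is only one choice. Other hybrid designs switch models outright or feed one model's output into another.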
Spotify’s recommendation engine is a good example of a hybrid filtering model. Its system has three parts: one collaborative filtering model and two types of content filtering models.
- Collaborative filtering: Based on the tracks the user listens to, the songs they add to their playlists, and the artist pages the user visits, Spotify builds a profile on the user’s likes and preferences. Collaborative filtering then groups the user with other users who like the same songs, albums, or artists, and makes recommendations based on the music others in their grouping like.
- Content filtering with NLP: Spotify crawls the web for blog posts and text about specific songs and artists, using Natural Language Processing (NLP) to parse the text and extract adjectives, sentiments, and related songs and artists. From these, it builds song and artist profiles containing top terms and their significance scores. The profiles feed the content filtering model, which recommends songs or artists with similar profiles.
- Content filtering with convolutional neural networks: For songs that are new to Spotify and don’t have many listens, Spotify uses a Convolutional Neural Network (CNN) to parse the raw audio and match it with other songs. The CNN analyzes song characteristics such as the time signature, key, mode, tempo, and volume, then groups the new song with other songs that have similar audio profiles. If the user likes a song that sounds similar to the new song, Spotify plops that new song into the user’s Discover Weekly playlist.
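The last step, matching a new song to songs with similar audio profiles, can be sketched without any neural network. The example below is not Spotify's system: it skips the CNN entirely and assumes the audio features have already been extracted into small hand-made vectors. The feature choice, value ranges, and song names are all hypothetical.

```python
import math

def scale(features, ranges):
    """Rescale each feature to 0..1 so tempo does not dominate the distance."""
    return [
        (value - low) / (high - low)
        for value, (low, high) in zip(features, ranges)
    ]

def closest_songs(new_song, library, ranges, top_n=2):
    """Return the library songs nearest to the new song in feature space."""
    target = scale(new_song, ranges)
    distances = {
        title: math.dist(target, scale(features, ranges))
        for title, features in library.items()
    }
    return sorted(distances, key=lambda title: (distances[title], title))[:top_n]

# Assumed features: (tempo in BPM, mode: 1 = major / 0 = minor, loudness in dB)
RANGES = [(60, 200), (0, 1), (-60, 0)]
library = {
    "song_a": (120, 1, -8),
    "song_b": (124, 1, -7),
    "song_c": (70, 0, -20),
}
new_song = (122, 1, -9)

print(closest_songs(new_song, library, RANGES))  # ['song_a', 'song_b']
```

A listener who likes `song_a` would then be a candidate to receive the new song, even though nobody has played it yet.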
Spotify’s hybrid filtering model is exactly why the Discover Weekly playlists are so good: it collects, parses, and incorporates a myriad of user and song data to pinpoint the kind of music you might like. Conversely, LinkedIn’s naïve suggestion to connect with your ex stems from limited user data, but this is a good thing! It would be unethical for LinkedIn to gather data on your private life. LinkedIn makes recommendations based on what it can glean from your profile; if one of your connections is connected to your ex, LinkedIn will assume you’re interested in connecting with them, too.
U.Group’s recommendation engines
We are currently building a recommendation engine for a client at U.Group using content filtering. We chose this model as the Minimum Viable Product (MVP) because we currently do not have robust user data. This recommendation engine takes a document a user has previously consumed, then finds and recommends similar documents.
Similar to Spotify’s recommendation engine, we determine similarities between documents using NLP. First, we convert text documents into Term Frequency-Inverse Document Frequency (TF-IDF) vectors. TF-IDF weights each term by how often it appears in a document, discounted by how many documents in the corpus contain it, effectively transforming text into numeric vectors. If documents repeatedly share rare and unique words, they are more likely to sit close together in vector space. We can then use cosine similarity to locate the TF-IDF document vectors closest to our document vector of interest. The documents with the closest TF-IDF vectors are the ones most textually similar to each other.
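The pipeline described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not our production code: tokenization is a naive whitespace split, the weighting is the textbook form (term frequency times ln(N / document frequency)), and the sample documents and function names are made up. Libraries such as scikit-learn use smoothed variants of the same idea.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def tfidf_vectors(documents):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0 if either is all zeros)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(index, documents, top_n=1):
    """Indices of the documents closest to documents[index]."""
    vectors = tfidf_vectors(documents)
    scores = [
        (cosine_similarity(vectors[index], vector), i)
        for i, vector in enumerate(vectors) if i != index
    ]
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scores[:top_n]]

# Hypothetical corpus.
documents = [
    "quarterly budget report for the finance team",
    "finance team budget forecast for next quarter",
    "guide to onboarding new engineering hires",
]

print(most_similar(0, documents))  # [1]
```

The first two documents share the terms "budget", "finance", and "team", so they end up closest to each other, while the onboarding guide shares no terms with the first document and scores zero.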
While this work is still under development, our goal for this recommendation engine is to help the client provide more relevant information to their audience, thus driving up engagement and maximizing the benefits they receive from the content available. As we learn more about the users, we can experiment with and incorporate additional models to enhance the value of the recommendations our engine provides. Ultimately, recommendation engines are not static algorithms to set and forget; they evolve constantly as more data become available. You can bet Spotify’s current hybrid model is not its first recommendation engine, nor will it be its last.
In future articles, I will dive into how data can be gathered implicitly (web scraping or user behavior tracking) or explicitly (user feedback and surveys) to inform the models we build, the tricky business of measuring recommendation quality, and ethical issues that may arise from recommendation engines.