As an online quiz taker, it’s easy to overlook just how much work goes into helping you figure out which Harry Potter house you belong to or what kind of potato dish matches your personality. Even the simplest, most comical quizzes are powered by vast amounts of data—data that has often been curated, cleaned, and programmed by a data scientist.
Building a quiz from the ground up involves compelling problems and interesting data—which is exactly the kind of project I revel in as a data scientist. So you can imagine how ecstatic I was when I was asked to help build a quiz that recommends national parks to users! Digging through data about our national parks and figuring out a way to showcase them sounded like a dream project.
This blog post covers my journey of creating the Find Your Park Quiz—from beginning to end, discovery to deployment—and is framed within the typical data science process. While data science projects vary considerably in terms of client needs, problem space, available data, and modeling techniques, they more or less follow the same seven steps detailed below. Understanding the data science process can help data scientists—and those who work with them—solve complex data problems and build interesting and exciting things. Let’s dig in!
Step 1: Frame the Problem
This is the most important step in the data science process. Asking the right questions and understanding the problem at hand helps data scientists develop a strategy. What’s the goal? What kind of data do we need? What kind of models would be appropriate? Is the project feasible with current technology, resources, product design, and timeline?
Thanks to Sally Moriarty and the rest of the National Park Foundation (NPF) team, the goal for the Find Your Park Quiz was clear—to introduce users to new parks. For every Yosemite, Yellowstone, or Zion, there are several lesser-known parks. With the Find Your Park Quiz, we aimed to increase people’s knowledge of and interest in visiting these hidden gems.
Step 2: Obtain the Right Data
It should come as no surprise that a data scientist cannot do their work without data—and enough of the right kind, at that. Whether the data is right and sufficient depends on the data scientist's understanding of the problem and their theory of how to solve it (Step 1).
There are many different types of data, stored in just as many ways. The data may live in databases that can be queried or on web pages that can be scraped; it may be publicly accessible or locked behind paywalls and privacy regulations. However data scientists access it, they must be mindful to obtain the data ethically and use it responsibly.
The data I needed existed on two websites: the Find Your Park website and the NPF website. Both websites use Drupal, a content management system (CMS) that stores and organizes web content on the backend using a SQL database.
I did not know much about Drupal and how the website taxonomies were organized. So, I asked for help. The team’s Drupal developer was kind enough to give me a Drupal 101 lesson, and U.Group’s Drupal database expert helped me write the query I needed. It was a complicated SQL query, so I really appreciated the help!
Step 3: Clean the Data
After getting the data, data scientists must process and clean it. Depending on data quantity, format, and messiness, this can take a good chunk of time. Data scientists must examine every column of the dataset for missing or corrupt values and determine how best to handle them. They also examine whether values should be merged or separated, and whether the data is properly formatted, among other things.
For this project, I first uploaded the data to a Jupyter Notebook, a common web application data scientists use to document and run code. Because my data came from two Drupal websites, I had to merge the two datasets into one. Thankfully, the two datasets shared the same unique park ID, which made the merge pretty straightforward.
If they hadn’t, I would have had to merge based on other information, such as park name or location—a process often called entity resolution. Luckily for me, the data was pretty clean overall, and I was able to move on to the next step!
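As an illustration, a merge like that is a one-liner in pandas; the column names and park IDs below are invented for the sketch:

```python
import pandas as pd

# Hypothetical extracts from the two Drupal databases; the column
# names and park IDs are invented for illustration.
find_your_park = pd.DataFrame({
    "park_id": [101, 102, 103],
    "description": ["desert rock formations", "coastal tide pools", "alpine forest"],
})
npf_site = pd.DataFrame({
    "park_id": [101, 102, 103],
    "name": ["Joshua Tree", "Point Reyes", "North Cascades"],
})

# An inner join on the shared unique park ID keeps only parks
# that appear in both datasets.
parks = find_your_park.merge(npf_site, on="park_id", how="inner")
```

Because the ID is unique in both tables, the join produces exactly one row per park, with the description and name side by side.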
Step 4: Explore the Data
Once the data is clean, it’s time to start exploring it! When performing exploratory data analysis, data scientists make note of outliers, patterns, and trends that may help to solve the problem framed in step one.
For the Find Your Park Quiz, this step took the most time. Most of the information I needed to maximize the variety of quiz questions lived in the free text of the park descriptions. To process that text, I used a natural language processing technique called Term Frequency – Inverse Document Frequency (TF-IDF) vectors.
A TF-IDF vector is a list of weights determined by word frequency and rarity, and each park has its own vector with a weight for each selected keyword. A word like “the” appears frequently in each park description, but it is also common across all park descriptions, so its TF-IDF weight would be low in every park vector. A park’s distinguishing features, on the other hand, are rare across the full set of descriptions and therefore receive higher weights in that park’s vector.
Let’s look at a particular park vector. Compared to the other park descriptions, the text for Joshua Tree National Park contains more instances of the word “desert,” a word that is relatively rare across all park descriptions. The resulting TF-IDF vector for Joshua Tree National Park is more weighted for “desert.”
On the other hand, the word “beach” does not show up at all. Therefore, Joshua Tree’s vector weight for “beach” is 0. After converting all the park descriptions into vectors, we end up with a weighted matrix of park vectors that displays numerically how “beach” a park is, how “desert” a park is, how “forest” a park is, and about fifty other keywords.
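A minimal sketch of the idea in plain Python, using toy descriptions in place of the real park text (this raw-count TF with a log IDF is one common variant of the formula):

```python
import math

# Toy stand-ins for the real park descriptions.
docs = {
    "Joshua Tree": "desert desert rock climbing",
    "Point Reyes": "beach coastline tide pools",
    "Yosemite": "forest waterfalls granite cliffs",
}

def tfidf(word, text, corpus):
    """Term frequency in one description times the log inverse
    document frequency of the word across all descriptions."""
    tokens = text.split()
    tf = tokens.count(word) / len(tokens)
    df = sum(word in doc.split() for doc in corpus)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = list(docs.values())
# "desert" is frequent in Joshua Tree and rare elsewhere: high weight.
jt_desert = tfidf("desert", docs["Joshua Tree"], corpus)
# "beach" never appears in Joshua Tree's description: weight 0.
jt_beach = tfidf("beach", docs["Joshua Tree"], corpus)
```

Running `tfidf` for every keyword against every description builds the weighted matrix of park vectors described above.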
Step 5: Build the Model
Now it’s time to build the model! There are many different models data scientists use to solve problems. I decided to use the dot product model.
For this model, we organized the user’s preferences into a vector similar to our weighted park vectors, except instead of keyword weights, it holds 1s and 0s according to the user’s answers.
So if a user wants to go to a desert and is not at all interested in beaches, we multiply the user’s vector [Desert: 1, Beach: 0] element-wise with the corresponding Joshua Tree vector and add up the products to form the final park score for Joshua Tree.
Once we multiply the user’s preferences against the entire weighted matrix of park vectors, we can find the top 10 park “scores,” which become the user’s park recommendations.
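Sketched in Python with an invented three-keyword vocabulary and made-up weights, the scoring looks like this:

```python
# Hypothetical keyword order: [desert, beach, forest]; the weights
# are invented for illustration.
park_vectors = {
    "Joshua Tree": [0.9, 0.0, 0.1],
    "Point Reyes": [0.0, 0.8, 0.2],
    "Yosemite":    [0.0, 0.0, 0.7],
}

# The user wants deserts and forests, but no beaches.
user = [1, 0, 1]

def dot(u, v):
    """Sum of element-wise products: the park's final score."""
    return sum(a * b for a, b in zip(u, v))

scores = {name: dot(user, vec) for name, vec in park_vectors.items()}
# Rank parks by score, highest first; the top entries become
# the user's recommendations.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In the real quiz, the same ranking runs over the full matrix of roughly fifty keywords and all parks, and the top 10 scores become the recommendations.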
There are several benefits to using dot products over other models. Firstly, the questions don’t depend on the user’s previous answers. Eliminating if-then statements made development easier and load times faster, and it gave our content strategists more flexibility to create fun questions!
Secondly, answers become more customizable. Users aren’t forced into predefined park buckets; their unique choices create a unique combination of park recommendations. And finally, this model is easy to adjust. If we want to highlight African American history during African American History Month, we can assign higher weights to related keywords so that parks honoring African American history and culture end up with higher scores.
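As a sketch of that last adjustment (the park names, keywords, and multiplier are all invented), boosting a themed keyword is a single pass over the matrix:

```python
# Invented park vectors keyed by keyword.
park_vectors = {
    "Park A": {"desert": 0.9, "history": 0.2},
    "Park B": {"desert": 0.1, "history": 0.6},
}

BOOST = 2.0  # assumed seasonal multiplier, for illustration

# Scale the themed keyword in every park vector so thematically
# relevant parks earn higher dot-product scores this month.
for vec in park_vectors.values():
    vec["history"] *= BOOST
```

Because scoring is just a dot product, the boost flows straight through: parks with strong "history" weights rise in the rankings without touching any quiz logic.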
Step 6: Deploy the Model
To use a model in production, you must turn it into a web application. Up to this point, my work largely lived in a Jupyter Notebook on my local computer, which couldn’t interact with the website where the quiz would run.
So, I needed to package the model into a microservice and create an application programming interface (API) so other services could interact with my model using a common language, typically JSON. You can think of my Jupyter Notebook as the blueprint for my model, the microservice as the house built from that blueprint, and the API as the road connecting it to other houses—that is, other web infrastructure—so information can travel between them.
With this simple web infrastructure, the quiz collects the user’s responses and sends them to my microservice using the API. My model then takes that data and calculates the top scoring parks. Next, the microservice sends a list of top parks back to the quiz via API. Finally, the quiz displays the results to the user—all of this takes place in seconds!
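A stripped-down sketch of that request/response cycle (the JSON field names and the scoring stub are assumptions; a real deployment would sit behind a web framework):

```python
import json

# Invented park vectors the microservice would hold in memory.
PARK_VECTORS = {
    "Joshua Tree": {"desert": 0.9, "beach": 0.0},
    "Point Reyes": {"desert": 0.0, "beach": 0.8},
}

def handle_request(body: str) -> str:
    """Parse the quiz answers from JSON, score every park with a
    dot product, and return the ranked park names as JSON."""
    answers = json.loads(body)["answers"]
    scores = {
        park: sum(weights.get(key, 0.0) * value for key, value in answers.items())
        for park, weights in PARK_VECTORS.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return json.dumps({"recommendations": ranked})

# What the quiz front end would send after the user answers.
reply = handle_request('{"answers": {"desert": 1, "beach": 0}}')
```

The quiz only ever sees the two JSON payloads; everything between them happens inside the microservice.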
Step 7: Collect Feedback and Make Improvements!
If you ask 10 data scientists to build a model for a park recommendation quiz, you will likely get 10 different solutions. This exemplifies the “science” behind data science—there is no one “true” answer. Data scientists ask a question, formulate a hypothesis, and experiment!
That’s why it’s important to define metrics and collect feedback once the model has been deployed—both help data scientists iterate on and improve the model over time. We collected feedback through a simple pop-up box at the end of the quiz asking whether the user liked their results.
Other Considerations for Data Scientists and Teammates Working with Them
While the data science process is presented as a sequence of steps in this article, it’s rarely so linear in real life. A data scientist might not be able to obtain the right data and may have to revisit step one to reframe the problem.
Or, after deploying the model and collecting feedback, the data scientist may need to update the data, adjust the model, or try another model altogether. Remember, going “backwards” does not mean the data scientist is not making progress! In situations like this, it’s important to be flexible and adjust your approach to reflect new information. Always be open to other suggestions, alternate models, and other data sources.
Moreover, don’t be afraid to reach out to your team for feedback or advice on overcoming roadblocks. Data scientists cannot do their work in a vacuum, after all. From the client to the content strategists, I worked with several different partners over the course of this project. My model would not exist without the help of everyone on the team!