Chirag Mahapatra

Search

What are search applications?

Search is part of the majority of applications on the internet. In some products, such as Google, search is the main product, while in others, like Pinterest or Amazon, search is a feature that helps the user access the core value proposition of the application.

Figure 1: Examples of search applications. Reference: Google Images

What are the business goals and metrics?

The business goals can vary depending on the context. E.g. in TikTok the goal is to keep the user as engaged as possible, which means ensuring the user continues to watch the videos recommended. On the other hand, Amazon wants to get the user to purchase as soon as possible. This leads to the optimization of different business metrics, such as engagement rate or time to order.

Some of the common metrics for search are:

  • Click-Through Rate (CTR): This measures the percentage of users who click on one or more of the search results after performing a search. A high CTR indicates that the search results are relevant and compelling to users.
  • Average Time Spent on Search Results: This measures the amount of time users spend on the search results page after performing a search. A shorter search time indicates that users are finding the information they need quickly and easily.
  • Search Abandonment Rate: This measures the percentage of users who abandon their search before finding what they are looking for. A high abandonment rate may indicate that the search application is not providing relevant results or is difficult to use.
  • Search Conversion Rate: This measures the percentage of users who perform a search and then take a desired action, such as making a purchase or filling out a form. This is a higher bar than click-through rate, and it helps businesses understand how well their search application is driving conversions and revenue.
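As a concrete illustration, here is a minimal sketch of how these metrics might be computed from session logs. The field names (num_clicks, time_on_results_s, converted) are hypothetical, and abandonment is simplified to "no click in the session"; real product definitions vary.

```python
def search_metrics(sessions: list[dict]) -> dict:
    """Compute the four metrics above from hypothetical session logs."""
    if not sessions:
        return {}
    n = len(sessions)
    clicked = sum(1 for s in sessions if s["num_clicks"] > 0)
    converted = sum(1 for s in sessions if s["converted"])
    return {
        "ctr": clicked / n,                        # sessions with at least one click
        "avg_time_on_results_s": sum(s["time_on_results_s"] for s in sessions) / n,
        "abandonment_rate": (n - clicked) / n,     # simplified: no click => abandoned
        "conversion_rate": converted / n,          # e.g. purchase or form fill after search
    }

# Example usage:
# search_metrics([{"num_clicks": 2, "time_on_results_s": 8.5, "converted": True}])
```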

What are the Xs?

Figure 2: Search application pipeline. Reference: Self created

The above diagram shows a search pipeline at a very high level. A search query is typically a user input such as "Who is the president of the US?". The system then generates a number of candidate responses. The key part of this process is to rank the candidates to optimize for the business metric. For each query-candidate pair, we can generate features such as:

  • Exact match: Whether there is an exact match between the query and the candidate
  • Phrase match: Whether there is a phrase match between the query and the candidate
  • Broad match: Whether some of the words in the query are broadly matched in the candidate

Figure 3: Types of keyword matches. Reference: https://www.morevisibility.com/blogs/sem/whats-up-with-keyword-match-types-these-days.html
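To make the match features above concrete, below is a minimal sketch of how they might be computed for a query-candidate pair. The exact definitions (e.g. what counts as a broad match) are assumptions; production systems typically use tokenizers, stemming, and synonym expansion rather than simple string operations.

```python
def match_features(query: str, candidate_text: str) -> dict:
    """Hypothetical exact/phrase/broad match features for one query-candidate pair."""
    q, c = query.lower().strip(), candidate_text.lower()
    q_words, c_words = set(q.split()), set(c.split())
    return {
        "exact_match": int(q == c),      # query and candidate text are identical
        "phrase_match": int(q in c),     # query appears as a contiguous phrase
        # fraction of query words that appear anywhere in the candidate
        "broad_match": len(q_words & c_words) / max(len(q_words), 1),
    }
```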

You can also generate features based only on the candidate, such as:

  • Popularity: The number of people who have already viewed the candidate
  • Time to bounce: How much time users typically spend on the candidate page before leaving
  • Page load speed: How long the page takes to load
  • Author: The credibility of the candidate's author
  • Recency: When the candidate was created
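A similar sketch for the candidate-only features, assuming hypothetical log fields such as view_count and total_dwell_seconds; an author credibility score would typically come from a separate system:

```python
from datetime import datetime, timezone

def candidate_features(candidate: dict) -> dict:
    """Hypothetical features derived from the candidate alone.
    Assumes created_at is a timezone-aware datetime from the logs."""
    age_days = (datetime.now(timezone.utc) - candidate["created_at"]).days
    views = max(candidate["view_count"], 1)
    return {
        "popularity": candidate["view_count"],                    # total views
        "avg_dwell_s": candidate["total_dwell_seconds"] / views,  # time before bounce
        "page_load_ms": candidate["page_load_ms"],                # page load speed
        "author_credibility": candidate["author_score"],          # e.g. a 0-1 score
        "recency_days": age_days,                                 # when it was created
    }
```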

This is a small subset of the features which would feed into the algorithm. However, these features are not devoid of problems. E.g. if you use keyword match as a feature, content creators might stuff their content with the most popular keywords to rank higher. This is actually pretty common. For search engines, this process is known as search engine optimization. Companies like Google have made a number of changes to their algorithm to prevent malicious gaming of search rankings via content farming.

Popularity is another problematic feature since there is plenty of content which is popular but isn't accurate. For a few days after the soccer World Cup in Qatar, the top result for the query "Who won the most recent world cup" was France, which won the World Cup in 2018, instead of Argentina, which won it in 2022. This is likely because of the popularity of the pages which said France.

What are the Ys?

Let us define the Y as a boolean variable indicating whether each candidate is relevant to the query.

One key question is how this label can be obtained. The first case to consider is when your application has not launched or has recently launched and there is no user data. In that case, human labeling is the best option to create the initial set of Ys. While this is an important step, it is also very important to know the potential flaws. The key one is that a group of labelers hired to label the data will not be representative of your users. You might also have disagreements between the human labelers. E.g. for queries such as "Who is the best basketball player?" there is no objective answer. Just kidding. It's Michael Jordan.

Once you do have users, you can use their behavior to create Y. Specifically, if users click on a given piece of content for a given query, that's a positive example. Note that it is important to ground this with the use case. E.g. in Amazon's case, you care about whether the user bought the item after clicking. In TikTok's case, you would probably want to factor in whether the user swiped away immediately after starting the video or watched the entire video.
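A minimal sketch of such a grounding rule is shown below. The event fields and the watch-fraction threshold are assumptions for illustration; each product would define its own notion of a "successful" click.

```python
def label_from_behavior(event: dict, min_watch_fraction: float = 0.5) -> bool:
    """Hypothetical rule: a click counts as relevant only when grounded in the use case."""
    if not event["clicked"]:
        return False
    if event["surface"] == "ecommerce":
        return event["purchased"]        # e.g. Amazon: did the click lead to a purchase?
    if event["surface"] == "video":
        # e.g. TikTok: an immediate swipe-away should not count as relevant
        return event["watch_fraction"] >= min_watch_fraction
    return True                          # default: a click alone counts
```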

How would you create a training dataset?

Now that we have defined our Xs and Ys, we can create our training dataset. We would get the Xs from the historical record of queries and the potential candidates, and the Ys from the user behavior logged at that time. This is trickier when the application is just starting out: you would have to manually bootstrap the dataset. Websites like Quora manually created the initial question-and-answer set and queries for their search applications.
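Putting the pieces together, assembling (X, Y) rows from logged impressions might look like the following sketch, which reuses the hypothetical feature and labeling functions sketched earlier:

```python
def build_training_rows(impressions: list[dict]) -> list[tuple[dict, int]]:
    """Join logged queries, candidates, and behavior into (features, label) pairs."""
    rows = []
    for imp in impressions:
        features = {
            **match_features(imp["query"], imp["candidate"]["text"]),
            **candidate_features(imp["candidate"]),
        }
        label = int(label_from_behavior(imp["event"]))
        rows.append((features, label))
    return rows
```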

While creating the training dataset for the search application, it is important to keep in mind some of the common biases:

  • Selection bias: This occurs when the dataset is not representative of the population or domain that the search engine is intended to operate in. For example, the queries the search application is trained on may not be representative of the queries it will see in deployment.
  • Confirmation bias: This occurs when the training dataset is biased towards certain perspectives or sources of information, leading to a skewed view of what is considered relevant or important. For example, if the training dataset primarily includes data from certain websites or sources, the search engine may prioritize those sources over others.
  • Label bias: This occurs when the labels or annotations used to train the model are themselves biased or subjective. For example, if the labels are based on human judgments, they may reflect human biases or preferences rather than objective measures of relevance.
  • Temporal bias: This occurs when the training dataset is not representative of the current state of the domain or population being searched. For example, if the training dataset is several years old, it may not reflect changes in language use or search patterns that have occurred since then.
  • Demographic bias: This occurs when the training dataset is biased towards certain demographics, such as age, gender, or ethnicity, leading to a skewed view of what is considered relevant or important for different groups of people.

How would you create an evaluation dataset?

The evaluation dataset is a separate dataset which is held out during training, in order to provide an unbiased evaluation of the search algorithm. The sources for this dataset will be the same as for the training dataset. It is key to ensure that there is no overlap between the two datasets, and that the distribution matches what will be seen during deployment.
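One common way to guarantee no overlap is to split at the query level with a stable hash, so the same query string can never land in both sets. A minimal sketch (the 10% evaluation fraction is an arbitrary choice):

```python
import hashlib

def split_by_query(rows: list[dict], eval_percent: int = 10) -> tuple[list, list]:
    """Deterministically split rows so a given query always lands in the same set."""
    train, evaluation = [], []
    for row in rows:
        bucket = int(hashlib.md5(row["query"].encode()).hexdigest(), 16) % 100
        (evaluation if bucket < eval_percent else train).append(row)
    return train, evaluation
```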

The best practice would be to keep refreshing the training and evaluation datasets with examples from deployment on a regular basis so that the algorithm keeps improving with new examples.

What should you keep in mind during deployment?

The interface used during deployment is critical because that influences the search queries. A feature which strongly influences search queries is Auto Complete. While this is intended to be a beneficial feature, it can also propagate bias.

Figure 4: Auto Complete results on "climate change is". Reference: https://www.theguardian.com/technology/2016/dec/16/google-autocomplete-rightwing-bias-algorithm-political-propaganda

The image above is an example where the Auto Complete results for "climate change is" can propagate bias. Note that these are not the results of search queries but text completions of common search queries. These completions influence future queries as well.

Another big challenge is multiple objective functions. E.g. earlier, Google was focused on getting the user to the information they needed as soon as possible. However, over time Google started optimizing for time spent on its pages.

Figure 5: Golden State Warriors search results. Reference: Google

This led them to add a number of modules to the search results page, such as recent scores, stories, videos, and an info panel, which make the user stay on the page. While this is not bad on its own, it does change the original objective function and subsequently the dataset.

It is important to keep in mind the influence the product has on the dataset created and the potential outcomes.

Summary

In this article we covered the end-to-end steps of creating a dataset for a search application, from defining the concepts to identifying potential errors to deployment. There are a number of things to consider along the way to ensure that one avoids the common pitfalls and creates an application beneficial to both the end user and the business.