How to build a search engine with word embeddings

First published on May 26, 2021

Last updated on November 24, 2021

19 minute read

Chris Zhu

TLDR

A dual embedding space approach to creating query and document embeddings for embedding-based ranking can offer more contextually relevant results than the traditional single-space word embedding approach.

Imagine you have a large set of news articles and want to let your users search through them. For example, here are 5 random articles from the Economist:

  1. An investment bonanza is coming

  2. Who governs a country’s airspace?

  3. What is a supermoon, and how noticeable is it to the naked eye?

  4. What the evidence says about police body-cameras

  5. Who controls Syria?

Let’s think about the kinds of queries potential users might issue and the articles we would want to return for them.

Exact and Substring Matches

  • A search for “investment” should return the 1st article.

  • A search for “supermoon” should return the 3rd article.

  • A search for “syria” should return the 5th article.

This is quite simple to implement with traditional information retrieval techniques, since each query is an exact keyword match against its document.
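For instance, a naive substring matcher is enough for these queries (a toy sketch; a real system would use an inverted index):

def keyword_search(query, documents):
    # Keep documents containing the query as a case-insensitive substring.
    return [doc for doc in documents if query.lower() in doc.lower()]

keyword_search("syria", articles)  # articles = the headlines above -> ["Who controls Syria?"]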

Harder Examples

  • A search for “banking” should probably return the 1st article.

  • A search for “astrology” should return the 3rd article.

  • A search for “middle east” should return the 5th article.

The challenge with these queries is that none of them appear as keywords in the original documents, so there isn’t any string matching we can perform between the search query and the document.

What we need is a way to perform searches over text without requiring exact keyword matches between words in the query and words in the documents.

Word Embeddings

One way we can do this is with word embeddings, which map each word to a dense vector such that words with similar meanings are close together in the vector space.

If we can find a way to create a vector for the query and a vector for every document in the same vector space, then we can compute the similarity between our query vector and every document vector, and retrieve the documents that are most semantically similar to our query.
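A common similarity measure for this is cosine similarity. A minimal sketch (the helper names here are illustrative, not from the original post):

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the two vectors point in exactly the same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_by_similarity(query_vec, doc_vecs):
    # Indices of documents, most similar first.
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return np.argsort(scores)[::-1]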

Creating Document Embeddings from Word Embeddings

Assuming we have a lookup table from each word to its embedding, how do we actually create an embedding representing a string of words?

We can think of the semantics of a document as the average of the semantics of its individual words, and compute a mean word embedding to represent the document.

Specifically:

import numpy as np

def create_mean_embedding(words, model):
    # Average the individual word vectors to get one document vector.
    return np.mean(
        [model[word] for word in words if word in model],
        axis=0)

This captures the average semantics of a document or query.

Another way we can build a document embedding is by taking the coordinate-wise max of all of the individual word embeddings:

def create_max_embedding(words, model):
    return np.amax(
        [model[word] for word in words if word in model],
        axis=0)

This would highlight the max of every semantic dimension.

We can also concatenate the two embedding types to get a better representation of the document:

def create_embedding(words, model):
    return np.concatenate([
        create_max_embedding(words, model),
        create_mean_embedding(words, model)
    ])
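For the 50-dimensional GloVe vectors used below, the concatenated embedding is 100-dimensional. A quick usage sketch (assuming model is a gensim KeyedVectors lookup):

vec = create_embedding(["investment", "bonanza"], model)
print(vec.shape)  # (100,): 50 max dimensions + 50 mean dimensions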

Let’s try applying this to the simple example documents we highlighted above. We will be using gensim to load a set of pre-trained GloVe word embeddings (glove-wiki-gigaword-50).

Ranking with embeddings
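The exact driver code isn’t shown here; a minimal sketch that would produce output of the shape below, assuming the create_embedding helper above and gensim’s downloader API:

import gensim.downloader as api
import numpy as np
from gensim.utils import tokenize

model = api.load("glove-wiki-gigaword-50")  # pre-trained GloVe vectors

documents = [
    "An investment bonanza is coming",
    "Who governs a country's airspace?",
    "What is a supermoon, and how noticeable is it to the naked eye?",
    "What the evidence says about police body-cameras",
    "Who controls Syria?",
]

def search(query):
    query_vec = create_embedding(list(tokenize(query.lower())), model)
    doc_vecs = [create_embedding(list(tokenize(doc.lower())), model) for doc in documents]
    scores = [
        vec.dot(query_vec) / (np.linalg.norm(vec) * np.linalg.norm(query_vec))
        for vec in doc_vecs
    ]
    print(f'Query: "{query}"')
    for rank, idx in enumerate(np.argsort(scores)[::-1]):
        print(f"Rank: {rank} | Top Article: {documents[idx]}")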

Query: "banking"

Rank: 0 | Top Article: An investment bonanza is coming
Rank: 1 | Top Article: Who governs a country's airspace?
Rank: 2 | Top Article: What is a supermoon, and how noticeable is it to the naked eye?
Rank: 3 | Top Article: Who controls Syria?
Rank: 4 | Top Article: What the evidence says about police body-cameras

Query: "astrology"

Rank: 0 | Top Article: What is a supermoon, and how noticeable is it to the naked eye?
Rank: 1 | Top Article: Who governs a country's airspace?
Rank: 2 | Top Article: Who controls Syria?
Rank: 3 | Top Article: What the evidence says about police body-cameras
Rank: 4 | Top Article: An investment bonanza is coming

Query: "middle east"

Rank: 0 | Top Article: Who controls Syria?
Rank: 1 | Top Article: Who governs a country's airspace?
Rank: 2 | Top Article: An investment bonanza is coming
Rank: 3 | Top Article: What is a supermoon, and how noticeable is it to the naked eye?
Rank: 4 | Top Article: What the evidence says about police body-cameras

Dual space embedding

So far we’ve used the same set of standard pre-trained word embeddings to represent both the query and the document. But let’s think about how word2vec constructs these vectors to come up with a more clever way to create them.

The core idea of word2vec is that words used in similar contexts have similar meanings. Focusing on the skip-gram variant of the model, this idea is formulated as trying to predict the context words from any given word. Check out an introduction to word2vec if this is unfamiliar.

In a skip-gram model, a one-hot encoded word is mapped to a dense representation, which is then mapped back to the context of that word (context here refers to the words that surround the target word). Usually, we keep the first transformation matrix W, which becomes our word embeddings, and discard the second matrix W’, which maps the dense representation back to context words.

If we think about the behavior of the W and W’ matrices: W’s goal is to map a word to an intermediate representation, while W’ maps from that intermediate representation to a context. This naturally results in W capturing the semantics of a word, while W’ captures the context of a word.

We can frame our search problem in a similar way: the query is the “word” and the document is the “context”. If the query is embedded using W and the document is embedded using W’, comparing the similarity between the two vectors indicates whether or not the document acts as a good context for the query.
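In gensim, the OUT matrix isn’t exposed through the usual lookup API, but a trained model keeps it around. A minimal sketch of extracting both matrices, assuming a skip-gram Word2Vec model trained with negative sampling (whose output weights live in syn1neg):

from gensim.models import Word2Vec
import numpy as np

# Toy corpus; in practice you would train on a much larger one.
sentences = [["harvard", "alumni", "faculty"], ["yale", "nyu", "harvard"]]
model = Word2Vec(sentences, vector_size=50, sg=1, negative=5, min_count=1)

W = model.wv.vectors   # IN embeddings: one row per vocabulary word
W_out = model.syn1neg  # OUT embeddings: same row order as model.wv

def in_vector(word):
    return W[model.wv.key_to_index[word]]

def out_vector(word):
    return W_out[model.wv.key_to_index[word]]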

Let’s consider two simple documents that we could potentially return for the query “harvard”, and decide which one is more relevant:

Document 1:

Harvard alumni receives

Document 2:

Yale, NYU and Harvard 

As a user, you’d probably want to see the first document. However, our current embedding model would assign the second document a higher relevance score. Why? Because the terms “Yale” and “NYU” are very similar to the term “Harvard”, since they are all universities. (Recall that the word2vec model treats words as similar when they appear in similar contexts, and university names appear in very similar contexts.) Therefore, the embedding for the query and document 2 will be more similar than the query and document 1.

However, using W’ to create the document embedding and W to create the query embedding gives us a similarity measure of “how well does this document represent the context of this query?”.
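As a sketch, reusing the in_vector and out_vector helpers from above with the same mean-pooling as before, an IN-OUT relevance score might look like:

def in_out_score(query_tokens, doc_tokens):
    # Query embedded with W (IN), document embedded with W' (OUT).
    q = np.mean([in_vector(w) for w in query_tokens if w in model.wv], axis=0)
    d = np.mean([out_vector(w) for w in doc_tokens if w in model.wv], axis=0)
    return q.dot(d) / (np.linalg.norm(q) * np.linalg.norm(d))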

Indeed, these are the results published in the Dual Embedding Space Model (DESM) paper (Mitra et al., 2016).

The table in the paper shows the closest word embeddings to the word “yale” under three different embedding comparisons:

  1. IN-IN: The bolded word is embedded with W and the words below are the closest word embeddings in W.

  2. OUT-OUT: The bolded word is embedded with W’ and the words below are the closest word embeddings in W’.

  3. IN-OUT: The bolded word is embedded with W and the words below are the closest word embeddings in W’.

Observe that while “Yale” is more semantically similar to “Harvard”, if a document about Harvard came up when searching “Yale”, that wouldn’t make a lot of sense.

With that in mind, it seems the IN-OUT embedding scheme provides the most relevant words for a search system. If someone searched for the word “Yale”, I would expect documents with the terms “faculty”, “alumni”, and “orientation” rather than “harvard” and “nyu”.

How do we deploy this?

Too many ML articles stop here, but deployment is typically half the challenge.

These models are usually deployed as a ranker behind a retrieval model, in a two-stage pipeline. The purpose of the retrieval model is to quickly find the ~100 most keyword-similar documents among millions, while the purpose of the ranking model is to find the most semantically relevant documents within that set of 100.

BM25 is a common document retrieval method, implemented in the rank_bm25 Python package. First, let’s implement a retrieval model that leverages rank_bm25. In a production setting, we would probably use something like Elasticsearch for this part.

from rank_bm25 import BM25Okapi as BM25

class Retriever(object):
    def __init__(self, documents):
        self.corpus = documents
        self.bm25 = BM25(self.corpus)

    def query(self, tokenized_query, n=100):
        scores = self.bm25.get_scores(tokenized_query)
        best_docs = sorted(range(len(scores)), key=lambda i: -scores[i])[:n]
        return best_docs, [scores[i] for i in best_docs]

Next, let’s build the ranking model, which takes a list of documents and re-ranks them according to the similarity between the query embedding and the document embedding.

import numpy as np

class Ranker(object):
    def __init__(self, query_embedding, document_embedding):
        # Separate lookups for query and document vectors, so a dual
        # embedding space (W for queries, W' for documents) can be plugged in.
        self.query_embedding = query_embedding
        self.document_embedding = document_embedding

    def _embed(self, tokens, model):
        embedding = np.mean(
            np.array([model[token] for token in tokens if token in model]),
            axis=0,
        )
        unit_embedding = embedding / (embedding**2).sum()**0.5
        return unit_embedding

    def rank(self, tokenized_query, tokenized_documents):
        """
        Re-ranks a set of documents according to embedding similarity
        between the query and each document.
        """
        query_embedding = self._embed(tokenized_query, self.query_embedding)  # (E,)
        document_embeddings = np.array([
            self._embed(document, self.document_embedding)
            for document in tokenized_documents
        ])  # (N, E)
        scores = document_embeddings.dot(query_embedding)
        index_rankings = np.argsort(scores)[::-1]
        return index_rankings, np.sort(scores)[::-1]

And now, let’s stitch the code together into a small CLI. (The original post’s document reader isn’t shown, so the TSV loading below assumes one document per line, with the text in the first column.)

import click
import gensim
import gensim.downloader as api

def tokenize(text):
    return list(gensim.utils.tokenize(text.lower()))

@click.command()
@click.option("--path", prompt="Path to document TSV", help="Document TSV")
@click.option("--query", prompt="Search query", help="Search query")
def main(path, query):
    # Assumption: one document per line, with the text in the first TSV column.
    with open(path) as f:
        documents = [line.strip().split("\t")[0] for line in f]
    corpus = [tokenize(doc) for doc in documents]  # Tokenize
    tokenized_query = tokenize(query)

    retriever = Retriever(corpus)
    retrieval_indexes, retrieval_scores = retriever.query(tokenized_query)  # BM25 rankings

    retrieved_documents = [documents[idx] for idx in retrieval_indexes]
    tokenized_retrieved_documents = [corpus[idx] for idx in retrieval_indexes]

    # Same pre-trained embeddings on both sides here; a true dual space
    # setup would pass the IN matrix for queries and the OUT matrix for documents.
    word_embedding = api.load('glove-wiki-gigaword-50')
    ranker = Ranker(query_embedding=word_embedding, document_embedding=word_embedding)
    ranker_indexes, ranker_scores = ranker.rank(tokenized_query, tokenized_retrieved_documents)  # Re-rank
    reranked_documents = [retrieved_documents[idx] for idx in ranker_indexes]

    for rank, (doc, score) in enumerate(zip(reranked_documents, ranker_scores), start=1):
        print(f"======== RANK: {rank} | SCORE: {score} =======")
        print(doc)

if __name__ == "__main__":
    main()
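Assuming the script is saved as search.py (a name chosen here for illustration), it can be invoked as python search.py --path articles.tsv --query "harvard hockey".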

Let’s take a look at real example output of this for the query “harvard hockey” against the Microsoft News Dataset. Notably, no document in the corpus contains both the terms “harvard” and “hockey”, which makes this a tricky query.

Here are the top 20 results for the query “harvard hockey” from the BM25 model.

======== RANK: 1 | SCORE: 10.502162012723257 =======
Admissions scandal unfolding at Harvard University

======== RANK: 2 | SCORE: 9.980622270107766 =======
How a Harvard class project changed barbecue

======== RANK: 3 | SCORE: 9.980622270107766 =======
Then & Now: Whitcomb House, 51 Harvard St., Worcester

======== RANK: 4 | SCORE: 9.508431413948806 =======
Girl recreates photo with Harvard cop 15 years later

======== RANK: 5 | SCORE: 9.508431413948806 =======
Queen Latifah to receive Harvard black culture award

======== RANK: 6 | SCORE: 9.508431413948806 =======
Photos: Elevator at Harvard Square MBTA station reopens

======== RANK: 7 | SCORE: 9.508431413948806 =======
Out Of Town News In Harvard Square Closing

======== RANK: 8 | SCORE: 9.508431413948806 =======
Vegan Cheesesteaks Arrive in Harvard Square This Weekend

======== RANK: 9 | SCORE: 9.508431413948806 =======
MIT, Harvard Professors Share Nobel Prize In Economics

======== RANK: 10 | SCORE: 9.078901648531026 =======
Harvard students triggered after newspaper asks ICE for comment

======== RANK: 11 | SCORE: 9.078901648531026 =======
Yale-Harvard football game to kick off at noon

======== RANK: 12 | SCORE: 8.686501394096426 =======
Harvard Was 'Freaking Out': How a $270 Million Brazil Bet Tanked

======== RANK: 13 | SCORE: 8.686501394096426 =======
Harvard student newspaper faces backlash for requesting comment from ICE

======== RANK: 14 | SCORE: 8.686501394096426 =======
Harvard student government sides with anti-ICE protesters against newspaper

======== RANK: 15 | SCORE: 8.326615774135071 =======
Watch: Dartmouth stuns Harvard on game-winning, tip-drill Hail Mary

======== RANK: 16 | SCORE: 8.326615774135071 =======
It's now or never for Harvard basketball's senior class

======== RANK: 17 | SCORE: 8.158138015219139 =======
No. 2 Maryland field hockey vs American preview

======== RANK: 18 | SCORE: 8.158138015219139 =======
Minnesota Duluth completes hockey sweep of Gophers

======== RANK: 19 | SCORE: 8.158138015219139 =======
Lightning beat "Bruins hockey", win 4-3 in shootout

======== RANK: 20 | SCORE: 8.158138015219139 =======
MM 11.10: Field hockey NCAA Tournament seeding announced

The two documents with the highest scores from the BM25 retrieval model are “Admissions scandal unfolding at Harvard University” and “How a Harvard class project changed barbecue”. I suspect most people would not consider these very relevant to “harvard hockey”.

But if we re-rank the top 100 results from the BM25 model with the embedding-based model, here is what we get:

======== Embedding ========
======== RANK: 1 | SCORE: 0.7188475728034973 =======
ASU hockey completes first sweep of top-10 team

======== RANK: 2 | SCORE: 0.7149215936660767 =======
Field hockey: College commitments for the Class of 2020

======== RANK: 3 | SCORE: 0.7103195190429688 =======
New team at No. 1 in Michigan college hockey power rankings

======== RANK: 4 | SCORE: 0.7018747329711914 =======
Minnesota Duluth completes hockey sweep of Gophers

======== RANK: 5 | SCORE: 0.7000885009765625 =======
Lightning beat "Bruins hockey", win 4-3 in shootout

======== RANK: 6 | SCORE: 0.6870558261871338 =======
Yale-Harvard football game to kick off at noon

======== RANK: 7 | SCORE: 0.6836330890655518 =======
In Boston, the first trans hockey team takes the ice

======== RANK: 8 | SCORE: 0.6813685894012451 =======
Field hockey: Wildcats thrash Indiana as they head into postseason play

======== RANK: 9 | SCORE: 0.6806915998458862 =======
Palmyra field hockey, girls volleyball headed to state finals

======== RANK: 10 | SCORE: 0.6806261539459229 =======
Soccer, Volleyball, Field Hockey State Tournament Pairings Set

======== RANK: 11 | SCORE: 0.6752084493637085 =======
Gophers hockey gears up for high-scoring Penn State

======== RANK: 12 | SCORE: 0.6729203462600708 =======
Michigan college hockey power rankings: Rough start for most teams

======== RANK: 13 | SCORE: 0.6702247262001038 =======
College hockey heavyweights Denver, Minnesota Duluth set to face off

======== RANK: 14 | SCORE: 0.6696025729179382 =======
No. 2 Maryland field hockey weekend preview: No. 9 Michigan and No. 21 Ohio State

======== RANK: 15 | SCORE: 0.664783239364624 =======
Wethersfield field hockey wins its first state tournament game in 34 years

======== RANK: 16 | SCORE: 0.660230278968811 =======
St. Cloud State hockey picks up overtime win over Northern Michigan

======== RANK: 17 | SCORE: 0.6559702157974243 =======
Jenuwine scores first college goal, helping ASU hockey to sweep Air Force

======== RANK: 18 | SCORE: 0.6555651426315308 =======
COLLEGE HOCKEY: Salve Regina primed for another successful season

======== RANK: 19 | SCORE: 0.6528381109237671 =======
Kellam makes first appearance at VHSL state field hockey tournament

======== RANK: 20 | SCORE: 0.6525543928146362 =======
No. 2 Maryland field hockey falls to No. 9 Michigan, 1-0, in Ann Arbor

Now the results are much more semantically relevant, dominated by college hockey articles.

Summary

A dual embedding space approach to creating query and document embeddings for embedding-based ranking can offer more contextually relevant results than the traditional single-space word embedding approach.

References
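Mitra, B., Nalisnick, E., Craswell, N., and Caruana, R. (2016). A Dual Embedding Space Model for Document Ranking. arXiv:1602.01137.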