Semantic Similarity Search for Phrases

Word vector averaging is a way to find semantically similar sentences. The idea is simple: find the word embedding for each word using an algorithm like word2vec or GloVe, average the embeddings together to get a sentence vector, and match sentences with the most similar sentence vectors based on Euclidean distance or cosine similarity.
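As a rough sketch of the sentence-level idea (assuming the medium English spaCy model is installed; the helper names below are just for illustration):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # medium English model with word vectors

def sentence_vector(sentence):
  # average the per-token word vectors into a single sentence vector
  doc = nlp(sentence)
  return np.mean([token.vector for token in doc], axis=0)

def cosine_similarity(u, v):
  return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = sentence_vector("Yucaipa sold the chain to Safeway.")
b = sentence_vector("Safeway bought the chain from Yucaipa.")
print(cosine_similarity(a, b))  # higher means more similar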

I recently became interested in extending the idea to the phrase level. If we have two sentences, can we match up the semantically similar phrases across the two sentences?

For example, if I have the following two sentences:

Sentence A: Yucaipa owned Dominick’s before selling the chain to Safeway in 1998 for $2.5 billion.

Sentence B: Yucaipa bought Dominick’s in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.

I want my algorithm to return similar phrases:

“Yucaipa owned Dominick’s” / “Yucaipa bought Dominick’s”

“selling the chain to Safeway” / “sold it to Safeway”

“in 1998 for $2.5 billion” / “for $1.8 billion in 1998”

To accomplish this, I wrote an algorithm with the following steps:

1) Parse the sentence into phrases using a statistical constituency parser.

2) Create a word vector average for each phrase using word2vec embeddings.

3) Match the most similar phrases based on the closest Euclidean distance between their word vector averages.

I’ve explained the steps with code below, and you can find the full Jupyter notebook here.

Load the data

Let’s test our algorithm on the Microsoft Research Paraphrase Corpus, which contains pairs of paraphrased sentences extracted from news articles. We’ll create a nested list of paraphrase pairs:

import re
import urllib.request

def load_data(n=5):
  '''
  Function to load paraphrase pairs from the Microsoft Research Paraphrase Corpus.

  Args:
      n (int): Number of paraphrase pairs to load.

  Returns:
      new_list: A nested list of n paraphrase pairs.
  '''
  target_url = 'https://raw.githubusercontent.com/wasiahmad/paraphrase_identification/master/dataset/msr-paraphrase-corpus/msr_paraphrase_data.txt'
  i = 0; data = []
  for sentence in urllib.request.urlopen(target_url):
    # skip the header line
    if i == 0: i += 1; continue
    sentence = sentence.decode()
    sentence = re.split(r'\t+', sentence)
    data.append(sentence[1])
    # increment the counter of sentences read
    i += 1
    if i > n*2: break

  # pair consecutive sentences into a nested list
  new_list = []
  for i in range(0, len(data)-1, 2):
    new_list.append([data[i], data[i+1]])

  print("Data loaded")

  return new_list
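Assuming the corpus URL above is still reachable, calling it looks like this:

data = load_data()    # defaults to n=5 paraphrase pairs
print(len(data))      # 5
print(data[0][0])     # first sentence of the first pair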

We’ll also load a medium spaCy model pretrained on English text:

import en_core_web_md

def get_model():
  '''
  Function to get the spaCy model with the medium neural net.

  Args:
      None

  Returns:
      nlp: The loaded spaCy model.
  '''
  nlp = en_core_web_md.load()
  print("Model loaded")
  return nlp
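The model has to be downloaded before it can be imported. A quick sanity check after loading (en_core_web_md uses 300-dimensional vectors):

# install once, e.g.:
#   pip install spacy
#   python -m spacy download en_core_web_md
nlp = get_model()
print(nlp("Safeway").vector.shape)   # (300,) for en_core_web_md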

Load the phrase parser

The first step in solving this problem is parsing the sentences into phrases. We can use the Stanford Parser, which we can download here, to turn each sentence into a phrase-structure tree based on a pretrained statistical grammar and then read the phrases off that tree.

import os
from nltk.parse.stanford import StanfordParser

def load_parser():
  '''
  Function to load the Stanford parser.

  Args:
      None

  Returns:
      parser: StanfordParser object
  '''
  # download and unzip the parser if it is not already present
  path_exists = os.path.exists('/content/stanford-parser-full-2018-10-17') ## put your path to the downloaded parser here
  if path_exists:
    print("Parser already downloaded")
  else:
    print("Load and configure StanfordParser")
    !wget https://nlp.stanford.edu/software/stanford-parser-full-2018-10-17.zip
    !unzip stanford-parser-full-2018-10-17.zip

  stanford_parser_dir = '/content/stanford-parser-full-2018-10-17'
  path_to_models = stanford_parser_dir + "/stanford-parser-3.9.2-models.jar"
  path_to_jar = stanford_parser_dir + "/stanford-parser.jar"
  parser = StanfordParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models)
  return parser

Using the parser, we can create a tree traversal object: we feed each sentence into the parser and then traverse the resulting tree to extract the relevant phrases. The traversal object gathers both noun phrases and prepositional phrases.

import nltk

class Traverse():
  '''
  Traverse object to walk parse trees and collect noun and prepositional phrases.
  To use, call traverse_tree() with the output of the StanfordParser,
  then call phrase_strings() with self.phrases to get the phrases of
  the input sentence.
  '''
  def __init__(self):
    self.phrases = []

  # collect every leaf word of a phrase subtree
  def traverse_phrase(self, tree, phrases):
    for subtree in tree:
      if type(subtree) == nltk.tree.Tree:
        self.traverse_phrase(subtree, phrases)
      else:
        phrases.append(subtree)

  # traverse the tree to gather noun phrases and prepositional phrases
  def traverse_tree(self, tree):
    for subtree in tree:
      if type(subtree) == nltk.tree.Tree:
        if subtree.label() == 'NP' or subtree.label() == 'PP':
          self.traverse_phrase(subtree, self.phrases)
          self.phrases.append('\n')   # phrase boundary marker
        else:
          self.traverse_tree(subtree)

  # join the collected words back into phrase strings
  def phrase_strings(self, phrase_list):
    a = " ".join(phrase_list).split("\n")
    a = [i.strip() for i in a if i]
    return a
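Used on its own, the traversal looks roughly like this; the exact phrase strings depend on the parse the Stanford parser returns:

parser = load_parser()

t = Traverse()
tree = parser.raw_parse("Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion.")
t.traverse_tree(tree)
print(t.phrase_strings(t.phrases))
# e.g. ['Yucaipa', 'Dominick', 'before selling the chain to Safeway in 1998 for $ 2.5 billion']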

Calculate a semantic measure for each phrase

Now we need a metric that captures the semantics of each phrase. I calculate the word vector of each word in the phrase and then average the vectors together. Notably, I did not normalize the word vectors before averaging, because the differing vector lengths give rise to a weighted average, which Arefyev et al. suggest is more accurate.

While research comparing weighted and unweighted word vector averages is scarce, one theory is that infrequent words tend to have longer word vectors. Infrequent words may be more poorly represented by the embedding, which is based on distributional frequency, so giving them more weight in the average may make the overall phrase vector more accurate.

import numpy as np

def wva(string):
    '''
    Finds a phrase vector by averaging each word's vector.

    Args:
      string (str): Input phrase or sentence

    Returns:
      array: Word vector average
    '''
    # uses the spaCy model loaded above as the global nlp
    doc = nlp(string)
    wvs = np.array([token.vector for token in doc])
    return np.mean(wvs, axis=0)
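For contrast, a normalized variant (not used in this post) would scale each word vector to unit length before averaging, so every word contributes equally. A minimal sketch, assuming the same spaCy setup:

def wva_normalized(string):
    '''Word vector average with unit-normalized token vectors (for comparison only).'''
    doc = nlp(string)
    vecs = [t.vector / np.linalg.norm(t.vector)
            for t in doc if np.linalg.norm(t.vector) > 0]
    return np.mean(vecs, axis=0)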

Now we will find the phrases with the greatest semantic similarity. The most common measures of semantic similarity are cosine similarity and Euclidean distance. Perhaps unconventionally, I chose the latter in order to keep the information carried by the lengths of the word vector averages, which Arefyev et al. suggest is significant in representing word frequency.
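A toy example illustrates the difference: two vectors pointing the same way but with different lengths are identical under cosine similarity, while Euclidean distance keeps the magnitude information.

import numpy as np

u = np.array([1.0, 0.0])
v = np.array([3.0, 0.0])   # same direction, three times the length

cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
euclidean = np.linalg.norm(u - v)
print(cosine)      # 1.0  (the length difference is invisible)
print(euclidean)   # 2.0  (the length difference is preserved)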

def match_wv_pair(phrasesA, phrasesB):
  '''
  Takes two lists of phrases (one list per sentence) and finds the smallest Euclidean distance between each pair's word vector averages (non-exclusive).

  Args:
  phrasesA (list of str): List of parsed phrases from sentence A
  phrasesB (list of str): List of parsed phrases from sentence B to compare with sentence A

  Returns:
  matches (list of str): List of matches between the two phrase lists (surjective, i.e. multiple phrases can share the same match).
  '''
  # get word vector averages for each phrase
  wva_a = []; wva_b = []
  for i in phrasesA:
    wva_a.append(wva(i))
  for j in phrasesB:
    wva_b.append(wva(j))

  # swap so that the shorter list is on the outer loop
  if len(wva_a) > len(wva_b):
    wva_a, wva_b = wva_b, wva_a
    phrasesA, phrasesB = phrasesB, phrasesA

  matches = []
  for i in range(len(wva_a)):
    distances = []
    for j in range(len(wva_b)):
      # Euclidean distance between the two phrase vectors
      distances.append(np.linalg.norm(wva_a[i] - wva_b[j]))
    best = int(np.argmin(distances))
    matches.append("Sentence A: " + phrasesA[i] + "\nSentence B: " + phrasesB[best]
                   + "\nEuclidean Distance: " + str(distances[best]) + "\n")

  return matches

Run the final function

Finally, I wrote a function that runs everything above: it takes in our data and parser, parses the sentences into phrases, turns the phrases into word vector averages, and then matches the averages based on Euclidean distance. The function prints out semantically similar phrases for each paraphrase pair in the original dataset.

def find_similar_phrases(data, parser):
  '''
  Uses the Traverse object to build a phrase tree for each sentence, collects the noun and prepositional phrases, and prints the closest matches.

  Args:
  data (list of lists of str): Sentence pairs in the format [[a, b], [c, d], ...] to find similar phrases between.
  parser: The StanfordParser object returned by load_parser().

  Returns:
  Nothing (prints the original sentences, the A phrases, the B phrases, the matching phrases, and their Euclidean distances)
  '''
  for a, b in data:
    print("Original sentences:")
    print("Sentence A: ", a)
    print("Sentence B: ", b)
    print()

    # parse sentence A and collect its phrases
    ta = Traverse()
    phrasesA = parser.raw_parse(a)
    ta.traverse_tree(phrasesA)
    a = ta.phrase_strings(ta.phrases)

    # parse sentence B and collect its phrases
    tb = Traverse()
    phrasesB = parser.raw_parse(b)
    tb.traverse_tree(phrasesB)
    b = tb.phrase_strings(tb.phrases)

    print("A phrases:", a)
    print("B phrases:", b)
    print()

    matches = match_wv_pair(a, b)
    print("Similar phrases:")
    for i in matches:
      print(i)
    print()
    print()
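Putting it all together, the pipeline can be run with a short driver like this, assuming the cells above have been executed in the same notebook (wva() relies on the global nlp):

data = load_data(n=5)    # paraphrase pairs from the MSR corpus
nlp = get_model()        # medium spaCy model used by wva()
parser = load_parser()   # Stanford parser used for phrase extraction

find_similar_phrases(data, parser)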

The function returns:


Original sentences:
Sentence A:  Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion.
Sentence B:  Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.

A phrases: ['Yucaipa', 'Dominick', 'before selling the chain to Safeway in 1998 for $ 2.5 billion']
B phrases: ['Yucaipa', "Dominick 's in 1995", 'for $ 693 million', 'it', 'to Safeway', 'for $ 1.8 billion in 1998']

Similar phrases:
Sentence A: Yucaipa
Sentence B: Yucaipa
Euclidean Distance: 0.0

Sentence A: Dominick
Sentence B: Dominick 's in 1995
Euclidean Distance: 5.5339174

Sentence A: before selling the chain to Safeway in 1998 for $ 2.5 billion
Sentence B: for $ 1.8 billion in 1998
Euclidean Distance: 2.0490212


Original sentences:
Sentence A:  They had published an advertisement on the Internet on June 10, offering the cargo for sale, he added.
Sentence B:  On June 10, the ship's owners had published an advertisement on the Internet, offering the explosives for sale.

A phrases: ['They', 'an advertisement on the Internet on June 10', 'the cargo for sale', 'he']
B phrases: ['On June 10', "the ship 's owners", 'an advertisement on the Internet', 'the explosives', 'for sale']

Similar phrases:
Sentence A: They
Sentence B: the ship 's owners
Euclidean Distance: 4.217091

Sentence A: an advertisement on the Internet on June 10
Sentence B: an advertisement on the Internet
Euclidean Distance: 1.383867

Sentence A: the cargo for sale
Sentence B: for sale
Euclidean Distance: 2.6837785

Sentence A: he
Sentence B: the ship 's owners
Euclidean Distance: 5.162956

To improve the function, we could train a parser on the corpus itself to extract better phrases. We could also experiment with other ways of traversing the existing tree and try other similarity measures, including cosine similarity, Mahalanobis distance, and relaxed word mover’s distance.
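For instance, swapping in cosine similarity would only require replacing the distance computation inside match_wv_pair with something like this sketch (one minus cosine similarity, so that smaller still means more similar):

import numpy as np

def cosine_distance(u, v):
  # 1 - cosine similarity, so smaller values still mean "more similar"
  return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# inside match_wv_pair, the Euclidean line could become:
#   distances.append(cosine_distance(wva_a[i], wva_b[j]))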

I am also interested in comparing the accuracy of normalized vs. unnormalized word vector averages and exploring the relationship between vector length and word frequency.

I hope this tutorial was helpful to you! The full code for this post is available on GitHub here.
