将1个输入句子与1个给定句子通过相似性映射。

huangapple go评论63阅读模式
英文:

Map (1:1) N input sentences to N given sentences by similarity

问题

I want to map N_a input sentences to N_b given sentences so that the mapping is one-to-one. That is,

  • Every N_a is assigned
  • No N_a appears more than once

Unfortunately the inputs vary slightly over time. Here is a representative mapping:

{ "typeA": "step1 typeA (before typeB)"
, "typeD": "type D"
, "actionA": "actionA-suffix: typeB or Type D (type E available)"
, "typeE": "typeE - (not-actionA)"
, "actionB": "actionB some descriptive words"
, "typeA subtypeA": "subtypeA typeA or typeX - (not for typeB)"
, "actionA subtypeA": "actionA-suffix: subtypeA (type E available)"
, "typeB subtypeA": "subtypeA typeB"
, "typeC subtypeA": "subtypeA typeC"
, "typeB": "typeB (not subtypeA)"
, "typeF": "get typeF or typeF-subtypeA"
, "typeF actionB": "actionB typeF or typeF subtypeA"
}

Following [1], I've created this workflow using the sentence_transformers package[2]:

  • BERT -> mean-pool -> cosine similarity

Given the inputs it is clear that string-based alignment plus edit-distance (of the type featured in rapidfuzz[3]) won't work. But I'm not sure if BERT is the best approach, or if it is overkill for my needs. Perhaps I should use a word embedding model (word2vec, glove, etc) rather than a sentence embedding model. For this particular task I wonder if BERT could be tweaked to perform better.

The categories aren't well differentiated so that BERT sometimes maps an input to more than one given. An input N_a can be the best match for multiple givens N_b0, ..., N_bi. In such cases I keep the map with the best score and for the others fall upon the 2nd-best, 3rd-best, ... map. How can I improve BERT's performance to avoid these duplicates?


Current implementation below:

import pandas as pd
import sentence_transformers as st

model = st.SentenceTransformer('all-mpnet-base-v2')

sentences_input = [ ... ]
sentences_given = [ ... ]

# Create the encodings and apply mean pooling.
enc_input = model.encode(sentences_input)
enc_given = model encode(sentences_given)

# Calculate a cosine similarity matrix
cos_similarity = st.util.cos_sim(enc_input, enc_given)

# As a pandas dataframe, label the matrix with the sentences.
df = pd.DataFrame\
  (columns=sentences_input, index=sentences_given, dtype=float)
for i, sgiven in enumerate(sentences_given):
  for j, sinput in enumerate(sentences_input):
    df.loc[sgiven, sinput] = cos_similarity[j,i].item()

# For each given sentence, extract the best-scoring input sentence. Which
# unfortunately is not a one-to-one mapping.
mapping_bad_duplicates = df.idxmax(axis=1)

# Create a one-to-one mapping by iterating over the matches in order of best
# score. For each map, blocklist the row and column sentences, preventing
# duplicates.
mapping_good = {}
by_scores = sorted(df.unstack().items(), key=lambda k: k[1], reverse=True)
sentences = set(sentences_input) | set(sentences_given)
for (sinput,sgiven), score in by_scores:
  if not sentences:
    break
  if sgiven not in sentences or sinput not in sentences:
    continue
  mapping_good[sgiven] = sinput
  sentences.remove(sgiven)
  sentences.remove(sinput)

# Convert the result to a dataframe
mapping_good_df = pd.Series(mapping_good)
  1. BERT For Measuring Text Similarity
  2. SBERT: Semantic Textual Similarity
  3. rapidfuzz
英文:

I want to map N_a input sentences to N_b given sentences so that the mapping is one-to-one. That is,

  • Every N_a is assigned
  • No N_a appears more than once

Unfortunately the inputs vary slightly over time. Here is a representative mapping:

{ "typeA":            "step1 typeA (before typeB)"
, "typeD":            "type D"
, "actionA":          "actionA-suffix: typeB or Type D (type E available)"
, "typeE":            "typeE - (not-actionA)"
, "actionB":          "actionB some descriptive words"
, "typeA subtypeA":   "subtypeA typeA or typeX - (not for typeB)"
, "actionA subtypeA": "actionA-suffix: subtypeA (type E available)"
, "typeB subtypeA":   "subtypeA typeB"
, "typeC subtypeA":   "subtypeA typeC"
, "typeB":            "typeB (not subtypeA)"
, "typeF":            "get typeF or typeF-subtypeA"
, "typeF actionB":    "actionB typeF or typeF subtypeA"
}

Following [1], I've created this workflow using the sentence_transformers package[2]:

  • BERT -> mean-pool -> cosine similarity

Given the inputs it is clear that string-based alignment plus edit-distance (of the type featured in rapidfuzz[3]) won't work. But I'm not sure if BERT is the best approach, or if it is overkill for my needs. Perhaps I should use a word embedding model (word2vec, glove, etc) rather than a sentence embedding model. For this particular task I wonder if BERT could be tweaked to perform better.

The categories aren't well differentiated so that BERT sometimes maps an input to more than one given. An input N_a can be the best match for multiple givens N_b0, ..., N_bi. In such cases I keep the map with the best score and for the others fall upon the 2nd-best, 3rd-best, ... map. How can I improve BERT's performance to avoid these duplicates?


Current implementation below:

import pandas as pd
import sentence_transformers as st

model = st.SentenceTransformer('all-mpnet-base-v2')

sentences_input = [ ... ]
sentences_given = [ ... ]

# Create the encodings and apply mean pooling.
enc_input = model.encode(sentences_input)
enc_given = model.encode(sentences_given)

# Calculate a cosine similarity matrix
cos_similarity = st.util.cos_sim(enc_input, enc_given)

# As a pandas dataframe, label the matrix with the sentences.
df = pd.DataFrame\
  (columns=sentences_input, index=sentences_given, dtype=float)
for i, sgiven in enumerate(sentences_given):
  for j, sinput in enumerate(sentences_input):
    df.loc[sgiven, sinput] = cos_similarity[j,i].item()

# For each given sentence, extract the best-scoring input sentence. Which
# unfortunately is not a one-to-one mapping.
mapping_bad_duplicates = df.idxmax(axis=1)

# Create a one-to-one mapping by iterating over the matches in order of best
# score. For each map, blocklist the row and column sentences, preventing
# duplicates.
mapping_good = {}
by_scores = sorted(df.unstack().items(), key=lambda k: k[1], reverse=True)
sentences = set(sentences_input) | set(sentences_given)
for (sinput,sgiven), score in by_scores:
  if not sentences:
    break
  if sgiven not in sentences or sinput not in sentences:
    continue
  mapping_good[sgiven] = sinput
  sentences.remove(sgiven)
  sentences.remove(sinput)

# Convert the result to a dataframe
mapping_good_df = pd.Series(mapping_good)
  1. BERT For Measuring Text Similarity
  2. SBERT: Semantic Textual Similarity
  3. rapidfuzz

答案1

得分: 1

这里不需要任何复杂的自然语言处理。这看起来像一个相对简单的分配问题。对于字符串ab,分别属于a_listb_list,定义score(a, b)ab共有的单词数量,然后解决分配问题以最大化得分。

下面的代码并没有完全解决分配问题,而是返回了一种贪婪解决方案,但它仍然找到了最佳匹配:

def score(a, b):
    b_words = b.replace(' - ', ' ').split()
    return sum((w in b_words) for w in a.split())

def greedy_matching(a_list, b_list):
    b_available = set(b_list)
    for a in sorted(a_list, key=len, reverse=True):
        b = max(b_available, key=lambda b: score(a.lower(), b))
        yield (a, b)
        b_available.remove(b)

a_list = ['typeC subtypeA', 'actionA', 'typeD', 'typeE', 'typeF', 'actionA subtypeA', 'typeB', 'typeF actionB', 'typeB subtypeA', 'actionB', 'typeA subtypeA', 'typeA']

b_list = ['typeB (not subtypeA)', 'typeE - (not-actionA)', 'actionB some descriptive words', 'get typeF or typeF-subtypeA', 'type D', 'subtypeA typeA or typeX - (not for typeB)', 'step1 typeA (before typeB)', 'subtypeA typeC', 'actionA-suffix: subtypeA (type E available)', 'subtypeA typeB', 'actionB typeF or typeF subtypeA', 'actionA-suffix: typeB or Type D (type E available)']

b_list = 
展开收缩
pairs = list(greedy_matching(a_list, b_list)) print(*pairs, sep='\n')

输出:

('actionA subtypeA', 'actiona-suffix: subtypea (typee available)')
('typeC subtypeA', 'subtypea typec')
('typeB subtypeA', 'subtypea typeb')
('typeA subtypeA', 'subtypea typea or typex - (not for typeb)')
('typeF actionB', 'actionb typef or typef subtypea')
('actionA', 'actiona-suffix: typeb or typed (typee available)')
('actionB', 'actionb some descriptive words')
('typeD', 'typed')
('typeE', 'typee - (not-actiona)')
('typeF', 'get typef or typef-subtypea')
('typeB', 'typeb (not subtypea)')
('typeA', 'step1 typea (before typeb)')
英文:

There is no need for any fancy NLP at all. This looks like a relatively straightforward assignment problem. For a string a of a_list and a string b of b_list, define score(a, b) to be the number of words in common of a and b, and then solve the assignment problem to maximise the scores.

Below I didn't even properly solve the assignment problem, and instead returned a greedy solution, and yet it still found the best matches:

def score(a, b):
    b_words = b.replace('-', ' ').split()
    return sum((w in b_words) for w in a.split())

def greedy_matching(a_list, b_list):
    b_available = set(b_list)
    for a in sorted(a_list, key=len, reverse=True):
        b = max(b_available, key=lambda b: score(a.lower(), b))
        yield (a, b)
        b_available.remove(b)

a_list = ['typeC subtypeA', 'actionA', 'typeD', 'typeE', 'typeF', 'actionA subtypeA', 'typeB', 'typeF actionB', 'typeB subtypeA', 'actionB', 'typeA subtypeA', 'typeA']

b_list = ['typeB (not subtypeA)', 'typeE - (not-actionA)', 'actionB some descriptive words', 'get typeF or typeF-subtypeA', 'type D', 'subtypeA typeA or typeX - (not for typeB)', 'step1 typeA (before typeB)', 'subtypeA typeC', 'actionA-suffix: subtypeA (type E available)', 'subtypeA typeB', 'actionB typeF or typeF subtypeA', 'actionA-suffix: typeB or Type D (type E available)']

b_list = 
展开收缩
pairs = list(greedy_matching(a_list, b_list)) print(*pairs, sep='\n')

Output:

('actionA subtypeA', 'actiona-suffix: subtypea (typee available)')
('typeC subtypeA', 'subtypea typec')
('typeB subtypeA', 'subtypea typeb')
('typeA subtypeA', 'subtypea typea or typex - (not for typeb)')
('typeF actionB', 'actionb typef or typef subtypea')
('actionA', 'actiona-suffix: typeb or typed (typee available)')
('actionB', 'actionb some descriptive words')
('typeD', 'typed')
('typeE', 'typee - (not-actiona)')
('typeF', 'get typef or typef-subtypea')
('typeB', 'typeb (not subtypea)')
('typeA', 'step1 typea (before typeb)')

huangapple
  • 本文由 发表于 2023年3月8日 18:51:24
  • 转载请务必保留本文链接:https://go.coder-hub.com/75672080.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定