Map (1:1) N input sentences to N given sentences by similarity
Question
I want to map N_a input sentences to N_b given sentences so that the mapping is one-to-one. That is,
- Every N_a is assigned
- No N_a appears more than once
Unfortunately the inputs vary slightly over time. Here is a representative mapping:
{ "typeA": "step1 typeA (before typeB)"
, "typeD": "type D"
, "actionA": "actionA-suffix: typeB or Type D (type E available)"
, "typeE": "typeE - (not-actionA)"
, "actionB": "actionB some descriptive words"
, "typeA subtypeA": "subtypeA typeA or typeX - (not for typeB)"
, "actionA subtypeA": "actionA-suffix: subtypeA (type E available)"
, "typeB subtypeA": "subtypeA typeB"
, "typeC subtypeA": "subtypeA typeC"
, "typeB": "typeB (not subtypeA)"
, "typeF": "get typeF or typeF-subtypeA"
, "typeF actionB": "actionB typeF or typeF subtypeA"
}
Following [1], I've created this workflow using the sentence_transformers package [2]:
- BERT -> mean-pool -> cosine similarity
Given the inputs it is clear that string-based alignment plus edit-distance (of the type featured in rapidfuzz [3]) won't work. But I'm not sure if BERT is the best approach, or if it is overkill for my needs. Perhaps I should use a word embedding model (word2vec, glove, etc.) rather than a sentence embedding model. For this particular task I wonder if BERT could be tweaked to perform better.
The categories aren't well differentiated, so BERT sometimes maps an input to more than one given. An input N_a can be the best match for multiple givens N_b0, ..., N_bi. In such cases I keep the mapping with the best score and, for the others, fall back on the 2nd-best, 3rd-best, ... match. How can I improve BERT's performance to avoid these duplicates?
Current implementation below:
import pandas as pd
import sentence_transformers as st
model = st.SentenceTransformer('all-mpnet-base-v2')
sentences_input = [ ... ]
sentences_given = [ ... ]
# Create the encodings and apply mean pooling.
enc_input = model.encode(sentences_input)
enc_given = model.encode(sentences_given)
# Calculate a cosine similarity matrix
cos_similarity = st.util.cos_sim(enc_input, enc_given)
# As a pandas dataframe, label the matrix with the sentences.
df = pd.DataFrame(columns=sentences_input, index=sentences_given, dtype=float)
for i, sgiven in enumerate(sentences_given):
    for j, sinput in enumerate(sentences_input):
        df.loc[sgiven, sinput] = cos_similarity[j, i].item()
# For each given sentence, extract the best-scoring input sentence. This is
# unfortunately not a one-to-one mapping.
mapping_bad_duplicates = df.idxmax(axis=1)
# Create a one-to-one mapping by iterating over the matches in order of best
# score. For each map, blocklist the row and column sentences, preventing
# duplicates.
mapping_good = {}
by_scores = sorted(df.unstack().items(), key=lambda k: k[1], reverse=True)
sentences = set(sentences_input) | set(sentences_given)
for (sinput, sgiven), score in by_scores:
    if not sentences:
        break
    if sgiven not in sentences or sinput not in sentences:
        continue
    mapping_good[sgiven] = sinput
    sentences.remove(sgiven)
    sentences.remove(sinput)
# Convert the result to a pandas Series
mapping_good_df = pd.Series(mapping_good)
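For reference, the greedy blocklist loop above could also be replaced by an optimal one-to-one assignment. A minimal sketch (not part of my current code, assuming scipy is available) using scipy.optimize.linear_sum_assignment on the same cosine-similarity matrix:
import numpy as np
from scipy.optimize import linear_sum_assignment
# Treat the cosine-similarity matrix as an assignment problem: pick the
# one-to-one pairing of inputs and givens that maximises total similarity.
# Rectangular matrices are fine; each row/column is used at most once.
sim = np.asarray(cos_similarity)   # convert the torch tensor to a NumPy array
row_idx, col_idx = linear_sum_assignment(sim, maximize=True)
mapping_optimal = {sentences_given[c]: sentences_input[r]
                   for r, c in zip(row_idx, col_idx)}
mapping_optimal_df = pd.Series(mapping_optimal)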
Answer 1
Score: 1
There is no need for any fancy NLP at all. This looks like a relatively straightforward assignment problem. For a string a from a_list and a string b from b_list, define score(a, b) to be the number of words that a and b have in common, and then solve the assignment problem to maximise the total score.
Below I didn't even properly solve the assignment problem, and instead returned a greedy solution, and yet it still found the best matches:
def score(a, b):
    b_words = b.replace('-', ' ').split()
    return sum((w in b_words) for w in a.split())

def greedy_matching(a_list, b_list):
    b_available = set(b_list)
    for a in sorted(a_list, key=len, reverse=True):
        b = max(b_available, key=lambda b: score(a.lower(), b))
        yield (a, b)
        b_available.remove(b)
a_list = ['typeC subtypeA', 'actionA', 'typeD', 'typeE', 'typeF', 'actionA subtypeA', 'typeB', 'typeF actionB', 'typeB subtypeA', 'actionB', 'typeA subtypeA', 'typeA']
b_list = ['typeB (not subtypeA)', 'typeE - (not-actionA)', 'actionB some descriptive words', 'get typeF or typeF-subtypeA', 'type D', 'subtypeA typeA or typeX - (not for typeB)', 'step1 typeA (before typeB)', 'subtypeA typeC', 'actionA-suffix: subtypeA (type E available)', 'subtypeA typeB', 'actionB typeF or typeF subtypeA', 'actionA-suffix: typeB or Type D (type E available)']
# Normalise the given sentences before matching: lower-case them and join
# "type X" into "typeX" so it is compared as a single word.
b_list = [b.lower().replace('type ', 'type') for b in b_list]
pairs = list(greedy_matching(a_list, b_list))
print(*pairs, sep='\n')
Output:
('actionA subtypeA', 'actiona-suffix: subtypea (typee available)')
('typeC subtypeA', 'subtypea typec')
('typeB subtypeA', 'subtypea typeb')
('typeA subtypeA', 'subtypea typea or typex - (not for typeb)')
('typeF actionB', 'actionb typef or typef subtypea')
('actionA', 'actiona-suffix: typeb or typed (typee available)')
('actionB', 'actionb some descriptive words')
('typeD', 'typed')
('typeE', 'typee - (not-actiona)')
('typeF', 'get typef or typef-subtypea')
('typeB', 'typeb (not subtypea)')
('typeA', 'step1 typea (before typeb)')
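For completeness, the assignment problem can also be solved exactly rather than greedily; a minimal sketch (assuming scipy is installed) that reuses score, a_list and b_list from above:
import numpy as np
from scipy.optimize import linear_sum_assignment
# Build the full score matrix (rows: a_list, columns: b_list) and let the
# Hungarian algorithm pick the one-to-one pairing with the maximum total score.
scores = np.array([[score(a.lower(), b) for b in b_list] for a in a_list])
row_idx, col_idx = linear_sum_assignment(scores, maximize=True)
pairs_optimal = [(a_list[r], b_list[c]) for r, c in zip(row_idx, col_idx)]
print(*pairs_optimal, sep='\n')
Unlike the longest-first greedy pass, this guarantees the maximum total overlap even when an early greedy choice would otherwise block a better later match.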