Map (1:1) N input sentences to N given sentences by similarity
Question
I want to map N_a input sentences to N_b given sentences so that the mapping is one-to-one. That is,
- Every N_a is assigned
- No N_a appears more than once
Unfortunately the inputs vary slightly over time. Here is a representative mapping:
{ "typeA": "step1 typeA (before typeB)"
, "typeD": "type D"
, "actionA": "actionA-suffix: typeB or Type D (type E available)"
, "typeE": "typeE - (not-actionA)"
, "actionB": "actionB some descriptive words"
, "typeA subtypeA": "subtypeA typeA or typeX - (not for typeB)"
, "actionA subtypeA": "actionA-suffix: subtypeA (type E available)"
, "typeB subtypeA": "subtypeA typeB"
, "typeC subtypeA": "subtypeA typeC"
, "typeB": "typeB (not subtypeA)"
, "typeF": "get typeF or typeF-subtypeA"
, "typeF actionB": "actionB typeF or typeF subtypeA"
}
Following [1], I've created this workflow using the sentence_transformers package [2]:
- BERT -> mean-pool -> cosine similarity
Given the inputs it is clear that string-based alignment plus edit-distance (of the type featured in rapidfuzz [3]) won't work. But I'm not sure if BERT is the best approach, or if it is overkill for my needs. Perhaps I should use a word embedding model (word2vec, glove, etc.) rather than a sentence embedding model. For this particular task I wonder if BERT could be tweaked to perform better.
The categories aren't well differentiated, so BERT sometimes maps an input to more than one given. An input N_a can be the best match for multiple givens N_b0, ..., N_bi. In such cases I keep the mapping with the best score and, for the others, fall back on the 2nd-best, 3rd-best, ... match. How can I improve BERT's performance to avoid these duplicates?
Current implementation below:
import pandas as pd
import sentence_transformers as st
model = st.SentenceTransformer('all-mpnet-base-v2')
sentences_input = [ ... ]
sentences_given = [ ... ]
# Create the encodings and apply mean pooling.
enc_input = model.encode(sentences_input)
enc_given = model.encode(sentences_given)
# Calculate a cosine similarity matrix
cos_similarity = st.util.cos_sim(enc_input, enc_given)
# As a pandas dataframe, label the matrix with the sentences.
df = pd.DataFrame(columns=sentences_input, index=sentences_given, dtype=float)
for i, sgiven in enumerate(sentences_given):
    for j, sinput in enumerate(sentences_input):
        df.loc[sgiven, sinput] = cos_similarity[j, i].item()
# For each given sentence, extract the best-scoring input sentence. This is
# unfortunately not a one-to-one mapping.
mapping_bad_duplicates = df.idxmax(axis=1)
# Create a one-to-one mapping by iterating over the matches in order of best
# score. For each map, blocklist the row and column sentences, preventing
# duplicates.
mapping_good = {}
by_scores = sorted(df.unstack().items(), key=lambda k: k[1], reverse=True)
sentences = set(sentences_input) | set(sentences_given)
for (sinput, sgiven), score in by_scores:
    if not sentences:
        break
    if sgiven not in sentences or sinput not in sentences:
        continue
    mapping_good[sgiven] = sinput
    sentences.remove(sgiven)
    sentences.remove(sinput)
# Convert the result to a pandas Series
mapping_good_df = pd.Series(mapping_good)
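For reference, the greedy blocklist loop above could also be replaced by an optimal one-to-one assignment. A minimal sketch (not part of my current code, assuming scipy is available) using scipy.optimize.linear_sum_assignment on the same cosine-similarity matrix:
import numpy as np
from scipy.optimize import linear_sum_assignment
# Treat the cosine-similarity matrix as an assignment problem: pick the
# one-to-one pairing of inputs and givens that maximises total similarity.
# Rectangular matrices are fine; each row/column is used at most once.
sim = np.asarray(cos_similarity)   # convert the torch tensor to a NumPy array
row_idx, col_idx = linear_sum_assignment(sim, maximize=True)
mapping_optimal = {sentences_given[c]: sentences_input[r]
                   for r, c in zip(row_idx, col_idx)}
mapping_optimal_df = pd.Series(mapping_optimal)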
Answer 1
Score: 1
There is no need for any fancy NLP at all. This looks like a relatively straightforward assignment problem. For a string a from a_list and a string b from b_list, define score(a, b) to be the number of words that a and b have in common, and then solve the assignment problem to maximise the total score.
Below I didn't even properly solve the assignment problem, and instead returned a greedy solution, and yet it still found the best matches:
def score(a, b):
    b_words = b.replace('-', ' ').split()
    return sum((w in b_words) for w in a.split())

def greedy_matching(a_list, b_list):
    b_available = set(b_list)
    for a in sorted(a_list, key=len, reverse=True):
        b = max(b_available, key=lambda b: score(a.lower(), b))
        yield (a, b)
        b_available.remove(b)
a_list = ['typeC subtypeA', 'actionA', 'typeD', 'typeE', 'typeF', 'actionA subtypeA', 'typeB', 'typeF actionB', 'typeB subtypeA', 'actionB', 'typeA subtypeA', 'typeA']
b_list = ['typeB (not subtypeA)', 'typeE - (not-actionA)', 'actionB some descriptive words', 'get typeF or typeF-subtypeA', 'type D', 'subtypeA typeA or typeX - (not for typeB)', 'step1 typeA (before typeB)', 'subtypeA typeC', 'actionA-suffix: subtypeA (type E available)', 'subtypeA typeB', 'actionB typeF or typeF subtypeA', 'actionA-suffix: typeB or Type D (type E available)']
# Normalise the given sentences before matching: lower-case them and join
# "type X" into "typeX" so it is compared as a single word.
b_list = [b.lower().replace('type ', 'type') for b in b_list]
pairs = list(greedy_matching(a_list, b_list))
print(*pairs, sep='\n')
Output:
('actionA subtypeA', 'actiona-suffix: subtypea (typee available)')
('typeC subtypeA', 'subtypea typec')
('typeB subtypeA', 'subtypea typeb')
('typeA subtypeA', 'subtypea typea or typex - (not for typeb)')
('typeF actionB', 'actionb typef or typef subtypea')
('actionA', 'actiona-suffix: typeb or typed (typee available)')
('actionB', 'actionb some descriptive words')
('typeD', 'typed')
('typeE', 'typee - (not-actiona)')
('typeF', 'get typef or typef-subtypea')
('typeB', 'typeb (not subtypea)')
('typeA', 'step1 typea (before typeb)')
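For completeness, the assignment problem can also be solved exactly rather than greedily; a minimal sketch (assuming scipy is installed) that reuses score, a_list and b_list from above:
import numpy as np
from scipy.optimize import linear_sum_assignment
# Build the full score matrix (rows: a_list, columns: b_list) and let the
# Hungarian algorithm pick the one-to-one pairing with the maximum total score.
scores = np.array([[score(a.lower(), b) for b in b_list] for a in a_list])
row_idx, col_idx = linear_sum_assignment(scores, maximize=True)
pairs_optimal = [(a_list[r], b_list[c]) for r, c in zip(row_idx, col_idx)]
print(*pairs_optimal, sep='\n')
Unlike the longest-first greedy pass, this guarantees the maximum total overlap even when an early greedy choice would otherwise block a better later match.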