2023年6月15日 20:17:43go评论92阅读模式

英文:

Optimize the code to sum scores for the tokens in noun phrase list

问题

要求：

以下代码在时间复杂度上看起来很复杂，我正在寻找优化以下代码的解决方案。

以下代码的目标是对在名词短语列表（nounphrase_list）中的标记求和，同时也对所有标记（all_tokens）求和（下面有详细解释）。

full_string = ["Mitchell is using machine learning and software engineering on his mobile phone to generate songs. On top of machine learning he is also using deep learning"]
# 以下标记列表是来自上述 full_string 的单词
all_tokens = ["Mitchell", "is", "using", "machine", "learning",
              "and", "software", "engineering", "on", "his",
              "mobile", "phone", "to", "generate", "songs",
              "On", "top", "of", "machine", "learning",
              "he", "is", "also", "using", "deep", "learning"]
all_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
# 这些名词短语从 full_string 中提取
nounphrase_list = ["Mitchell", "machine learning", "software engineering", "machine learning", "deep learning"]
updated_scores = all_scores.copy()
seen_indices = set()
for np_item in nounphrase_list:
    np_words = np_item.split()
    indices = []
    for word in np_words:
        for idx, token in enumerate(all_tokens):
            if word == token:
                if idx not in seen_indices:
                    indices.append(idx)
                    seen_indices.add(idx)
                    break
    sum_score = sum(all_scores[idx] for idx in indices)
    for idx in indices:
        updated_scores[idx] = sum_score

解释：

Mitchell 在名词短语列表（nounphrase_list）中，并且 Mitchell 的分数是 1，因此更新后的分数也是 1。
名词短语列表中下一个短语是 machine learning 的第一个出现，分数为 [4, 5]，因此 machine 的更新分数是 9，learning 的更新分数也是 9。
名词短语列表中下一个短语是 software engineering 的第一个出现，分数为 [7, 8]，因此 software 的更新分数是 15，learning 的更新分数也是 15。
名词短语列表中下一个短语是 machine learning 的第一个出现，分数为 [19, 20]，因此 machine 的更新分数是 39，learning 的更新分数也是 39。
名词短语列表中下一个短语是 deep learning 的第一个出现，分数为 [25, 26]，因此 machine 的更新分数是 51，learning 的更新分数也是 51。

注意：

单词 machine 和 learning 应以相同的顺序出现。
单词 software 和 engineering 应以相同的顺序出现。
单词 machine 和 learning 应以相同的顺序出现。
单词 deep 和 learning 应以相同的顺序出现。

英文:

Requirement

The below code looks complicated in time complexity, I am looking for solutions to optimize the below code.

Aim of the below code is to sum up the scores of tokens if the token is in nounphrase_list and also all_tokens (detailed explanation below)

full_string = [&quot;Mitchell is using machine learning and software engineering on his mobile phone to generate songs . On top of machine learning he is also using deep learning&quot;
# the token list are words from above full_string
all_tokens = [&quot;Mitchell&quot;, &quot;is&quot;, &quot;using&quot;, &quot;machine&quot;, &quot;learning&quot;, 
              &quot;and&quot;, &quot;software&quot;, &quot;engineering&quot;, &quot;on&quot;, &quot;his&quot;, 
              &quot;mobile&quot;, &quot;phone&quot;, &quot;to&quot;, &quot;generate&quot;, &quot;songs&quot;, 
              &quot;On&quot;, &quot;top&quot;, &quot;of&quot;, &quot;machine&quot;, &quot;learning&quot;, 
              &quot;he&quot;, &quot;is&quot;, &quot;also&quot;, &quot;using&quot;, &quot;deep&quot;, &quot;learning&quot;]
all_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
# These noun phrases extracted from the full_string
nounphrase_list = [&quot;Mitchell&quot;, &quot;machine learning&quot;, &quot;software engineering&quot;, &quot;machine learning&quot;, &quot;deep learning&quot;]
updated_scores = all_scores.copy()
seen_indices = set()
for np_item in nounphrase_list:
    np_words = np_item.split()
    indices = []
    for word in np_words:
        for idx, token in enumerate(all_tokens):
            if word == token:
                if idx not in seen_indices:
                    indices.append(idx)
                    seen_indices.add(idx)
                    break
    sum_score = sum(all_scores[idx] for idx in indices)
    for idx in indices:
        updated_scores[idx] = sum_score

Explanation:

Mitchell is nounphrase_list and the score of Mitchell is 1, Hence the updated score is also 1
The next phrase in noun phrase list is the first occurrence of machine learning has scores [4, 5], Hence the updated scores for machine is 9 and learning is 9
The next phrase in noun phrase list is the first occurrence of software engineering has scores [7, 8], Hence the updated scores for software is 15 and learning is 15
The next phrase in noun phrase list is the first occurrence of machine learning has scores [19, 20], Hence the updated scores for machine is 39 and learning is 39
The next phrase in noun phrase list is the first occurrence of deep learning has scores [25, 26], Hence the updated scores for machine is 51 and learning is 51

Note:

The words machine & learning should occur together in same order
The words software & engineering should occur together in same order
The words machine & learning should occur together in same order
The words deep & learning should occur together in same order

答案1

得分: 1

以下是已经翻译好的部分：

所以，你的问题有点复杂，但也许我们应该以不同的方式来看待它。

单词的“名词短语”对于这个问题来说基本上是红鱼（指引人入岐途的信息）。因为它们的分数不会改变，所以我们不需要担心它们。

所以，我们只需要看两个词的名词短语：

for i, word in enumerate(all_tokens[:-1]): # 循环遍历除了最后一个单词以外的所有单词（记住单个单词是红鱼！）
    if (value := word + " " + all_tokens[i+1]) in nounphrase_list: # 使用海象运算符从列表中移除，以避免意外的重复。
        nounphrase_list.remove(value)
        total = all_scores[i] + all_scores[i+1] # 组合分数
        all_scores[i] = all_scores[i+1] = total # 设置分数

结果：

[1, 2, 3, 9, 9, 6, 15, 15, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 39, 39, 21, 22, 23, 24, 51, 51]

时间复杂度

我不是最擅长计算这个，但我可以以老式的方式计算运行时间。

你的代码（运行了10,000次）：.2439秒。

我的代码（运行了10,000次）：.1208秒。

你的代码（运行了100,000次）：3.352秒。

我的代码（运行了100,000次）：1.294秒。

你的代码（运行了1,000,000次）：26.1021秒。

我的代码（运行了1,000,000次）：12.4637秒。

注意：我使用了timeit模块。显然，这只适用于两个词的“名词短语”。要修改为更长的短语很容易，只需添加一个elif语句。

希望这对你有所帮助！

英文:

So your question is somewhat complicated, but perhaps we should look at it a different way.

Single-word 'noun phrases' are essentially red herrings for this problem. Since their scores don't change, we do not need to worry about them.

So, we only need to look at the two-word noun phrases:

for i, word in enumerate(all_tokens[:-1]): #loop through all but last word (remember single-words are red herrings!)
    if (value := word + &quot; &quot; + all_tokens[i+1]) in nounphrase_list: #use walrus operator to remove from the list so no accidental duplicates.
        nounphrase_list.remove(value)
        total = all_scores[i] + all_scores[i+1] #combined score
        all_scores[i] = all_scores[i+1] = total #set score

Result:

[1, 2, 3, 9, 9, 6, 15, 15, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 39, 39, 21, 22, 23, 24, 51, 51]

Time Complexity

Not the best at calculating this, but I can calculate runtime the old-fashioned way.

Your code (ran 10,000 times): .2439 seconds.

My code (ran 10,000 times): .1208 seconds.

Your code (ran 100,000 times): 3.352 seconds.

My code (ran 100,000 times): 1.294 seconds.

Your code (ran 1,000,000 times): 26.1021 seconds

My code (ran 1,000,000 times): 12.4637 seconds.

Note: I used the timeit module. Also clearly only works for two-word 'noun phrases'. It's easy to modify for longer phrases, just add an elif statement.

Hope this helps!

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

优化代码以对名词短语列表中的标记进行求和。

问题

答案1

结果：

时间复杂度

Result:

Time Complexity

在Jupyter笔记本中突出显示两个字符串之间的差异 – Python

Django returns TemplateDoesNotExist for namespaced app templates which are rendered by class views but works for plain function views

设置使用Python 3.9 + Poetry配置Docker时，出现’不包含任何元素’错误。

why collide_mask is significantly slower than collide_rect in pygame?

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。