优化代码以对名词短语列表中的标记进行求和。

huangapple go评论62阅读模式
英文:

Optimize the code to sum scores for the tokens in noun phrase list

问题

要求:

  • 以下代码在时间复杂度上看起来很复杂,我正在寻找优化以下代码的解决方案。

以下代码的目标是对在名词短语列表(nounphrase_list)中的标记求和,同时也对所有标记(all_tokens)求和(下面有详细解释)。

full_string = ["Mitchell is using machine learning and software engineering on his mobile phone to generate songs. On top of machine learning he is also using deep learning"]

# 以下标记列表是来自上述 full_string 的单词
all_tokens = ["Mitchell", "is", "using", "machine", "learning",
              "and", "software", "engineering", "on", "his",
              "mobile", "phone", "to", "generate", "songs",
              "On", "top", "of", "machine", "learning",
              "he", "is", "also", "using", "deep", "learning"]

all_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]

# 这些名词短语从 full_string 中提取
nounphrase_list = ["Mitchell", "machine learning", "software engineering", "machine learning", "deep learning"]

updated_scores = all_scores.copy()

seen_indices = set()
for np_item in nounphrase_list:
    np_words = np_item.split()
    indices = []
    for word in np_words:
        for idx, token in enumerate(all_tokens):
            if word == token:
                if idx not in seen_indices:
                    indices.append(idx)
                    seen_indices.add(idx)
                    break
    sum_score = sum(all_scores[idx] for idx in indices)
    for idx in indices:
        updated_scores[idx] = sum_score

解释

  • Mitchell 在名词短语列表(nounphrase_list)中,并且 Mitchell 的分数是 1,因此更新后的分数也是 1。
  • 名词短语列表中下一个短语是 machine learning 的第一个出现,分数为 [4, 5],因此 machine 的更新分数是 9,learning 的更新分数也是 9。
  • 名词短语列表中下一个短语是 software engineering 的第一个出现,分数为 [7, 8],因此 software 的更新分数是 15,learning 的更新分数也是 15。
  • 名词短语列表中下一个短语是 machine learning 的第一个出现,分数为 [19, 20],因此 machine 的更新分数是 39,learning 的更新分数也是 39。
  • 名词短语列表中下一个短语是 deep learning 的第一个出现,分数为 [25, 26],因此 machine 的更新分数是 51,learning 的更新分数也是 51。

注意

  • 单词 machinelearning 应以相同的顺序出现。
  • 单词 softwareengineering 应以相同的顺序出现。
  • 单词 machinelearning 应以相同的顺序出现。
  • 单词 deeplearning 应以相同的顺序出现。
英文:

Requirement

  • The below code looks complicated in time complexity, I am looking for solutions to optimize the below code.

Aim of the below code is to sum up the scores of tokens if the token is in nounphrase_list and also all_tokens (detailed explanation below)

full_string = ["Mitchell is using machine learning and software engineering on his mobile phone to generate songs . On top of machine learning he is also using deep learning"

# the token list are words from above full_string
all_tokens = ["Mitchell", "is", "using", "machine", "learning", 
              "and", "software", "engineering", "on", "his", 
              "mobile", "phone", "to", "generate", "songs", 
              "On", "top", "of", "machine", "learning", 
              "he", "is", "also", "using", "deep", "learning"]

all_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]

# These noun phrases extracted from the full_string
nounphrase_list = ["Mitchell", "machine learning", "software engineering", "machine learning", "deep learning"]

updated_scores = all_scores.copy()

seen_indices = set()
for np_item in nounphrase_list:
    np_words = np_item.split()
    indices = []
    for word in np_words:
        for idx, token in enumerate(all_tokens):
            if word == token:
                if idx not in seen_indices:
                    indices.append(idx)
                    seen_indices.add(idx)
                    break
    sum_score = sum(all_scores[idx] for idx in indices)
    for idx in indices:
        updated_scores[idx] = sum_score

Explanation:

  • Mitchell is nounphrase_list and the score of Mitchell is 1, Hence the updated score is also 1
  • The next phrase in noun phrase list is the first occurrence of machine learning has scores [4, 5], Hence the updated scores for machine is 9 and learning is 9
  • The next phrase in noun phrase list is the first occurrence of software engineering has scores [7, 8], Hence the updated scores for software is 15 and learning is 15
  • The next phrase in noun phrase list is the first occurrence of machine learning has scores [19, 20], Hence the updated scores for machine is 39 and learning is 39
  • The next phrase in noun phrase list is the first occurrence of deep learning has scores [25, 26], Hence the updated scores for machine is 51 and learning is 51

Note:

  • The words machine & learning should occur together in same order
  • The words software & engineering should occur together in same order
  • The words machine & learning should occur together in same order
  • The words deep & learning should occur together in same order

答案1

得分: 1

以下是已经翻译好的部分:


所以,你的问题有点复杂,但也许我们应该以不同的方式来看待它。

单词的“名词短语”对于这个问题来说基本上是红鱼(指引人入岐途的信息)。因为它们的分数不会改变,所以我们不需要担心它们。

所以,我们只需要看两个词的名词短语:

for i, word in enumerate(all_tokens[:-1]): # 循环遍历除了最后一个单词以外的所有单词(记住单个单词是红鱼!)
    if (value := word + " " + all_tokens[i+1]) in nounphrase_list: # 使用海象运算符从列表中移除,以避免意外的重复。
        nounphrase_list.remove(value)
        total = all_scores[i] + all_scores[i+1] # 组合分数
        all_scores[i] = all_scores[i+1] = total # 设置分数

结果:

[1, 2, 3, 9, 9, 6, 15, 15, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 39, 39, 21, 22, 23, 24, 51, 51]


时间复杂度

我不是最擅长计算这个,但我可以以老式的方式计算运行时间。

你的代码(运行了10,000次):.2439秒。

我的代码(运行了10,000次):.1208秒。


你的代码(运行了100,000次):3.352秒。

我的代码(运行了100,000次):1.294秒。


你的代码(运行了1,000,000次):26.1021秒。

我的代码(运行了1,000,000次):12.4637秒。


注意:我使用了timeit模块。显然,这只适用于两个词的“名词短语”。要修改为更长的短语很容易,只需添加一个elif语句。


希望这对你有所帮助!

英文:

So your question is somewhat complicated, but perhaps we should look at it a different way.

Single-word 'noun phrases' are essentially red herrings for this problem. Since their scores don't change, we do not need to worry about them.

So, we only need to look at the two-word noun phrases:

for i, word in enumerate(all_tokens[:-1]): #loop through all but last word (remember single-words are red herrings!)
    if (value := word + " " + all_tokens[i+1]) in nounphrase_list: #use walrus operator to remove from the list so no accidental duplicates.
        nounphrase_list.remove(value)
        total = all_scores[i] + all_scores[i+1] #combined score
        all_scores[i] = all_scores[i+1] = total #set score

Result:

[1, 2, 3, 9, 9, 6, 15, 15, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 39, 39, 21, 22, 23, 24, 51, 51]


Time Complexity

Not the best at calculating this, but I can calculate runtime the old-fashioned way.

Your code (ran 10,000 times): .2439 seconds.

My code (ran 10,000 times): .1208 seconds.


Your code (ran 100,000 times): 3.352 seconds.

My code (ran 100,000 times): 1.294 seconds.


Your code (ran 1,000,000 times): 26.1021 seconds

My code (ran 1,000,000 times): 12.4637 seconds.


Note: I used the timeit module. Also clearly only works for two-word 'noun phrases'. It's easy to modify for longer phrases, just add an elif statement.


Hope this helps!

huangapple
  • 本文由 发表于 2023年6月15日 20:17:43
  • 转载请务必保留本文链接:https://go.coder-hub.com/76482381.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定