英文:
Optimize the code to sum scores for the tokens in noun phrase list
问题
要求:
- 以下代码在时间复杂度上看起来很复杂,我正在寻找优化以下代码的解决方案。
以下代码的目标是对在名词短语列表(nounphrase_list)中的标记求和,同时也对所有标记(all_tokens)求和(下面有详细解释)。
full_string = ["Mitchell is using machine learning and software engineering on his mobile phone to generate songs. On top of machine learning he is also using deep learning"]
# 以下标记列表是来自上述 full_string 的单词
all_tokens = ["Mitchell", "is", "using", "machine", "learning",
"and", "software", "engineering", "on", "his",
"mobile", "phone", "to", "generate", "songs",
"On", "top", "of", "machine", "learning",
"he", "is", "also", "using", "deep", "learning"]
all_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
# 这些名词短语从 full_string 中提取
nounphrase_list = ["Mitchell", "machine learning", "software engineering", "machine learning", "deep learning"]
updated_scores = all_scores.copy()
seen_indices = set()
for np_item in nounphrase_list:
np_words = np_item.split()
indices = []
for word in np_words:
for idx, token in enumerate(all_tokens):
if word == token:
if idx not in seen_indices:
indices.append(idx)
seen_indices.add(idx)
break
sum_score = sum(all_scores[idx] for idx in indices)
for idx in indices:
updated_scores[idx] = sum_score
解释:
Mitchell
在名词短语列表(nounphrase_list)中,并且 Mitchell 的分数是 1,因此更新后的分数也是 1。- 名词短语列表中下一个短语是
machine learning
的第一个出现,分数为[4, 5]
,因此machine
的更新分数是 9,learning
的更新分数也是 9。 - 名词短语列表中下一个短语是
software engineering
的第一个出现,分数为[7, 8]
,因此software
的更新分数是 15,learning
的更新分数也是 15。 - 名词短语列表中下一个短语是
machine learning
的第一个出现,分数为[19, 20]
,因此machine
的更新分数是 39,learning
的更新分数也是 39。 - 名词短语列表中下一个短语是
deep learning
的第一个出现,分数为[25, 26]
,因此machine
的更新分数是 51,learning
的更新分数也是 51。
注意:
- 单词
machine
和learning
应以相同的顺序出现。 - 单词
software
和engineering
应以相同的顺序出现。 - 单词
machine
和learning
应以相同的顺序出现。 - 单词
deep
和learning
应以相同的顺序出现。
英文:
Requirement
- The below code looks complicated in time complexity, I am looking for solutions to optimize the below code.
Aim of the below code is to sum up the scores of tokens if the token is in nounphrase_list and also all_tokens (detailed explanation below)
full_string = ["Mitchell is using machine learning and software engineering on his mobile phone to generate songs . On top of machine learning he is also using deep learning"
# the token list are words from above full_string
all_tokens = ["Mitchell", "is", "using", "machine", "learning",
"and", "software", "engineering", "on", "his",
"mobile", "phone", "to", "generate", "songs",
"On", "top", "of", "machine", "learning",
"he", "is", "also", "using", "deep", "learning"]
all_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
# These noun phrases extracted from the full_string
nounphrase_list = ["Mitchell", "machine learning", "software engineering", "machine learning", "deep learning"]
updated_scores = all_scores.copy()
seen_indices = set()
for np_item in nounphrase_list:
np_words = np_item.split()
indices = []
for word in np_words:
for idx, token in enumerate(all_tokens):
if word == token:
if idx not in seen_indices:
indices.append(idx)
seen_indices.add(idx)
break
sum_score = sum(all_scores[idx] for idx in indices)
for idx in indices:
updated_scores[idx] = sum_score
Explanation:
Mitchell
is nounphrase_list and the score of Mitchell is 1, Hence the updated score is also 1- The next phrase in noun phrase list is the first occurrence of
machine learning
has scores[4, 5]
, Hence the updated scores formachine
is 9 andlearning
is 9 - The next phrase in noun phrase list is the first occurrence of
software engineering
has scores[7, 8]
, Hence the updated scores forsoftware
is 15 andlearning
is 15 - The next phrase in noun phrase list is the first occurrence of
machine learning
has scores[19, 20]
, Hence the updated scores formachine
is 39 andlearning
is 39 - The next phrase in noun phrase list is the first occurrence of
deep learning
has scores[25, 26]
, Hence the updated scores formachine
is 51 andlearning
is 51
Note:
- The words
machine
&learning
should occur together in same order - The words
software
&engineering
should occur together in same order - The words
machine
&learning
should occur together in same order - The words
deep
&learning
should occur together in same order
答案1
得分: 1
以下是已经翻译好的部分:
所以,你的问题有点复杂,但也许我们应该以不同的方式来看待它。
单词的“名词短语”对于这个问题来说基本上是红鱼(指引人入岐途的信息)。因为它们的分数不会改变,所以我们不需要担心它们。
所以,我们只需要看两个词的名词短语:
for i, word in enumerate(all_tokens[:-1]): # 循环遍历除了最后一个单词以外的所有单词(记住单个单词是红鱼!)
if (value := word + " " + all_tokens[i+1]) in nounphrase_list: # 使用海象运算符从列表中移除,以避免意外的重复。
nounphrase_list.remove(value)
total = all_scores[i] + all_scores[i+1] # 组合分数
all_scores[i] = all_scores[i+1] = total # 设置分数
结果:
[1, 2, 3, 9, 9, 6, 15, 15, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 39, 39, 21, 22, 23, 24, 51, 51]
时间复杂度
我不是最擅长计算这个,但我可以以老式的方式计算运行时间。
你的代码(运行了10,000
次):.2439
秒。
我的代码(运行了10,000
次):.1208
秒。
你的代码(运行了100,000
次):3.352
秒。
我的代码(运行了100,000
次):1.294
秒。
你的代码(运行了1,000,000
次):26.1021
秒。
我的代码(运行了1,000,000
次):12.4637
秒。
注意:我使用了timeit
模块。显然,这只适用于两个词的“名词短语”。要修改为更长的短语很容易,只需添加一个elif
语句。
希望这对你有所帮助!
英文:
So your question is somewhat complicated, but perhaps we should look at it a different way.
Single-word 'noun phrases' are essentially red herrings for this problem. Since their scores don't change, we do not need to worry about them.
So, we only need to look at the two-word noun phrases:
for i, word in enumerate(all_tokens[:-1]): #loop through all but last word (remember single-words are red herrings!)
if (value := word + " " + all_tokens[i+1]) in nounphrase_list: #use walrus operator to remove from the list so no accidental duplicates.
nounphrase_list.remove(value)
total = all_scores[i] + all_scores[i+1] #combined score
all_scores[i] = all_scores[i+1] = total #set score
Result:
[1, 2, 3, 9, 9, 6, 15, 15, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 39, 39, 21, 22, 23, 24, 51, 51]
Time Complexity
Not the best at calculating this, but I can calculate runtime the old-fashioned way.
Your code (ran 10,000
times): .2439
seconds.
My code (ran 10,000
times): .1208
seconds.
Your code (ran 100,000
times): 3.352
seconds.
My code (ran 100,000
times): 1.294
seconds.
Your code (ran 1,000,000
times): 26.1021
seconds
My code (ran 1,000,000
times): 12.4637
seconds.
Note: I used the timeit
module. Also clearly only works for two-word 'noun phrases'. It's easy to modify for longer phrases, just add an elif
statement.
Hope this helps!
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论