Python/Bash 'MemoryError': how can I make my script more efficient?


Question


I have the following script, which I wrote to calculate statistics about the textual corpora I analyze from a linguistics angle. However, the text files I analyze are relatively big for such processing (~3 GB, ~500M words), which is probably what makes my script inefficient given my current hardware (i5, 16 GB RAM). I get the 'MemoryError' when I launch the script through the terminal, so I must admit I am unsure whether this is a Python or a Bash error message, although I reckon the implications are the same; correct me if I'm wrong.

I am not a computer scientist, so it is very likely that the tools I use are not the best adapted/most efficient for the task. Would anyone have any recommendations to improve the script and make it able to handle such volumes of data? Please keep in mind that my tech/programming knowledge is relatively limited, as I am a linguist before all else, so it would be awesome if you could explain the technical stuff with that in mind.

Thanks a lot in advance!

EDIT: here is the error message I get, as requested by some of you:

"Traceback (most recent call last):
File "/path/to/my/myscript.py", line 43, in
keywords, target_norm, reference_norm, smp_score = calculate_keywords('file1.txt', 'file2.txt')
File "/path/to/my/myscript.py", line 9, in calculate_keywords
target_text = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]'))
MemoryError

#!/usr/bin/env python3

import collections
import math
import string

def calculate_keywords(target, reference):
    with open(target, 'r') as f:
        target_text = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]'))
        target_words = target_text.split()

    with open(reference, 'r') as f:
        reference_text = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]'))
        reference_words = reference_text.split()

    target_freq = collections.Counter(target_words)
    reference_freq = collections.Counter(reference_words)

    target_total = sum(target_freq.values())
    reference_total = sum(reference_freq.values())

    target_norm = {}
    reference_norm = {}

    for word, freq in target_freq.items():
        target_norm[word] = freq / target_total * 1000000

    for word, freq in reference_freq.items():
        reference_norm[word] = freq / reference_total * 1000000

    smp_scores = {}
    for word, freq in target_norm.items():
        if word not in reference_norm:
            reference_norm[word] = 0
        s1 = freq + 100
        s2 = reference_norm[word] + 100
        smp_scores[word] = s1 / s2

    keywords = sorted(smp_scores, key=smp_scores.get, reverse=True)[:50]
    return keywords, target_norm, reference_norm, smp_scores


keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
for word in keywords:
    print(f"{word} {target_norm[word]} {reference_norm[word]} {smp_score[word]}")

Answer 1

Score: 2

You might be able to reduce memory consumption by deleting target_words and reference_words after you've built the Counter objects (e.g., del target_words). These objects remain in scope and therefore cannot be garbage collected until calculate_keywords() has terminated.
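As an illustration, here is a minimal sketch of that del variant, keeping only the counting part of the original function (the helper name count_both is just illustrative):

import collections

def count_both(target, reference):
    with open(target) as f:
        target_words = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]')).split()
    target_freq = collections.Counter(target_words)
    # The huge word list is no longer needed once the Counter exists;
    # `del` lets Python reclaim that memory before the next file is read.
    del target_words

    with open(reference) as f:
        reference_words = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]')).split()
    reference_freq = collections.Counter(reference_words)
    del reference_words

    # The normalization and scoring steps would follow here, unchanged.
    return target_freq, reference_freq

You could also achieve this without explicit use of del by writing discrete functions to handle some of the processing: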

import collections

def get_counter(filename):
    # Read the whole file, lowercase it, strip punctuation, and count the words.
    with open(filename) as f:
        words = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]')).split()
        return collections.Counter(words)

def get_norm(filename):
    # Convert raw counts to frequencies per million words.
    c = get_counter(filename)
    total = sum(c.values())
    return {word: freq / total * 1_000_000 for word, freq in c.items()}

def calculate_keywords(target, reference):
    target_norm = get_norm(target)
    reference_norm = get_norm(reference)

    smp_scores = {}

    for word, freq in target_norm.items():
        if word not in reference_norm:
            reference_norm[word] = 0
        s1 = freq + 100
        s2 = reference_norm[word] + 100
        smp_scores[word] = s1 / s2

    keywords = sorted(smp_scores, key=smp_scores.get, reverse=True)[:50]
    return keywords, target_norm, reference_norm, smp_scores


keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
for word in keywords:
    print(f"{word} {target_norm[word]} {reference_norm[word]} {smp_score[word]}")

This will potentially improve matters, as the memory used in get_counter() and get_norm() goes out of scope when each function returns and can therefore be released (garbage collected).
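Note, though, that get_counter() still loads each file into memory in one go via f.read(), so peak memory use remains proportional to the file size; the line-by-line approach in the next answer avoids that.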


Answer 2

Score: 0


Here is a working solution that's also very fast:

#!/usr/bin/env python3

import collections

# Build the punctuation-stripping table once, rather than on every line.
PUNCTUATION_TABLE = str.maketrans('', '', '?!"():;.,“/[]')

def words_in_line(line):
    return line.lower().translate(PUNCTUATION_TABLE).split()

def get_counter(filename):
    # Stream the file line by line so the full text is never held in memory.
    counter = collections.Counter()
    with open(filename) as file:
        for line in file:
            counter.update(words_in_line(line))
    return counter

def get_norm(filename):
    c = get_counter(filename)
    total = sum(c.values())
    return {word: freq / total * 1_000_000 for word, freq in c.items()}

def calculate_keywords(target, reference):
    target_norm = get_norm(target)
    reference_norm = get_norm(reference)

    smp_scores = {}

    for word, freq in target_norm.items():
        if word not in reference_norm:
            reference_norm[word] = 0
        s1 = freq + 100
        s2 = reference_norm[word] + 100
        smp_scores[word] = s1 / s2

    keywords = sorted(smp_scores, key=smp_scores.get, reverse=True)[:50]
    return keywords, target_norm, reference_norm, smp_scores


keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
for word in keywords:
    print(f"{word} {target_norm[word]} {reference_norm[word]} {smp_score[word]}")


