Python/Bash 'MemoryError': how can I make my script more efficient?

Question

I have the following script, which I wrote to calculate statistics about the textual corpora I analyze from a linguistics angle. However, the text files I analyze are relatively big for such processing (~3 GB, ~500M words), which is probably what makes my script inefficient given my current hardware (i5, 16 GB RAM). I get the 'MemoryError' when I launch the script from the terminal, so I must admit that I am unsure whether this is a Python or a Bash error message, although I reckon the implications are the same; correct me if I'm wrong.

I am not a computer scientist, so it is very likely that the tools I use are not the best adapted or most efficient for the task. Would anyone have any recommendations for improving the script and making it able to handle such volumes of data? Please keep in mind that my tech/programming knowledge is relatively limited, as I am a linguist before anything else, so if you could explain the technical details with that in mind, that would be awesome.

Thanks a lot in advance!

EDIT: here is the error message I get, as requested by some of you:

"Traceback (most recent call last):
File "/path/to/my/myscript.py", line 43, in <module>
keywords, target_norm, reference_norm, smp_score = calculate_keywords('file1.txt', 'file2.txt')
File "/path/to/my/myscript.py", line 9, in calculate_keywords
target_text = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]'))
MemoryError

#!/usr/bin/env python3

import collections
import math
import string

def calculate_keywords(target, reference):
    with open(target, 'r') as f:
        target_text = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]'))
        target_words = target_text.split()

    with open(reference, 'r') as f:
        reference_text = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]'))
        reference_words = reference_text.split()

    target_freq = collections.Counter(target_words)
    reference_freq = collections.Counter(reference_words)

    target_total = sum(target_freq.values())
    reference_total = sum(reference_freq.values())
    
    target_norm = {}
    reference_norm = {}

    for word, freq in target_freq.items():
        target_norm[word] = freq / target_total * 1000000

    for word, freq in reference_freq.items():
        reference_norm[word] = freq / reference_total * 1000000

    smp_scores = {}
    for word, freq in target_norm.items():
        if word not in reference_norm:
            reference_norm[word] = 0
        s1 = freq + 100
        s2 = reference_norm[word] + 100
        smp_scores[word] = s1 / s2

    keywords = sorted(smp_scores, key=smp_scores.get, reverse=True)[:50]
    return keywords, target_norm, reference_norm, smp_scores
    

keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
for word in keywords:
    print(f"{word} {target_norm[word]} {reference_norm[word]} {smp_score[word]}")

Answer 1

Score: 2

You might be able to reduce memory consumption by deleting target_words and reference_words after you've built the Counter objects (e.g., del target_words). These objects remain in scope and therefore cannot be garbage collected until calculate_keywords() has terminated. You could also achieve this without explicit use of del by writing discrete functions to handle some of the processing, as shown in the second snippet below. First, a minimal sketch of the del variant (an illustration, not part of the original answer):
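
import collections

def calculate_keywords(target, reference):
    with open(target) as f:
        target_words = f.read().lower().translate(str.maketrans('', '', '?!"():;.,“/[]')).split()
    target_freq = collections.Counter(target_words)
    del target_words  # drop the huge word list as soon as the Counter exists

    with open(reference) as f:
        reference_words = f.read().lower().translate(str.maketrans('', '', '?!"():;.,“/[]')).split()
    reference_freq = collections.Counter(reference_words)
    del reference_words  # likewise for the reference corpus

    # ... the rest of the function is unchanged ...

And the discrete-function version: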

import collections

def get_counter(filename):
    with open(filename) as f:
        words = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]')).split()
        return collections.Counter(words)

def get_norm(filename):
    c = get_counter(filename)
    total = sum(c.values())
    return {word: freq / total * 1_000_000 for word, freq in c.items()}

def calculate_keywords(target, reference):
    target_norm = get_norm(target)
    reference_norm = get_norm(reference)

    smp_scores = {}

    for word, freq in target_norm.items():
        if word not in reference_norm:
            reference_norm[word] = 0
        s1 = freq + 100
        s2 = reference_norm[word] + 100
        smp_scores[word] = s1 / s2

    keywords = sorted(smp_scores, key=smp_scores.get, reverse=True)[:50]
    return keywords, target_norm, reference_norm, smp_scores
    

keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
for word in keywords:
    print(f"{word} {target_norm[word]} {reference_norm[word]} {smp_score[word]}")

This will potentially improve matters, as the memory used in get_counter() and get_norm() goes out of scope and can therefore be released (garbage collected).
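
To check whether the change actually helps, the standard library's tracemalloc module can report the interpreter's peak allocation. A minimal sketch (not part of the original answer), assuming the calculate_keywords() and placeholder file names from above:

import tracemalloc

tracemalloc.start()
keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
current, peak = tracemalloc.get_traced_memory()  # bytes currently allocated / peak bytes
print(f"current: {current / 1_000_000:.1f} MB, peak: {peak / 1_000_000:.1f} MB")
tracemalloc.stop()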

Answer 2

Score: 0

Here is a working solution that's also very fast:

#!/usr/bin/env python3

import collections

def words_in_line(line):
    return line.lower().translate(str.maketrans('','','?!"():;.,“/[]')).split()

def get_counter(filename):
    counter = collections.Counter()
    with open(filename) as file:
        for line in file:
            counter.update(words_in_line(line))
    return counter

def get_norm(filename):
    c = get_counter(filename)
    total = sum(c.values())
    return {word: freq / total * 1_000_000 for word, freq in c.items()}

def calculate_keywords(target, reference):
    target_norm = get_norm(target)
    reference_norm = get_norm(reference)

    smp_scores = {}

    for word, freq in target_norm.items():
        if word not in reference_norm:
            reference_norm[word] = 0
        s1 = freq + 100
        s2 = reference_norm[word] + 100
        smp_scores[word] = s1 / s2

    keywords = sorted(smp_scores, key=smp_scores.get, reverse=True)[:50]
    return keywords, target_norm, reference_norm, smp_scores
    

keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
for word in keywords:
    print(f"{word} {target_norm[word]} {reference_norm[word]} {smp_score[word]}")
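
One further refinement, not part of the original answer: str.maketrans builds a new translation table on every call, and words_in_line calls it once per line. For a multi-gigabyte corpus it is worth building the table once at module level; a minimal sketch:

# Build the punctuation-stripping table once instead of once per line.
PUNCT_TABLE = str.maketrans('', '', '?!"():;.,“/[]')

def words_in_line(line):
    return line.lower().translate(PUNCT_TABLE).split()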
