Optimizing Memory Usage in Python for Large Dataset Processing

Question

I'm working on a data processing project in Python where I need to handle a large dataset containing millions of records. I've noticed that my program's memory usage keeps increasing as I process the data, and eventually it leads to a MemoryError. I've tried using generators and iterating over the data in chunks, but that doesn't seem to solve the problem entirely.

I suspect that there might be some memory overhead from the libraries I'm using or the way I'm storing intermediate results. I want to know if there are any best practices or techniques to optimize memory usage in Python for processing large datasets.

Here's a simplified version of my code:

# Example: Processing a Large Dataset
def process_data():
    # Assuming data_source is a generator or an iterator
    data_source = get_large_dataset()  # Some function that provides the data source

    # Initializing empty lists to store intermediate results
    intermediate_results = []

    for data in data_source:
        # Some processing on the data
        result = perform_computation(data)

        # Storing intermediate results in a list
        intermediate_results.append(result)

    # Further processing on intermediate results
    final_result = aggregate_results(intermediate_results)

    return final_result

def get_large_dataset():
    # In a real scenario, this function would fetch data from a file, database, or other sources.
    # For this example, we'll generate sample data.
    num_records = 1000000  # One million records
    for i in range(num_records):
        yield i

def perform_computation(data):
    # Some computation on each data point
    result = data * 2  # For example purposes, let's just multiply the data by 2
    return result

def aggregate_results(results):
    # Some aggregation function to process intermediate results
    return sum(results)

if __name__ == "__main__":
    final_result = process_data()
    print("Final Result:", final_result)

I'd appreciate any insights, tips, or code examples that can help me efficiently handle large datasets without running into memory issues. Thank you in advance!

Answer 1

Score: 1

The only thing in that code that consumes significant memory is the list of intermediate results. You could make that a generator.

Change

    intermediate_results = []
    for data in data_source:
        result = perform_computation(data)
        intermediate_results.append(result)
    final_result = aggregate_results(intermediate_results)

to:

    def intermediate_results():
        for data in data_source:
            result = perform_computation(data)
            yield result
    final_result = aggregate_results(intermediate_results())

Or if you actually have such functions that do all the work, you could just use map:

    def process_data():
        return aggregate_results(map(perform_computation, get_large_dataset()))
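
For the same lazy evaluation without defining a named helper, a generator expression also works; this is just a minimal sketch that reuses the question's own function names:

    def process_data():
        # The generator expression yields one computed value at a time, so no
        # list of intermediate results is ever built up in memory.
        return aggregate_results(
            perform_computation(data) for data in get_large_dataset()
        )

Because aggregate_results only needs a single pass over its argument (it is just sum in the question's example), nothing beyond the running total is held in memory.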

Answer 2

Score: 0

Use generators and iterators to process data in chunks.
Avoid unnecessary intermediate results to minimize memory consumption.
Choose memory-efficient data structures for your specific needs.
Consider streaming data from disk instead of loading it all into memory.
Profile and optimize memory-intensive sections of your code.
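
To make the streaming and profiling points above concrete, here is a minimal sketch; the file name data.csv and its one-integer-per-line layout are assumptions made purely for illustration, not details from the original question:

    import tracemalloc

    def stream_values(path):
        # Read the file lazily, one line at a time; Python buffers the file
        # internally, so only a small chunk is ever resident in memory.
        with open(path, "r") as fh:
            for line in fh:
                yield int(line)

    def process_file(path):
        # Aggregate on the fly instead of collecting intermediate results.
        return sum(value * 2 for value in stream_values(path))

    if __name__ == "__main__":
        tracemalloc.start()  # standard-library memory profiler
        result = process_file("data.csv")  # hypothetical input file
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print("Final Result:", result)
        print("Peak traced memory: %.1f KiB" % (peak / 1024))

Because only the running total is retained, peak memory stays roughly constant no matter how large the input file grows, and tracemalloc makes that easy to verify.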
