Question

I have a large dataset, in the hundreds-of-gigabytes range. I am reading the file with Polars' scan_csv, which returns a LazyFrame, because it is memory efficient. I need to return an arbitrary item quickly on demand. My first attempt used slice. I hoped this would be fast, and it is for items near the beginning, but for items near the end it is very slow. Is there a faster way to do this?

Code to reproduce

import polars as pl

# A_very_large_text_file is the path to the CSV; index is the 0-based row to fetch
df = pl.scan_csv(A_very_large_text_file, has_header=False)
df.slice(index, 1).collect().item()

This retrieves items near the beginning of the file quickly, but slows down considerably for items near the end.

Answer 1

Score: 1

There isn't a quick way to do this with a LazyFrame created by scan_csv, because to return a row near the end the reader has to scan through almost the whole file.

This is a shortcoming of the CSV format: rows have no fixed length, so a reader can only reach an arbitrary line by scanning through the file and counting the \n characters that mark the end of each line.
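One way to pay that scanning cost only once is to record where each line starts, then seek directly on later lookups. The sketch below is not part of the original answer: build_line_index and read_line are hypothetical helper names, the index layout (one 8-byte offset per line) is an arbitrary choice, and it assumes no field contains an embedded newline.

```python
import struct


def build_line_index(csv_path, index_path):
    """One sequential pass: store the byte offset of every line start as an 8-byte int."""
    with open(csv_path, "rb") as src, open(index_path, "wb") as idx:
        offset = 0
        for line in src:  # binary iteration splits on b"\n"
            idx.write(struct.pack("<Q", offset))
            offset += len(line)


def read_line(csv_path, index_path, n):
    """Fetch line n (0-based) with two seeks instead of a scan."""
    with open(index_path, "rb") as idx:
        idx.seek(8 * n)
        (offset,) = struct.unpack("<Q", idx.read(8))
    with open(csv_path, "rb") as src:
        src.seek(offset)
        return src.readline().decode().rstrip("\r\n")
```

Building the index still reads the whole file once, but every lookup after that costs the same whether the line is near the start or the end.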

If you don't care which line you get, you can seek to a random byte position in the file, skip to the end of the line you landed in, and take the next full line, although polars isn't optimized for that. The catch is that the result is not uniformly random: a line that follows a longer line has a greater chance of being selected. Depending on the variance in line length and how much the randomness matters, that may make this approach unusable.

Notwithstanding the disclaimer, you could do:

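import os
import random
import polars as pl
with open(A_very_large_text_file, "rb") as ff:  # binary mode: seek() accepts any byte offset
    ff.seek(random.randrange(os.path.getsize(A_very_large_text_file)))
    ff.readline()  # discard the partial line we landed in
    line = ff.readline()
    if not line:  # landed in the last line, so wrap around to the first
        ff.seek(0)
        line = ff.readline()
fields = line.decode().rstrip("\r\n").split(",")
randomish_row = pl.DataFrame({f"col{i}": [x] for i, x in enumerate(fields)})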

Alternatively, use pyarrow to convert your CSV file into a Parquet file with multiple row groups, then create your LazyFrame with scan_parquet. Because Parquet files are highly structured, with metadata recording where each row group starts and how many rows it holds, a reader can jump to a given part of the file much more efficiently.
