Is there a faster way to select an item from a Polars LazyFrame?

Question

I have a large dataset, in the hundreds-of-gigabytes range. I am using Polars' LazyFrame scan_csv to read the file because it is memory efficient. I need to be able to return any random item quickly at any given time. My first attempt was to use slice. I was hoping this would be fast, and it is for items near the beginning, but for items near the end it is very slow. Is there a faster way to do this?

Code to reproduce

import polars as pl

# A_very_large_text_file is the path to the CSV; index is the zero-based row to fetch
df = pl.scan_csv(A_very_large_text_file, has_header=False)
df.slice(index, 1).collect().item()

This can quickly retrieve items near the beginning of the file, but slows way down for items near the end.

Answer 1

Score: 1

There isn't a quick way to do this with a LazyFrame created by scan_csv: to get a row towards the end, the reader has to scan through everything that comes before it.

This is a shortcoming of the CSV format. Lines have no fixed width, so a reader can only get to an arbitrary line by scanning through the file line by line, looking for the \n character that marks the end of each line.
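To make that cost concrete, here is a minimal sketch (not something this answer proposes) of what "scanning for \n" amounts to: one full pass that records where each line starts, after which any line can be reached with a single seek. The helper names build_line_offsets and read_line_at are made up for illustration, and the sketch assumes no field contains a quoted newline.

```python
import os
import tempfile


def build_line_offsets(path):
    """Scan the file once and return the byte offset at which each line starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for raw in f:  # iterating a binary file splits on b"\n"
            offsets.append(pos)
            pos += len(raw)
    return offsets


def read_line_at(path, offsets, index):
    """Jump straight to line `index` using the precomputed offsets."""
    with open(path, "rb") as f:
        f.seek(offsets[index])
        return f.readline().decode().rstrip("\r\n")


# Tiny demonstration file whose rows have different lengths.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write("a,1\nbbbbbbbb,2\ncc,3\n")
    demo_path = tmp.name

demo_offsets = build_line_offsets(demo_path)
third_row = read_line_at(demo_path, demo_offsets, 2)
os.remove(demo_path)
```

The full pass is exactly the work a CSV reader repeats on every lookup near the end of the file. For a file with billions of lines, a plain Python list of offsets would itself be large, so this is only meant to show the idea.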

If you don't care which line you get, you could just seek to a random place in the file, find the end of the line you landed in, and take the next full line, although Polars isn't optimized for that. The catch is that this is biased: lines that follow longer lines have a greater chance of being selected. Depending on the variance in line length and how much true randomness matters, that might make the approach unusable.

Notwithstanding the disclaimer, you could do:

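import os
import random

import polars as pl

size = os.path.getsize(A_very_large_text_file)
with open(A_very_large_text_file, "rb") as ff:  # binary mode so seek() accepts any byte offset
    ff.seek(random.randrange(size))
    ff.readline()  # discard the (probably partial) line we landed in
    line = ff.readline().decode().rstrip("\r\n")  # empty if we landed in the last line
    randomish_row = pl.DataFrame({f"col{i}": [x] for i, x in enumerate(line.split(","))})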
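To get a feel for how strong the bias described above can be, here is a small simulation on an in-memory stand-in for a file. The three-line data and the helper pick_after_random_seek are invented for this illustration. It mimics the seek, skip, read-next-line steps and counts which line comes back.

```python
import random
from collections import Counter

# Hypothetical three-line "file": one long line followed by two short ones.
lines = [b"x" * 90 + b"\n", b"short1\n", b"short2\n"]
data = b"".join(lines)


def pick_after_random_seek(rng):
    """Mimic seek + skip the partial line + read the next line."""
    pos = rng.randrange(len(data))
    end_of_partial = data.index(b"\n", pos) + 1  # skip the line we landed in
    if end_of_partial >= len(data):
        return None  # landed in the last line, nothing follows it
    end_of_next = data.index(b"\n", end_of_partial) + 1
    return data[end_of_partial:end_of_next]


rng = random.Random(0)  # seeded so the run is repeatable
counts = Counter(pick_after_random_seek(rng) for _ in range(10_000))
```

In this setup 91 of the 105 byte positions lead to "short1", only 7 lead to "short2", and the first line can never be returned at all. Real files with similar line lengths would be far less skewed, but the first line is always unreachable this way.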

Alternatively, use pyarrow to convert your CSV file into a Parquet file with multiple row groups, then create your LazyFrame with scan_parquet. Because Parquet files are highly structured, a reader can jump to a random part of the file much more efficiently.

huangapple
  • Posted on June 13, 2023, 04:56:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76460282.html