Is there a faster way to select an item from a Polars LazyFrame?
Question
I have a large dataset, in the hundreds-of-gigabytes range. I am using Polars' LazyFrame scan_csv to read the file because it is memory efficient. I need to return an arbitrary item quickly at any given time. My first attempt was to use slice. I was hoping this would be fast, and it is for items near the beginning, but for items near the end it is very slow. Is there a faster way to do this?
Code to reproduce
import polars as pl

# A_very_large_text_file is the path to the CSV; index is the zero-based row to fetch.
df = pl.scan_csv(A_very_large_text_file, has_header=False)
df.slice(index, 1).collect().item()
This can quickly retrieve items near the beginning of the file, but slows way down for items near the end.
Answer 1
Score: 1
There isn't a quick way to do this with a LazyFrame that originates from scan_csv, because to reach a row near the end it has to scan nearly the whole file.
This is a shortcoming of the CSV format: the reader can only reach an arbitrary line by scanning through the file line by line, looking for the \n character that marks the end of each line.
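If the same file will be queried many times, one way to pay that scanning cost only once is to record the byte offset at which every line starts and then seek straight to the line you want. The following is a minimal sketch using only the standard library, not something Polars provides; the helper names build_line_index and read_line are hypothetical, and it assumes no field contains an embedded newline.

```python
from array import array


def build_line_index(path):
    """Scan the file once and record the byte offset where each line starts."""
    offsets = array("Q", [0])  # unsigned 64-bit integers, 8 bytes per line
    with open(path, "rb") as f:
        for line in f:  # binary iteration splits on b"\n"
            offsets.append(offsets[-1] + len(line))
    offsets.pop()  # the final entry is end-of-file, not the start of a line
    return offsets


def read_line(path, offsets, index):
    """Jump directly to line `index` (zero-based) and return it without its newline."""
    with open(path, "rb") as f:
        f.seek(offsets[index])
        return f.readline().decode().rstrip("\r\n")
```

Building the index still reads the whole file once, and the index itself costs 8 bytes per line, so this only pays off when many lookups follow.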
If you don't care which line you get, you could just seek to a random place in the file, find the end of that line, and take the next full line, though Polars isn't optimized to do that. This approach is problematic because lines that follow longer lines have a greater chance of being selected, so depending on the variance in line length and how much true randomness matters, it may be unusable.
Notwithstanding the disclaimer, you could do:
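import os
import random
import polars as pl

# Binary mode: text-mode seek() is only defined for offsets returned by tell().
with open(A_very_large_text_file, "rb") as ff:
    ff.seek(random.randrange(os.path.getsize(A_very_large_text_file)))
    ff.readline()  # ignore partial line
    # Empty if the seek landed in the last line; a plain split ignores CSV quoting.
    fields = ff.readline().decode().rstrip("\r\n").split(",")
randomish_row = pl.DataFrame({f"col{i}": [x] for i, x in enumerate(fields)})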
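To make the bias mentioned above concrete: with this seek-then-skip approach, a line is returned exactly when the random offset lands somewhere in the line before it, so its chance of selection is proportional to the length of its predecessor. The sketch below computes those probabilities for a made-up set of line lengths (in bytes, newline included); the function name selection_probabilities is hypothetical.

```python
from fractions import Fraction


def selection_probabilities(line_lengths):
    """Chance that each line is returned by seeking to a uniform random byte."""
    total = sum(line_lengths)
    # Line 0 is never returned; line i is returned when the offset lands in line i - 1.
    return [Fraction(0)] + [Fraction(n, total) for n in line_lengths[:-1]]


lengths = [10, 90, 10, 10]  # illustrative values only
probs = selection_probabilities(lengths)
for i, p in enumerate(probs):
    print(f"line {i}: {p}")
print(f"no line returned: {1 - sum(probs)}")
```

With these lengths, the line that follows the 90-byte line is picked nine times as often as its neighbours, the first line is never picked, and landing in the last line returns nothing at all.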
Alternatively, use pyarrow to convert your CSV file into a Parquet file with multiple row groups. Then you can create your LazyFrame with scan_parquet. Since Parquet files are highly structured, the reader can jump to a random part of the file much more efficiently.