2023-03-31 03:09:09

Polars - Glob read of Parquet from S3 only reads the first file

Question

I am trying to read some Parquet files from S3 using Polars.

These files were generated by Redshift using UNLOAD with PARALLEL ON.

The four files are: 0000_part_00.parquet, 0001_part_00.parquet, 0002_part_00.parquet, 0003_part_00.parquet

When I call pl.read_parquet("s3://my_bucket/my_folder/*.parquet"), it returns only the contents of the first file (0000_part_00.parquet): 340 rows.

The weird thing is that running the same command locally, pl.read_parquet("*.parquet"), returns all the rows: 1239 rows.

Is this normal behavior, or am I missing something?

Answer 1

Score: 1

> Is this normal behavior, or am I missing something?

That's normal, and you're missing something.

From the docs:

> Parameters
> ----------
> source
> Path to a file, or a file-like object. If the path is a directory, that
> directory will be used as partition aware scan.
> If fsspec is installed, it will be used to open remote files.

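On your local system, it identifies the path as a directory and uses a partition-aware scan, so every file is read. When accessing S3, it uses fsspec to open the remote file and does not expand the * in the path as a glob.

Instead, build a pyarrow dataset over the folder and scan it with Polars: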

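import polars as pl
import pyarrow.dataset as ds
import fsspec  # the 's3' protocol is provided by the s3fs package

# Build the S3 filesystem; "xxxx" is a placeholder for your own connection settings
s3fs = fsspec.filesystem('s3', connection_string="xxxx")
# Point the dataset at the folder so every Parquet file in it is discovered
myds = ds.dataset("s3://my_bucket/my_folder/", filesystem=s3fs)
# Scan the dataset lazily, then collect all files into a single DataFrame
df = pl.scan_pyarrow_dataset(myds).collect()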
