NVIDIA RAPIDS filter neither works nor raises warnings/errors
Question
I am using RAPIDS 23.04 and trying to read selectively from parquet/orc files based on selected columns and rows. However, strangely, the row filter is not working and I am unable to find the cause. Any help would be greatly appreciated. A proof of concept is given below; neither dask_cudf nor cudf seems to work:
Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cudf
>>> import dask_cudf
>>> df = cudf.DataFrame(
...     {
...         "a": list(range(200)),
...         "b": list(reversed(range(200))),
...         "c": list(range(200)),
...         "d": list(reversed(range(200))),
...     }
... )
>>> df
       a    b    c    d
0      0  199    0  199
1      1  198    1  198
2      2  197    2  197
3      3  196    3  196
4      4  195    4  195
..   ...  ...  ...  ...
195  195    4  195    4
196  196    3  196    3
197  197    2  197    2
198  198    1  198    1
199  199    0  199    0

[200 rows x 4 columns]
>>> df.to_parquet('test.parquet')
>>> df.to_orc('test.orc')
>>> cudf.read_parquet('test.parquet', columns=['a','c'], filters=[("a", "<", 150)])
       a    c
0      0    0
1      1    1
2      2    2
3      3    3
4      4    4
..   ...  ...
195  195  195
196  196  196
197  197  197
198  198  198
199  199  199

[200 rows x 2 columns]
>>> ddf = dask_cudf.read_parquet('test.parquet', columns=['a','c'], filters=[("a", "<", 150)])
>>> ddf.compute()
       a    c
0      0    0
1      1    1
2      2    2
3      3    3
4      4    4
..   ...  ...
195  195  195
196  196  196
197  197  197
198  198  198
199  199  199

[200 rows x 2 columns]
>>>
PS: My data size could be very large, hence dask_cudf is more appropriate, though in a few cases cudf could be adequate.
Answer 1
Score: 3
TL;DR: filters= currently selects row groups. Beginning with cuDF version 23.06, filters= will filter individual rows and behave exactly as you expect it to.

In the current version of cuDF, the filters= argument is used to filter row groups (rather than individual rows). This is best explained with an example:
The following snippet writes a Parquet file with 3 rows per row group:
>>> import pandas as pd
>>> df = pd.DataFrame({'a': range(20)})
>>> df.to_parquet('test.parquet', row_group_size=3)  # 3 rows per row group
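One way to verify this layout (an illustrative aside; it assumes pyarrow is installed, which it is as a cuDF dependency) is to inspect the file's row-group metadata. The per-group min/max statistics shown here are what the filter predicate is evaluated against:

>>> import pyarrow.parquet as pq
>>> pf = pq.ParquetFile('test.parquet')
>>> pf.num_row_groups  # 20 rows in groups of 3 -> 7 row groups
7
>>> stats = pf.metadata.row_group(1).column(0).statistics  # the group holding rows 3-5
>>> (stats.min, stats.max)
(3, 5)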
Some examples of using the filters= argument with cuDF:
>>> cudf.read_parquet('test.parquet', filters=[('a', '=', 3)]) # row group(s) containing a == 3
   a
3  3
4  4
5  5
>>> cudf.read_parquet('test.parquet', filters=[('a', '<', 10)]) # row group(s) containing a < 10
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11
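For contrast, a predicate that happens to align with a row-group boundary should return exactly the rows you expect, since whole groups either match or are pruned:

>>> cudf.read_parquet('test.parquet', filters=[('a', '<', 9)])  # groups 0-2 match in full
   a
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8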
Admittedly, this is not very intuitive: in the a < 10 example above, rows 10 and 11 come along only because they sit in the same row group as row 9. In cuDF 23.06, we will change filters= to apply to individual rows, rather than row groups. By curious coincidence, this improvement was merged into cuDF just a few minutes before you raised this question!
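Until 23.06, one workaround is to let filters= prune row groups for I/O and then enforce the predicate with an ordinary boolean mask. This is a minimal sketch of that pattern (using the 20-row file written above), and the same mask works lazily with dask_cudf:

>>> # Row-group pruning limits what is read; the mask then makes the filter exact.
>>> gdf = cudf.read_parquet('test.parquet', filters=[('a', '<', 10)])  # rows 0-11
>>> gdf = gdf[gdf['a'] < 10]  # drop the row-group stragglers (rows 10 and 11)
>>> len(gdf)
10
>>> import dask_cudf
>>> ddf = dask_cudf.read_parquet('test.parquet', filters=[('a', '<', 10)])
>>> len(ddf[ddf['a'] < 10].compute())  # same result, evaluated lazily per partition
10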