NVIDIA RAPIDS filter neither works nor raises warnings/errors

Question

I am using RAPIDS 23.04 and trying to read only selected columns and rows from Parquet/ORC files. However, the row filter does not seem to be applied, and I cannot find the cause. Any help would be greatly appreciated. A proof of concept is given below. Neither dask_cudf nor cudf seems to work:

Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0] on linux
>>> import cudf
>>> import dask_cudf
>>> df = cudf.DataFrame(
...     {
...         "a": list(range(200)),
...         "b": list(reversed(range(200))),
...         "c": list(range(200)),
...         "d": list(reversed(range(200))),
...     }
... )
>>> df
       a    b    c    d
0      0  199    0  199
1      1  198    1  198
2      2  197    2  197
3      3  196    3  196
4      4  195    4  195
..   ...  ...  ...  ...
195  195    4  195    4
196  196    3  196    3
197  197    2  197    2
198  198    1  198    1
199  199    0  199    0

[200 rows x 4 columns]
>>> df.to_parquet('test.parquet')
>>> df.to_orc('test.orc')
>>> cudf.read_parquet('test.parquet', columns=['a','c'], filters=[("a", "<", 150)])
       a    c
0      0    0
1      1    1
2      2    2
3      3    3
4      4    4
..   ...  ...
195  195  195
196  196  196
197  197  197
198  198  198
199  199  199

[200 rows x 2 columns]
>>> ddf = dask_cudf.read_parquet('test.parquet', columns=['a','c'], filters=[("a", "<", 150)])
>>> ddf.compute()
       a    c
0      0    0
1      1    1
2      2    2
3      3    3
4      4    4
..   ...  ...
195  195  195
196  196  196
197  197  197
198  198  198
199  199  199

[200 rows x 2 columns]
>>>

PS: My data size could be very large, hence dask_cudf is more appropriate, though in a few cases cudf could be adequate.

Answer 1

Score: 3

TL;DR: filters= currently filters row groups. Beginning with cuDF version 23.06, filters= will filter individual rows and behave exactly as you expect it to.

In the current version of cuDF, the filters= argument is used to filter row groups (rather than individual rows). This is best explained with an example:

The following snippet writes a Parquet file with 3 rows per row group:

>>> import pandas as pd
>>> import cudf
>>> df = pd.DataFrame({'a': range(20)})
>>> df.to_parquet('test.parquet', row_group_size=3)  # 3 rows per group

Some examples of using the filters= argument with cudf:

>>> cudf.read_parquet('test.parquet', filters=[('a', '=', 3)])  # row group(s) containing a == 3
   a
3  3
4  4
5  5
>>> cudf.read_parquet('test.parquet', filters=[('a', '<', 10)])  # row group(s) containing a < 10
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11
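
The outputs above are consistent with a reader that keeps or drops whole row groups based on each group's minimum and maximum values. The following pure-Python sketch mimics that behavior on plain lists; it is an illustration only, not cuDF's implementation, and the helper names (split_row_groups, group_may_match, read_row_group_filtered, read_row_filtered) are made up for this example.

```python
import operator

# Comparison functions for the row-level variant.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def split_row_groups(values, row_group_size):
    """Chunk a list of values into consecutive 'row groups'."""
    return [
        values[i:i + row_group_size]
        for i in range(0, len(values), row_group_size)
    ]


def group_may_match(group, op, value):
    """Decide from min/max alone whether a group could contain a matching row."""
    lo, hi = min(group), max(group)
    if op in ("=", "=="):
        return lo <= value <= hi
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    raise ValueError(f"unsupported operator: {op}")


def read_row_group_filtered(groups, op, value):
    """Row-group filtering: return every row of each group that may match."""
    return [v for g in groups if group_may_match(g, op, value) for v in g]


def read_row_filtered(groups, op, value):
    """Row-level filtering: return only the rows that actually match."""
    return [v for g in groups for v in g if OPS[op](v, value)]


groups = split_row_groups(list(range(20)), 3)
print(read_row_group_filtered(groups, "=", 3))   # [3, 4, 5]
print(read_row_group_filtered(groups, "<", 10))  # 0 through 11
print(read_row_filtered(groups, "<", 10))        # 0 through 9

# If all 200 rows of the question's file sit in one row group (an assumption
# about the writer's defaults), a row-group filter cannot drop anything.
single = split_row_groups(list(range(200)), 200)
print(len(read_row_group_filtered(single, "<", 150)))  # 200
```

Under that single-row-group assumption, the sketch reproduces what the question observed: the filter is accepted, no warning is raised, and all 200 rows come back.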

Admittedly, this is not very intuitive. In cuDF 23.06, we will change filters= to apply to individual rows rather than row groups. By curious coincidence, this improvement was merged into cuDF just a few minutes before you raised this question!
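
Until that release is available, one possible workaround (a suggestion added here, not part of the answer above) is to re-apply the same predicate as an ordinary boolean mask after reading. The helper below is a hypothetical sketch named apply_row_filters; it assumes a flat list of (column, op, value) tuples combined with AND, and a DataFrame-like object that supports df[column] comparisons and df[mask] indexing, as cudf, dask_cudf, and pandas DataFrames do.

```python
import operator

# Map the operator strings used in filter tuples to comparison functions.
_FILTER_OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def apply_row_filters(df, filters):
    """Apply each (column, op, value) tuple as a row-level boolean mask.

    Hypothetical helper: handles only a flat, AND-combined list of tuples,
    not nested OR-of-AND filter lists.
    """
    for column, op, value in filters:
        mask = _FILTER_OPS[op](df[column], value)
        df = df[mask]
    return df


# Intended usage with the question's file (requires a GPU and cudf, so it is
# left commented out here):
#
#   filters = [("a", "<", 150)]
#   df = cudf.read_parquet("test.parquet", columns=["a", "c"], filters=filters)
#   df = apply_row_filters(df, filters)   # keeps only rows with a < 150
```

Passing filters= to the reader still lets it skip row groups that cannot match, while the mask removes the remaining non-matching rows. The filtered columns must be among the columns that were read.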

huangapple
  • Posted on May 17, 2023, 23:22:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76273707.html