NVIDIA RAPIDS filter neither works nor raises warnings/errors
Question
I am using RAPIDS 23.04 and trying to read selectively from parquet/orc files based on selected columns and rows. However, strangely, the row filter is not working and I am unable to find the cause. Any help would be greatly appreciated. A proof of concept is given below; neither dask_cudf nor cudf seems to work:
Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cudf
>>> import dask_cudf
>>> df = cudf.DataFrame(
...     {
...         "a": list(range(200)),
...         "b": list(reversed(range(200))),
...         "c": list(range(200)),
...         "d": list(reversed(range(200))),
...     }
... )
>>> df
       a    b    c    d
0      0  199    0  199
1      1  198    1  198
2      2  197    2  197
3      3  196    3  196
4      4  195    4  195
..   ...  ...  ...  ...
195  195    4  195    4
196  196    3  196    3
197  197    2  197    2
198  198    1  198    1
199  199    0  199    0

[200 rows x 4 columns]
>>> df.to_parquet('test.parquet')
>>> df.to_orc('test.orc')
>>> cudf.read_parquet('test.parquet', columns=['a','c'], filters=[("a", "<", 150)])
       a    c
0      0    0
1      1    1
2      2    2
3      3    3
4      4    4
..   ...  ...
195  195  195
196  196  196
197  197  197
198  198  198
199  199  199

[200 rows x 2 columns]
>>> ddf = dask_cudf.read_parquet('test.parquet', columns=['a','c'], filters=[("a", "<", 150)])
>>> ddf.compute()
       a    c
0      0    0
1      1    1
2      2    2
3      3    3
4      4    4
..   ...  ...
195  195  195
196  196  196
197  197  197
198  198  198
199  199  199

[200 rows x 2 columns]
>>>
PS: My data size could be very large, hence dask_cudf is more appropriate, though in a few cases cudf could be adequate.
Answer 1
Score: 3
TL;DR: filters= currently selects row groups. Beginning with cuDF version 23.06, filters= will filter individual rows and behave exactly as you expect it to.

In the current version of cuDF, the filters= argument is used to filter row groups (rather than individual rows). This is best explained with an example:
The following snippet writes a Parquet file with 3 rows per row group:
>>> import pandas as pd
>>> df = pd.DataFrame({'a': range(20)})
>>> df.to_parquet('test.parquet', row_group_size=3)  # 3 rows per row group
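One way to verify this layout (an illustrative aside; it assumes pyarrow is installed, which it is as a cuDF dependency) is to inspect the file's row-group metadata. The per-group min/max statistics shown here are what the filter predicate is evaluated against:

>>> import pyarrow.parquet as pq
>>> pf = pq.ParquetFile('test.parquet')
>>> pf.num_row_groups  # 20 rows in groups of 3 -> 7 row groups
7
>>> stats = pf.metadata.row_group(1).column(0).statistics  # the group holding rows 3-5
>>> (stats.min, stats.max)
(3, 5)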
Some examples of using the filters= argument with cuDF:
>>> cudf.read_parquet('test.parquet', filters=[('a', '=', 3)]) # row group(s) containing a == 3
   a
3  3
4  4
5  5
>>> cudf.read_parquet('test.parquet', filters=[('a', '<', 10)]) # row group(s) containing a < 10
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11
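For contrast, a predicate that happens to align with a row-group boundary should return exactly the rows you expect, since whole groups either match or are pruned:

>>> cudf.read_parquet('test.parquet', filters=[('a', '<', 9)])  # groups 0-2 match in full
   a
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8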
Admittedly, this is not very intuitive: in the a < 10 example above, rows 10 and 11 come along only because they sit in the same row group as row 9. In cuDF 23.06, we will change filters= to apply to individual rows, rather than row groups. By curious coincidence, this improvement was merged into cuDF just a few minutes before you raised this question!
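Until 23.06, one workaround is to let filters= prune row groups for I/O and then enforce the predicate with an ordinary boolean mask. This is a minimal sketch of that pattern (using the 20-row file written above), and the same mask works lazily with dask_cudf:

>>> # Row-group pruning limits what is read; the mask then makes the filter exact.
>>> gdf = cudf.read_parquet('test.parquet', filters=[('a', '<', 10)])  # rows 0-11
>>> gdf = gdf[gdf['a'] < 10]  # drop the row-group stragglers (rows 10 and 11)
>>> len(gdf)
10
>>> import dask_cudf
>>> ddf = dask_cudf.read_parquet('test.parquet', filters=[('a', '<', 10)])
>>> len(ddf[ddf['a'] < 10].compute())  # same result, evaluated lazily per partition
10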