Subselect features in Dask Dataframe

Question

I have a Dask dataframe ddf with a matrix ddf['X'] and a list of indices, indices. I want to select the features (columns) of ddf['X'] at those indices. My current implementation is:

def subselect_variables(df):
    subset = df.iloc[:, indices]
    return subset

ddf_X = ddf['X'].map_partitions(
    subselect_variables,
    meta={col: 'f4' for col in range(len(indices))},
)

ddf_X.to_parquet(
    my_path,
    engine='pyarrow',
    schema=my_schema,
    write_metadata_file=True,
    row_group_size=my_row_group_size,
)

But it results in the error pandas.errors.IndexingError: Too many indexers. Can someone help?

I also tried selecting the features directly:

ddf_X = ddf['X'].map_partitions(
    lambda df: df.iloc[:, indices],
    meta={col: 'f4' for col in range(len(indices))},
)

This resulted in the same error, as did replacing : with slice(None).
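
For reference, each row of ddf['X'] holds a feature list, so a minimal setup that reproduces the error looks roughly like this (a sketch with illustrative values):

import dask.dataframe as dd
import pandas as pd

indices = [0, 2]
pdf = pd.DataFrame({'X': [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]})
ddf = dd.from_pandas(pdf, npartitions=1)

# each partition of ddf['X'] is a pandas Series of list objects, so the
# 2-D indexer .iloc[:, indices] raises IndexingError: Too many indexers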


Answer 1

Score: 0


Thanks for your suggestions! They led me in the right direction. Indeed, if ddf['X'] is a Series, it must be treated as one-dimensional. You also need to consider the meta assignment and the external function, as you might otherwise run out of memory. Here is the solution that worked:

def subselect_variables(df):
    # pick the requested positions out of each list element
    subset = df.map(lambda x: [x[i] for i in indices])
    return subset

ddf_X = ddf['X'].map_partitions(subselect_variables, meta=('X', 'f4'))

To write it to a Parquet file, you also need to cast it to a Dask DataFrame, e.g.:

import dask.dataframe as dd

if isinstance(ddf_X, dd.Series):
    ddf_X = ddf_X.to_frame(name='X')
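
For completeness, combining these pieces with the to_parquet call from the question gives the end-to-end flow (a sketch; my_path, my_schema, and my_row_group_size are the placeholders from the original post):

ddf_X = ddf['X'].map_partitions(subselect_variables, meta=('X', 'f4'))

# to_parquet expects a DataFrame, so cast the Series first
if isinstance(ddf_X, dd.Series):
    ddf_X = ddf_X.to_frame(name='X')

ddf_X.to_parquet(
    my_path,
    engine='pyarrow',
    schema=my_schema,
    write_metadata_file=True,
    row_group_size=my_row_group_size,
)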

Answer 2

Score: -1


You are trying to index a one-dimensional object (a Series) with two dimensions of indexers. You may think it is 2-D because each element is a list, but to pandas it just looks like a one-dimensional collection of objects whose internals pandas knows nothing about. This has nothing to do with dask.

You need to figure out how you would do this indexing in pandas before trying it in dask. Pandas is not able to index into lists inside an object Series. There may be a way to do it more directly with arrow, awkward (or even numpy?), or by first expanding the lists out into columns (explode? — a pandas-level sketch of that idea follows the first snippet below). However, the following will work, if slowly and inefficiently.

Something like this grabs just the values you are after, but still keeps them in lists:

ddf_X = (
    ddf['X']
    .map(lambda value: [v for i, v in enumerate(value) if i in indices])
) 
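
At the pandas level, the "expand the lists into columns first" idea looks roughly like this (a sketch with made-up values; pd.DataFrame(s.tolist()) widens the lists into columns, whereas explode would flatten them into extra rows):

import pandas as pd

indices = [0, 2]
s = pd.Series([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# widen the list column into real columns; positional indexing
# along axis 1 then works as usual
wide = pd.DataFrame(s.tolist(), index=s.index)
subset = wide.iloc[:, indices]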

Maybe you want

import numpy as np
import pandas as pd

ddf_X = (
    ddf['X']
    .map_partitions(
        # stack the lists into a 2-D array, then slice out the wanted columns
        lambda s: pd.DataFrame(np.array(s.tolist())[:, indices]),
        meta={col: 'f4' for col in range(len(indices))},
    )
)
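
Note that np.array(s.tolist()) only yields a proper 2-D array when every list has the same length; with ragged lists numpy falls back to a one-dimensional object array and the [:, indices] slice fails, in which case the list-comprehension version above is the safer choice.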
