Subselect features in Dask Dataframe

Question

I have a dask dataframe ddf with a matrix ddf['X'] and a list of indices indices. I want to select the features (columns) of ddf['X'] at those indices. My current implementation is:

def subselect_variables(df):
    subset = df.iloc[:, indices]
    return subset

ddf_X = (
    ddf['X']
    .map_partitions(subselect_variables, meta={col: 'f4' for col in range(len(indices))})
)
ddf_X.to_parquet(
    my_path,
    engine='pyarrow',
    schema=my_schema,
    write_metadata_file=True,
    row_group_size=my_row_group_size,
)

But it results in the error pandas.errors.IndexingError: Too many indexers. Can someone help?

I also tried to select the features directly:

ddf_X = (
    ddf['X']
    .map_partitions(lambda df: df.iloc[:, indices], meta={col: 'f4' for col in range(len(indices))})
)

This resulted in the same error. I also tried replacing : with slice(None), which again produced the same error.
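
For reference, a minimal sketch of the setup (the data and index positions below are made up for illustration; in reality each element of ddf['X'] is a list of features):

import dask.dataframe as dd
import pandas as pd

# made-up stand-in for the real data: each row of 'X' holds one list of features
pdf = pd.DataFrame({'X': [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]})
ddf = dd.from_pandas(pdf, npartitions=1)
indices = [0, 2]  # made-up positions of the features to keep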


Answer 1

Score: 0


Thanks for your suggestions! They led me in the right direction. Indeed, if ddf['X'] is a Series, it must be treated as one-dimensional. You also need to pay attention to the meta assignment and to using an external function, as you might otherwise run out of memory. Here is the solution that worked:

def subselect_variables(df):
    # each element of the Series is a list; keep the entries at the requested indices
    subset = df.map(lambda x: [x[i] for i in indices])
    return subset

ddf_X = (
    ddf['X']
    .map_partitions(subselect_variables, meta=('X', 'f4'))
)

To write it to a Parquet file, you also need to convert it to a Dask DataFrame, e.g.:

import dask.dataframe as dd

if isinstance(ddf_X, dd.Series):
    ddf_X = ddf_X.to_frame(name='X')
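
Putting the pieces together (a sketch; my_path, my_schema, and my_row_group_size are the same placeholders as in the question, and subselect_variables is the function defined above):

ddf_X = ddf['X'].map_partitions(subselect_variables, meta=('X', 'f4'))
if isinstance(ddf_X, dd.Series):
    ddf_X = ddf_X.to_frame(name='X')
ddf_X.to_parquet(
    my_path,
    engine='pyarrow',
    schema=my_schema,
    write_metadata_file=True,
    row_group_size=my_row_group_size,
)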

Answer 2

Score: -1


You are trying to index a one-dimensional thing (a Series) with two dimensions of indexing. You may think it is 2D because each element is a list, but to pandas this just looks like a one-dimensional set of objects whose internals pandas knows nothing about. This has nothing to do with dask.
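
A minimal pandas-only reproduction of the error (made-up data):

import pandas as pd

s = pd.Series([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# a Series accepts only a single indexer; passing a second axis fails
s.iloc[:, [0, 2]]  # raises pandas.errors.IndexingError: Too many indexers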

You need to figure out how you would do this indexing in pandas before trying it in dask. Pandas is not able to index into lists in an object Series. There may be a way to do that more directly with arrow, awkward (or even numpy?), or by first expanding out the lists into columns (explode?). However, the following will work, if slowly and inefficiently.

Something like this grabs just the values you are after, but still keeps them in lists:

ddf_X = (
    ddf['X']
    # elementwise: keep only the list entries whose position is in indices
    .map(lambda value: [v for i, v in enumerate(value) if i in indices])
)

Maybe you want:

import numpy as np
import pandas as pd

ddf_X = (
    ddf['X']
    .map_partitions(
        # stack the partition's lists into a 2D array, then take the wanted columns
        lambda s: pd.DataFrame(np.array(s.tolist())[:, indices]),
        meta={col: 'f4' for col in range(len(indices))}
    )
)
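
For illustration, a runnable toy version of that approach (the data and indices are made up; the explicit 'f4' cast is an addition so the computed partitions match the declared meta):

import dask.dataframe as dd
import numpy as np
import pandas as pd

indices = [0, 2]  # made-up positions to keep
pdf = pd.DataFrame({'X': [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]})
ddf = dd.from_pandas(pdf, npartitions=1)

ddf_X = ddf['X'].map_partitions(
    lambda s: pd.DataFrame(np.asarray(s.tolist(), dtype='f4')[:, indices]),
    meta={col: 'f4' for col in range(len(indices))},
)
print(ddf_X.compute())
#      0    1
# 0  1.0  3.0
# 1  4.0  6.0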
