Subselect features in Dask Dataframe
Question
I have a Dask DataFrame ddf with a matrix ddf['X'] and a list of indices indices. I want to select the features (columns) of ddf['X'] at those indices. My current implementation is:
def subselect_variables(df):
    subset = df.iloc[:, indices]
    return subset

ddf_X = (
    ddf['X']
    .map_partitions(subselect_variables, meta={col: 'f4' for col in range(len(indices))})
)
ddf_X.to_parquet(
    my_path,
    engine='pyarrow',
    schema=my_schema,
    write_metadata_file=True,
    row_group_size=my_row_group_size
)
But it results in the error pandas.errors.IndexingError: Too many indexers. Can someone help?
I also tried to select the features directly:
ddf_X = (
    ddf['X']
    .map_partitions(lambda df: df.iloc[:, indices], meta={col: 'f4' for col in range(len(indices))})
)
This resulted in the same error. I also tried replacing : with slice(None), which also resulted in the same error.
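For context, a minimal sketch that reproduces this setup; the toy data, the single partition, and the assumption that each element of ddf['X'] is a list of floats are illustrative only, based on the answers below:

import dask.dataframe as dd
import pandas as pd

# Hypothetical minimal setup: each element of 'X' is a list of floats,
# so ddf['X'] is a one-dimensional Series of objects, not a 2-D matrix.
pdf = pd.DataFrame({'X': [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]})
ddf = dd.from_pandas(pdf, npartitions=1)
indices = [0, 2]

# .iloc[:, indices] on each partition (a Series) raises
# pandas.errors.IndexingError: Too many indexers,
# because a Series has only one axis.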
Answer 1
Score: 0
Thanks for your suggestions! They led me in the right direction. Indeed, if ddf['X'] is a Series, it must be treated as one-dimensional. You also need to consider the meta assignment and the external function, as you might otherwise run out of memory. Here is the solution that worked:
def subselect_variables(df):
    subset = df.map(lambda x: [x[i] for i in indices])
    return subset

ddf_X = (
    ddf['X']
    .map_partitions(subselect_variables, meta=('X', 'f4'))
)
To write it to a Parquet file, you also need to cast it to a Dask DataFrame, e.g.:
if isinstance(ddf_X, dd.Series):
    ddf_X = ddf_X.to_frame(name='X')
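Putting the pieces together, an end-to-end sketch might look like the following; my_path and my_row_group_size are placeholders carried over from the question, and schema=my_schema is omitted since the original schema may no longer match the reshaped column:

import dask.dataframe as dd

def subselect_variables(df):
    # Keep only the requested positions within each list element
    return df.map(lambda x: [x[i] for i in indices])

ddf_X = ddf['X'].map_partitions(subselect_variables, meta=('X', 'f4'))

# to_parquet expects a DataFrame, so cast the Series first
if isinstance(ddf_X, dd.Series):
    ddf_X = ddf_X.to_frame(name='X')

ddf_X.to_parquet(
    my_path,
    engine='pyarrow',
    write_metadata_file=True,
    row_group_size=my_row_group_size,
)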
Answer 2
Score: -1
You are trying to index a one-dimensional thing (Series) with two dimensions of indexing. You may think it is 2D because each element is a list, but to pandas this just looks like a one-dimensional set of objects whose internals pandas knows nothing about. This has nothing to do with dask.
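The same error is easy to reproduce in plain pandas (a minimal sketch with made-up data):

import pandas as pd

s = pd.Series([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
# A Series has only one axis, so a two-part indexer is rejected:
s.iloc[:, [0, 2]]  # raises pandas.errors.IndexingError: Too many indexers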
You need to figure out how you would do this indexing in pandas before trying it in dask. Pandas is not able to index into lists in an object series. There may be a way to do this more directly with arrow, awkward (or even numpy?), or by first expanding the lists out into columns (explode?). However, the following will work, if slowly and inefficiently.
Something like this grabs just the values you are after, but still keeps them in lists:
ddf_X = (
    ddf['X']
    .map(lambda value: [v for i, v in enumerate(value) if i in indices])
)
Maybe you want:
ddf_X = (
    ddf['X']
    .map_partitions(
        lambda s: pd.DataFrame(np.array(s.tolist())[:, indices]),
        meta={col: 'f4' for col in range(len(indices))}
    )
)
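As a rough illustration of the "expand the lists out into columns first" idea mentioned above, here is a sketch in plain pandas; the toy data is made up, and it assumes every list has the same length:

import pandas as pd

s = pd.Series([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
indices = [0, 2]

# Expand the object Series into a 2-D frame, then select columns by position
expanded = pd.DataFrame(s.tolist(), index=s.index)
subset = expanded.iloc[:, indices]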