How to create new columns based on a grouping method for one column in Polars?
Question
I have some data structured as shown below, and I would like to restructure the dataframe.
Short piece of the initial data:
id | time | value |
---|---|---|
2050 | 02-01 | 20 |
2051 | 02-01 | 25 |
2050 | 02-02 | 21 |
2051 | 02-02 | 22 |
2051 | 02-03 | 23 |
I would like the restructured dataframe to have a timestamp column followed by one column for each externallogid (the id column in the sample). I have done this with pandas, but since the file is huge and must be processed multiple times, I would like to do it in Polars for speed.
Expected output:
time | 2050 | 2051 |
---|---|---|
02-01 | 20 | 25 |
02-02 | 21 | 22 |
02-03 | nan | 23 |
I have tried the groupby function as well as join/hstack/concat, but I run into problems when also trying to use LazyFrames.
Thanks
To produce the data:
import polars as pl
lf = pl.DataFrame({'id': [2050, 2051, 2050, 2051, 2051],
                   'time': ['2023-05-01',
                            '2023-05-01',
                            '2023-05-02',
                            '2023-05-02',
                            '2023-05-03'],
                   'value': [20, 25, 21, 22, 23]})
# Parse the string column into a datetime column
lf = lf.with_columns(pl.col("time").str.to_datetime("%Y-%m-%d"))
Answer 1
Score: 1
You should pivot:
In [29]: lf.pivot(columns='id', values='value', index='time', aggregate_function=None)
Out[29]:
shape: (3, 3)
┌─────────────────────┬──────┬──────┐
│ time                ┆ 2050 ┆ 2051 │
│ ---                 ┆ ---  ┆ ---  │
│ datetime[μs]        ┆ i64  ┆ i64  │
╞═════════════════════╪══════╪══════╡
│ 2023-05-01 00:00:00 ┆ 20   ┆ 25   │
│ 2023-05-02 00:00:00 ┆ 21   ┆ 22   │
│ 2023-05-03 00:00:00 ┆ null ┆ 23   │
└─────────────────────┴──────┴──────┘