Pandas: Enforcing consistent values for inner index across all outer index values
Question
I have a dataset indexed by entity_id and timestamp, but some entity_ids do not have entries at every timestamp (not missing values, just no row at all). I'm trying to enforce consistent timestamps across the entity_ids before some complicated NaN handling and resampling. However, I cannot get reindex to create the rows I expected, which leads to unexpected behavior downstream. My approach was:
import numpy as np
import pandas as pd
df = pd.DataFrame(columns = ["id", "ts", "value"])
df.loc[0,:] = [1, pd.Timestamp("2022-01-01 00:00:00"), 1]
df.loc[1,:] = [1, pd.Timestamp("2022-01-01 00:00:01"), 2]
df.loc[2,:] = [2, pd.Timestamp("2022-01-01 00:00:00"), 3]
df = df.set_index(["id", "ts"])
df
# Grab all the timestamps
timestamps = df.index.get_level_values("ts").unique().sort_values()
# Perform the reindexing
df2 = df.reindex(timestamps, level = 1, axis = 0, fill_value = np.nan)
However, this leaves my dataframe unchanged, i.e., df2 still only has 3 rows. Maybe reindexing isn't the right approach here, but I thought it would work.
Is there a best practice for this sort of operation?
Thank you!
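To make the gap described above concrete, here is a minimal sketch that lists the (id, ts) pairs that have no row at all. It rebuilds the sample data in a single constructor call (an assumption made for brevity, not the original row-by-row construction), and the names `full` and `missing` are illustrative only.

```python
import pandas as pd

# Same sample data as in the question, built in one call.
df = pd.DataFrame(
    {
        "id": [1, 1, 2],
        "ts": pd.to_datetime(
            ["2022-01-01 00:00:00", "2022-01-01 00:00:01", "2022-01-01 00:00:00"]
        ),
        "value": [1, 2, 3],
    }
).set_index(["id", "ts"])

# Every (id, ts) combination that should exist.
full = pd.MultiIndex.from_product(
    [
        df.index.get_level_values("id").unique(),
        df.index.get_level_values("ts").unique(),
    ],
    names=df.index.names,
)

# Combinations that have no row in df.
missing = full.difference(df.index)
print(missing.tolist())
```

For this sample, the only pair without a row is id 2 at 00:00:01.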
Answer 1
Score: 1
Use:
import pandas as pd

# added sample data
df = pd.DataFrame(columns=["id", "ts", "value"])
df.loc[0, :] = [1, pd.Timestamp("2022-01-01 00:00:00"), 1]
df.loc[1, :] = [1, pd.Timestamp("2022-01-01 00:00:01"), 2]
df.loc[2, :] = [2, pd.Timestamp("2022-01-01 00:00:00"), 3]
df.loc[3, :] = [3, pd.Timestamp("2022-01-01 00:00:04"), 4]
df = df.set_index(["id", "ts"])
print(df)
                       value
id ts
1  2022-01-01 00:00:00     1
   2022-01-01 00:00:01     2
2  2022-01-01 00:00:00     3
3  2022-01-01 00:00:04     4
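If you need to add the missing consecutive datetimes, build them with [`date_range`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html) from the minimal and maximal timestamps, create a [`MultiIndex.from_product`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_product.html) of all ids and dates, and pass it to [`DataFrame.reindex`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html):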
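# one-second grid from the earliest to the latest timestamp
# (lowercase "s" is used because the uppercase "S" alias is deprecated in recent pandas)
dates = pd.date_range(df.index.levels[1].min(), df.index.levels[1].max(), freq="s")
# all combinations of ids and dates
mux = pd.MultiIndex.from_product([df.index.levels[0], dates], names=df.index.names)
out1 = df.reindex(mux)
print(out1)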
                       value
id ts
1  2022-01-01 00:00:00     1
   2022-01-01 00:00:01     2
   2022-01-01 00:00:02   NaN
   2022-01-01 00:00:03   NaN
   2022-01-01 00:00:04   NaN
2  2022-01-01 00:00:00     3
   2022-01-01 00:00:01   NaN
   2022-01-01 00:00:02   NaN
   2022-01-01 00:00:03   NaN
   2022-01-01 00:00:04   NaN
3  2022-01-01 00:00:00   NaN
   2022-01-01 00:00:01   NaN
   2022-01-01 00:00:02   NaN
   2022-01-01 00:00:03   NaN
   2022-01-01 00:00:04     4
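If you only need to reindex by the unique values of both levels of the MultiIndex (without filling in the timestamps in between):
# all combinations of the existing ids and the existing timestamps
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
out2 = df.reindex(mux)
print(out2)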
                       value
id ts
1  2022-01-01 00:00:00     1
   2022-01-01 00:00:01     2
   2022-01-01 00:00:04   NaN
2  2022-01-01 00:00:00     3
   2022-01-01 00:00:01   NaN
   2022-01-01 00:00:04   NaN
3  2022-01-01 00:00:00   NaN
   2022-01-01 00:00:01   NaN
   2022-01-01 00:00:04     4
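One caveat with `df.index.levels`: a MultiIndex keeps all of its defined levels even after rows are filtered out, so a subset of the data can still "remember" ids or timestamps it no longer contains. The sketch below wraps both variants of the approach in a hypothetical helper, `complete_index` (the name and its `freq` parameter are my own, not part of pandas), and reads the values actually present with `get_level_values(...).unique()` instead. It assumes the index has exactly two levels, id first and timestamp second.

```python
import pandas as pd


def complete_index(df, freq=None):
    """Give every id the same set of timestamps (hypothetical helper).

    With freq=None the observed timestamps are used; otherwise a regular
    grid from the earliest to the latest observed timestamp is built.
    """
    ids = df.index.get_level_values(0).unique()
    ts = df.index.get_level_values(1).unique().sort_values()
    if freq is not None:
        ts = pd.date_range(ts.min(), ts.max(), freq=freq)
    full = pd.MultiIndex.from_product([ids, ts], names=df.index.names)
    return df.reindex(full)


df = pd.DataFrame(
    {
        "id": [1, 1, 2, 3],
        "ts": pd.to_datetime(
            [
                "2022-01-01 00:00:00",
                "2022-01-01 00:00:01",
                "2022-01-01 00:00:00",
                "2022-01-01 00:00:04",
            ]
        ),
        "value": [1.0, 2.0, 3.0, 4.0],
    }
).set_index(["id", "ts"])

# Keep only ids 1 and 2; the helper looks at the rows that remain,
# not at the levels stored on the index.
subset = df.loc[[1, 2]]
out = complete_index(subset)
print(out)

# Regular one-second grid over the full data.
out_grid = complete_index(df, freq="s")
print(len(out_grid))
```

If you prefer to keep using `df.index.levels`, calling `df.index.remove_unused_levels()` first should have a similar effect.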
Answer 2
Score: 0
Another option is to combine pd.pivot_table() and pd.melt().
Here is my code:
import pandas as pd

# Create the sample dataset
df = pd.DataFrame(columns=["id", "ts", "value"])
df.loc[0, :] = [1, pd.Timestamp("2022-01-01 00:00:00"), 1]
df.loc[1, :] = [1, pd.Timestamp("2022-01-01 00:00:01"), 2]
df.loc[2, :] = [2, pd.Timestamp("2022-01-01 00:00:00"), 3]
# Rows were inserted one by one, so "value" is object dtype; make it numeric
df["value"] = df["value"].astype(float)
# Pivot the dataset: one row per id, one column per timestamp
df_pivot = pd.pivot_table(
    df,
    values='value',
    index='id',
    columns='ts'
).reset_index()
# Melt the pivoted dataset back to long format
df_result = pd.melt(
    df_pivot,
    id_vars='id',
    value_vars=list(df_pivot.columns[1:]),  # every column except "id"
    var_name='ts',
    value_name='value'
)
The result is:
   id                  ts  value
0   1 2022-01-01 00:00:00    1.0
1   2 2022-01-01 00:00:00    3.0
2   1 2022-01-01 00:00:01    2.0
3   2 2022-01-01 00:00:01    NaN
If you want to fill the missing values, you can use the fill_value parameter of pd.pivot_table(); see the documentation for details.
Hope this helps.
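As a rough illustration of the fill_value option, the sketch below repeats the pivot and melt steps with `fill_value=0`. The sample data is rebuilt in one constructor call with a float column, and 0 is only a placeholder fill (both are assumptions of this sketch); note that a filled 0 can no longer be told apart from a real 0, and that pd.pivot_table() averages duplicate (id, ts) rows by default.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2],
        "ts": pd.to_datetime(
            ["2022-01-01 00:00:00", "2022-01-01 00:00:01", "2022-01-01 00:00:00"]
        ),
        "value": [1.0, 2.0, 3.0],
    }
)

# Wide format: one row per id, one column per timestamp, gaps filled with 0.
wide = pd.pivot_table(df, values="value", index="id", columns="ts", fill_value=0)

# Back to long format: one row per (id, ts) pair.
long = wide.reset_index().melt(id_vars="id", var_name="ts", value_name="value")
print(long)
```

Every id now has a row for every timestamp, and the previously missing pair (id 2 at 00:00:01) carries the fill value.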