2023-04-11 07:42:10

Pandas: Enforcing consistent values for inner index across all outer index values

Question

I have a dataset indexed by entity_id and timestamp, but certain entity_ids do not have entries at all timestamps (not missing values, just no row). I'm trying to enforce consistent timestamps across the entity_ids prior to some complicated NaN handling and resampling. However, I cannot get reindex to create the rows I was expecting, and this leads to unexpected behavior downstream. My approach was:

import numpy as np
import pandas as pd
df = pd.DataFrame(columns=["id", "ts", "value"])
df.loc[0,:] = [1, pd.Timestamp("2022-01-01 00:00:00"), 1]
df.loc[1,:] = [1, pd.Timestamp("2022-01-01 00:00:01"), 2]
df.loc[2,:] = [2, pd.Timestamp("2022-01-01 00:00:00"), 3]
df = df.set_index(["id", "ts"])
df

# Grab all the timestamps
timestamps = df.index.get_level_values("ts").unique().sort_values()

# Perform the reindexing
df2 = df.reindex(timestamps, level=1, axis=0, fill_value=np.nan)

However, this leaves my dataframe unchanged, i.e., df2 still only has 3 rows. Maybe reindexing isn't the right approach here, but I thought it would work.

Is there a best practice for this sort of operation?

Thank you!

Answer 1

Score: 1

Use:

import pandas as pd

# added sample data
df = pd.DataFrame(columns=["id", "ts", "value"])
df.loc[0,:] = [1, pd.Timestamp("2022-01-01 00:00:00"), 1]
df.loc[1,:] = [1, pd.Timestamp("2022-01-01 00:00:01"), 2]
df.loc[2,:] = [2, pd.Timestamp("2022-01-01 00:00:00"), 3]
df.loc[3,:] = [3, pd.Timestamp("2022-01-01 00:00:04"), 4]
df = df.set_index(["id", "ts"])
print(df)
                       value
id ts                       
1  2022-01-01 00:00:00     1
   2022-01-01 00:00:01     2
2  2022-01-01 00:00:00     3
3  2022-01-01 00:00:04     4

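If you need to add the missing consecutive datetimes, build them with [`date_range`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html) from the minimum to the maximum timestamp, combine them with all ids using [`MultiIndex.from_product`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_product.html), and pass the result to [`DataFrame.reindex`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html):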

# One-second steps; recent pandas versions deprecate the uppercase 'S' alias in favor of 's'
dates = pd.date_range(df.index.levels[1].min(), df.index.levels[1].max(), freq='s')
mux = pd.MultiIndex.from_product([df.index.levels[0], dates], names=df.index.names)
out1 = df.reindex(mux)
print(out1)
                       value
id ts                       
1  2022-01-01 00:00:00     1
   2022-01-01 00:00:01     2
   2022-01-01 00:00:02   NaN
   2022-01-01 00:00:03   NaN
   2022-01-01 00:00:04   NaN
2  2022-01-01 00:00:00     3
   2022-01-01 00:00:01   NaN
   2022-01-01 00:00:02   NaN
   2022-01-01 00:00:03   NaN
   2022-01-01 00:00:04   NaN
3  2022-01-01 00:00:00   NaN
   2022-01-01 00:00:01   NaN
   2022-01-01 00:00:02   NaN
   2022-01-01 00:00:03   NaN
   2022-01-01 00:00:04     4
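
As a quick sanity check on the date_range approach, the sketch below rebuilds the same sample data and confirms that every id ends up with the same number of rows. The one-step DataFrame construction and the names `rows_per_id` and `missing_per_id` are my own assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Same sample data as in the answer, built in one step (an assumption for brevity).
df = pd.DataFrame(
    {
        "id": [1, 1, 2, 3],
        "ts": pd.to_datetime(
            [
                "2022-01-01 00:00:00",
                "2022-01-01 00:00:01",
                "2022-01-01 00:00:00",
                "2022-01-01 00:00:04",
            ]
        ),
        "value": [1.0, 2.0, 3.0, 4.0],
    }
).set_index(["id", "ts"])

# Full grid: every id combined with every second between the min and max timestamp.
ts = df.index.get_level_values("ts")
ids = df.index.get_level_values("id").unique()
dates = pd.date_range(ts.min(), ts.max(), freq="s")
mux = pd.MultiIndex.from_product([ids, dates], names=df.index.names)
out = df.reindex(mux)

# Every id should now have the same number of rows (one per timestamp).
rows_per_id = out.groupby(level="id").size()
assert rows_per_id.nunique() == 1

# Number of rows per id that were created by the reindex and are still NaN.
missing_per_id = out["value"].isna().groupby(level="id").sum()
print(rows_per_id)
print(missing_per_id)
```

If the NaN handling mentioned in the question comes later anyway, it is probably simplest to leave the new rows as NaN here rather than passing fill_value to reindex.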

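If you instead need to reindex by only the unique values that already occur in both levels of the MultiIndex, pass df.index.levels to MultiIndex.from_product and then call DataFrame.reindex: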

mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
out2 = df.reindex(mux)
print(out2)
                       value
id ts                       
1  2022-01-01 00:00:00     1
   2022-01-01 00:00:01     2
   2022-01-01 00:00:04   NaN
2  2022-01-01 00:00:00     3
   2022-01-01 00:00:01   NaN
   2022-01-01 00:00:04   NaN
3  2022-01-01 00:00:00   NaN
   2022-01-01 00:00:01   NaN
   2022-01-01 00:00:04     4
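
One caveat with df.index.levels: a MultiIndex keeps level values that are no longer used after rows are filtered out, so the product can contain ids or timestamps that no longer occur in the data. The sketch below is a hedged illustration on the same sample data; the filtering step and the names `subset`, `stale_levels`, `clean` and `full` are hypothetical, chosen only to show the effect.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2, 3],
        "ts": pd.to_datetime(
            [
                "2022-01-01 00:00:00",
                "2022-01-01 00:00:01",
                "2022-01-01 00:00:00",
                "2022-01-01 00:00:04",
            ]
        ),
        "value": [1.0, 2.0, 3.0, 4.0],
    }
).set_index(["id", "ts"])

# Drop id 3; its timestamp 00:00:04 no longer occurs in the data ...
subset = df[df.index.get_level_values("id") != 3]

# ... but .levels still lists it until unused levels are removed.
stale_levels = subset.index.levels
clean = subset.index.remove_unused_levels()

# Build the product from the cleaned levels so only values still present are used.
mux = pd.MultiIndex.from_product(clean.levels, names=clean.names)
full = subset.reindex(mux)
print(full)
```

Using get_level_values(...).unique() for each level, instead of levels, should avoid the same problem.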

Answer 2

Score: 0

One solution that came to mind is to use pd.pivot_table() and pd.melt().

Here is my code:

import pandas as pd

# Create the sample dataset
df = pd.DataFrame(columns=["id", "ts", "value"])
df.loc[0,:] = [1, pd.Timestamp("2022-01-01 00:00:00"), 1]
df.loc[1,:] = [1, pd.Timestamp("2022-01-01 00:00:01"), 2]
df.loc[2,:] = [2, pd.Timestamp("2022-01-01 00:00:00"), 3]

# Pivot the dataset: one row per id, one column per timestamp
df_pivot = pd.pivot_table(
    df,
    values='value',
    index='id',
    columns='ts'
).reset_index()

# Melt the pivoted dataset back to long form
df_result = pd.melt(
    df_pivot,
    id_vars='id',
    value_vars=list(df_pivot.columns[1:]),
    var_name='ts',
    value_name='value'
)

The result I got is below:

   id                  ts  value
0   1 2022-01-01 00:00:00    1.0
1   2 2022-01-01 00:00:00    3.0
2   1 2022-01-01 00:00:01    2.0
3   2 2022-01-01 00:00:01    NaN

If you want, you can fill the missing values with the fill_value parameter of pd.pivot_table(); see the documentation for details.
Hope this helps.
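
To illustrate the fill_value remark, here is a minimal sketch under a few assumptions: the sample data is built in one step with a float value column, 0 is an arbitrary fill value, and the names `wide` and `long_df` are my own. Note that pivot_table aggregates duplicate (id, ts) pairs (mean by default), so this route only preserves the original values when each pair occurs at most once.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2],
        "ts": pd.to_datetime(
            ["2022-01-01 00:00:00", "2022-01-01 00:00:01", "2022-01-01 00:00:00"]
        ),
        "value": [1.0, 2.0, 3.0],
    }
)

# Wide form: one row per id, one column per timestamp; gaps are filled with 0.
wide = pd.pivot_table(df, values="value", index="id", columns="ts", fill_value=0)

# Back to long form: every id now has a row for every timestamp.
long_df = wide.reset_index().melt(id_vars="id", var_name="ts", value_name="value")
print(long_df)
```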
