英文:
How to replace level 2 in a multiIndex dataframe with values from a regular dataframe?
问题
我正在编写一个用于处理工作中客户数据分析的算法,我正在按用户ID作为我的第一级,然后按月份作为我的第二级来对数据进行分组。
然后,我对数据进行逐个用户的处理,其中包括一个月度时间序列。
我下面的代码完全做到了我需要的,除了最后一步。
如何将原始数据中的数据与输出中的处理后时间序列数据合并?
tempDF1 = pd.read_csv('data.csv', index_col=[0,1], parse_dates=[1], thousands=',')
tempDF1["Average"] = 0
tempDF1["Score"] = 0
for id, df in tempDF1.groupby(level=0):
for date in df.loc[id].index:
df.loc[(id,date),"Average"] = df.loc[(id,date)].Purchased/df.loc[(id,date)].Count
df.loc[(id,date),"Score"] = df.loc[(id,date)].Count/10*log(df.loc[(id,date)].Average, 10)
try:
list(df.loc[(id,"2021-01-01",),:])
except:
df.loc[(id, "2021-01-01",),:] = 0
try:
list(df.loc[(id,"2023-06-01",),:])
except:
df.loc[(id, "2023-06-01",),:] = 0
resampleDF = df.loc[id].resample('M', closed="left").mean().fillna(0)
print(resampleDF)
tempDF1.loc[id].update(resampleDF)
print(tempDF1)
以下是上述代码的示例输出。resampleDF 是用于 ID 1785 的,我正在尝试替换整个第二级索引,列中的数据,以及所有其他内容。我需要将 resampleDF 放入 tempDF1 中,替换掉 1785(以及其他的,但应该由上面的循环完成)。
resampleDF
月份 | 数量 | 购买 | 平均值 | 分数 |
---|---|---|---|---|
2021-01-31 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-02-28 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-03-31 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-04-30 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-05-31 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-06-30 | 1.0 | 9999.0 | 9999.0 | 0.399996 |
2021-07-31 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-08-31 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-09-30 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-10-31 | 2.0 | 20.0 | 10.0 | 0.200000 |
2021-11-30 | 0.0 | 0.0 | 0.0 | 0.000000 |
tempDF1
用户ID | 月份 | 数量 | 购买 | 平均值 | 分数 |
---|---|---|---|---|---|
1785 | 2021-06-01 | 1 | 9999 | 9999.0 | 0.399996 |
2021-10-01 | 2 | 20 | 10.0 | 0.200000 | |
1857 | 2021-06-01 | 1 | 100 | 100.0 | 0.333333 |
1953 | 2021-03-01 | 5 | 250 | 50.0 | 0.398639 |
2023 | 2021-09-01 | 5 | 1100 | 220.0 | 0.599340 |
... | ... | ... | ... | ... | |
18149 | 2023-03-01 | 1 | 300 | 300.0 | 0.477121 |
18169 | 2023-03-01 | 2 | 40 | 20.0 | 0.301030 |
18177 | 2023-03-01 | 2 | 1100 | 550.0 | 0.740363 |
18210 | 2023-03-01 | 1 | 10 | 10.0 | 0.301030 |
18215 | 2023-03-01 | 2 | 110 | 55.0 | 0.397940 |
英文:
I am writing an algorithm for processing customers at work for some data analysis, and I am grouping the data by user ID as my level 1, then by month for my level 2.
I am then doing some processing work on the data user by user, which involves a monthly time series.
The code I have below does exactly what I need it to do, save for the final step.
How can I overload the data in my original with the processed time series data in the output?
tempDF1 = pd.read_csv('data.csv', index_col=[0,1], parse_dates=[1], thousands=',')
tempDF1["Average"] = 0
tempDF1["Score"] = 0
for id, df in tempDF1.groupby(level=0):
for date in df.loc[id].index:
df.loc[(id,date),"Average"] = df.loc[(id,date)].Purchased/df.loc[(id,date)].Count
df.loc[(id,date),"Score"] = df.loc[(id,date)].Count/10*log(df.loc[(id,date)].Average, 10)
try:
list(df.loc[(id,"2021-01-01",),:])
except:
df.loc[(id, "2021-01-01",),:] = 0
try:
list(df.loc[(id,"2023-06-01",),:])
except:
df.loc[(id, "2023-06-01",),:] = 0
resampleDF = df.loc[id].resample('M', closed="left").mean().fillna(0)
print(resampleDF)
tempDF1.loc[id].replace(resampleDF, inplace=True)
print(tempDF1)
Below is the sample output from the code above. The resampleDF is for ID 1785 and I am trying to replace the whole level; level 2 indexing, the data in the columns, everything wholesale. I need to put resampleDF into tempDF1 in place of what is there for 1785 (and the rest, but that should be accomplished by the above loop.
resampleDF
Month | Count | Purchased | Average | Score |
---|---|---|---|---|
2021-01-31 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-02-28 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-03-31 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-04-30 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-05-31 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-06-30 | 1.0 | 9999.0 | 9999.0 | 0.399996 |
2021-07-31 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-08-31 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-09-30 | 0.0 | 0.0 | 0.0 | 0.000000 |
2021-10-31 | 2.0 | 20.0 | 10.0 | 0.200000 |
2021-11-30 | 0.0 | 0.0 | 0.0 | 0.000000 |
tempDF1
user_id | Month | Count | Purchased | Average | Score |
---|---|---|---|---|---|
1785 | 2021-06-01 | 1 | 9999 | 0 | 0 |
2021-10-01 | 2 | 20 | 0 | 0 | |
1857 | 2021-06-01 | 1 | 100 | 0 | 0 |
1953 | 2021-03-01 | 5 | 250 | 0 | 0 |
2023 | 2021-09-01 | 5 | 1100 | 0 | 0 |
... | ... | ... | ... | ... | |
18149 | 2023-03-01 | 1 | 300 | 0 | 0 |
18169 | 2023-03-01 | 2 | 40 | 0 | 0 |
18177 | 2023-03-01 | 2 | 1100 | 0 | 0 |
18210 | 2023-03-01 | 1 | 10 | 0 | 0 |
18215 | 2023-03-01 | 2 | 110 | 0 | 0 |
答案1
得分: 1
以下是翻译好的代码部分:
import pandas as pd
tempDF1 = pd.DataFrame(
{
"user_id": [1785, 1785, 1857],
"Month": ["2021-06-01", "2021-10-01", "2021-06-01"],
"Count": [1, 2, 1],
"Purchased": [9999, 20, 100],
"Average": [0, 0, 0],
"Score": [0, 0, 0],
},
).set_index(["user_id", "Month"])
print(tempDF1)
# 输出
Count Purchased Average Score
user_id Month
1785 2021-06-01 1 9999 0 0
2021-10-01 2 20 0 0
1857 2021-06-01 1 100 0 0
resampleDF = pd.DataFrame(
{
"Month": [
"2021-01-31",
"2021-02-28",
"2021-03-31",
"2021-04-30",
"2021-05-31",
"2021-06-30",
"2021-07-31",
"2021-08-31",
"2021-09-30",
"2021-10-31",
"2021-11-30",
],
"Count": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0],
"Purchased": [0.0, 0.0, 0.0, 0.0, 0.0, 9999.0, 0.0, 0.0, 0.0, 20.0, 0.0],
"Average": [0.0, 0.0, 0.0, 0.0, 0.0, 9999.0, 0.0, 0.0, 0.0, 10.0, 0.0],
"Score": [0.0, 0.0, 0.0, 0.0, 0.0, 0.399996, 0.0, 0.0, 0.0, 0.2, 0.0],
}
)
以下是另一种使用 Pandas 的 drop、assign 和 concat 方法的代码:
tempDF1 = pd.concat(
[
resampleDF.assign(user_id=1785).set_index(["user_id", "Month"]),
tempDF1.drop(index=1785),
]
)
print(tempDF1)
# 输出
Count Purchased Average Score
user_id Month
1785 2021-01-31 0.0 0.0 0.0 0.000000
2021-02-28 0.0 0.0 0.0 0.000000
2021-03-31 0.0 0.0 0.0 0.000000
2021-04-30 0.0 0.0 0.0 0.000000
2021-05-31 0.0 0.0 0.0 0.000000
2021-06-30 1.0 9999.0 9999.0 0.399996
2021-07-31 0.0 0.0 0.0 0.000000
2021-08-31 0.0 0.0 0.0 0.000000
2021-09-30 0.0 0.0 0.0 0.000000
2021-10-31 2.0 20.0 10.0 0.200000
2021-11-30 0.0 0.0 0.0 0.000000
1857 2021-06-01 1.0 100.0 0.0 0.000000
英文:
With the dataframes you provided:
import pandas as pd
tempDF1 = pd.DataFrame(
{
"user_id": [1785, 1785, 1857],
"Month": ["2021-06-01", "2021-10-01", "2021-06-01"],
"Count": [1, 2, 1],
"Purchased": [9999, 20, 100],
"Average": [0, 0, 0],
"Score": [0, 0, 0],
},
).set_index(["user_id", "Month"])
print(tempDF1)
# Output
Count Purchased Average Score
user_id Month
1785 2021-06-01 1 9999 0 0
2021-10-01 2 20 0 0
1857 2021-06-01 1 100 0 0
resampleDF = pd.DataFrame(
{
"Month": [
"2021-01-31",
"2021-02-28",
"2021-03-31",
"2021-04-30",
"2021-05-31",
"2021-06-30",
"2021-07-31",
"2021-08-31",
"2021-09-30",
"2021-10-31",
"2021-11-30",
],
"Count": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0],
"Purchased": [0.0, 0.0, 0.0, 0.0, 0.0, 9999.0, 0.0, 0.0, 0.0, 20.0, 0.0],
"Average": [0.0, 0.0, 0.0, 0.0, 0.0, 9999.0, 0.0, 0.0, 0.0, 10.0, 0.0],
"Score": [0.0, 0.0, 0.0, 0.0, 0.0, 0.399996, 0.0, 0.0, 0.0, 0.2, 0.0],
}
)
Here is one way to do it with Pandas drop, assign, and concat:
tempDF1 = pd.concat(
[
resampleDF.assign(user_id=1785).set_index(["user_id", "Month"]),
tempDF1.drop(index=1785),
]
)
print(tempDF1)
# Output
Count Purchased Average Score
user_id Month
1785 2021-01-31 0.0 0.0 0.0 0.000000
2021-02-28 0.0 0.0 0.0 0.000000
2021-03-31 0.0 0.0 0.0 0.000000
2021-04-30 0.0 0.0 0.0 0.000000
2021-05-31 0.0 0.0 0.0 0.000000
2021-06-30 1.0 9999.0 9999.0 0.399996
2021-07-31 0.0 0.0 0.0 0.000000
2021-08-31 0.0 0.0 0.0 0.000000
2021-09-30 0.0 0.0 0.0 0.000000
2021-10-31 2.0 20.0 10.0 0.200000
2021-11-30 0.0 0.0 0.0 0.000000
1857 2021-06-01 1.0 100.0 0.0 0.000000
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论