Pandas itertuples - fill in a matrix with a value based on an event
Question
I am trying to build a matrix that holds, for each row, the date of the first occurrence of each event after that row's stamp date.
Sample DataFrame:
import pandas as pd
id = [1,1,1,1,1,2,2,2,2,3,3,3,3,4,5,6,7,7,7,8,8,8,9,9,9,10,10,10]
fact = ["IC", "AC","IC","AC","IC","AC", "CC", "CD","IC","CC", "CD","IC","AC", "CD","IC","AC", "CC", "CD","IC","AC", "CC", "CD","IC","AC","IC","IC","AC","IC"]
stamp = ['1979-02-22','1973-11-06','1986-03-12','1986-01-24', '2012-05-22', '2009-01-18', '1992-01-14', '1985-06-05','2001-07-05','2008-11-19','2000-10-13','2002-04-18','1987-08-17','1977-04-09','1984-03-22','1994-08-08','2005-07-09','1982-05-03','2016-01-30','2019-03-10','1981-03-23','1979-07-21','2023-01-14','2018-06-23','1995-08-27','2020-11-08','2014-02-17','1977-09-08']
s = {"ID": id, "fact": fact, "stamp": stamp}
data = pd.DataFrame(data = s)
data.sort_values(by = "stamp", inplace= True)
What the DataFrame looks like: [image not included]
Expected output: [image not included]
The code I have so far:
facts = data.fact.unique()
structure = {'ID': [], 'stamp':[], 'fact': [], 'AC':[], 'CD':[], 'IC':[], 'CC':[]}
for row in data.itertuples():
    structure["ID"].append(getattr(row, 'ID'))
    structure["stamp"].append(getattr(row, 'stamp'))
    structure["fact"].append(getattr(row, 'fact'))
    for fact in facts:
        if getattr(row, 'fact') == fact:
            structure[fact].append(getattr(row, 'stamp'))
        else:
            structure[fact].append('na')
This produces a table (image not included) that is incorrect. Any help is appreciated, and thank you in advance.
Answer 1
Score: 1
Use [`merge_asof`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html) with `allow_exact_matches=False` so that a row is not matched to itself (the same `on` value), then pivot with [`DataFrame.pivot_table`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html) using `aggfunc='first'`, and finally add the result to the original DataFrame with [`DataFrame.join`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html):
```python
# merge_asof needs real datetimes and both sides sorted by the key
data['stamp'] = pd.to_datetime(data['stamp'])
df1 = data.sort_values('stamp')

# for each row, find the next later row (strictly after) within the same ID
df = pd.merge_asof(df1.rename(columns={'stamp':'stamp1'}),
                   df1,
                   left_on='stamp1',
                   right_on='stamp',
                   allow_exact_matches=False,
                   by=['ID'],
                   direction='forward',
                   suffixes=('_','')).drop(['stamp1','fact_'],axis=1)

# one column per fact, first matched stamp per ID, joined back on ID
df1 = data.join(df.pivot_table(index='ID',
                               columns='fact',
                               values='stamp',
                               aggfunc='first'), on=['ID'])
print (df1)
```
Output:
```
    ID fact      stamp         AC         CC         CD         IC
1    1   AC 1973-11-06 1986-01-24        NaT        NaT 1979-02-22
13   4   CD 1977-04-09        NaT        NaT        NaT        NaT
27  10   IC 1977-09-08 2014-02-17        NaT        NaT 2020-11-08
0    1   IC 1979-02-22 1986-01-24        NaT        NaT 1979-02-22
21   8   CD 1979-07-21 2019-03-10 1981-03-23        NaT        NaT
20   8   CC 1981-03-23 2019-03-10 1981-03-23        NaT        NaT
17   7   CD 1982-05-03        NaT 2005-07-09        NaT 2016-01-30
14   5   IC 1984-03-22        NaT        NaT        NaT        NaT
7    2   CD 1985-06-05 2009-01-18 1992-01-14        NaT 2001-07-05
3    1   AC 1986-01-24 1986-01-24        NaT        NaT 1979-02-22
2    1   IC 1986-03-12 1986-01-24        NaT        NaT 1979-02-22
12   3   AC 1987-08-17        NaT 2008-11-19 2000-10-13 2002-04-18
6    2   CC 1992-01-14 2009-01-18 1992-01-14        NaT 2001-07-05
15   6   AC 1994-08-08        NaT        NaT        NaT        NaT
24   9   IC 1995-08-27 2018-06-23        NaT        NaT 2023-01-14
10   3   CD 2000-10-13        NaT 2008-11-19 2000-10-13 2002-04-18
8    2   IC 2001-07-05 2009-01-18 1992-01-14        NaT 2001-07-05
11   3   IC 2002-04-18        NaT 2008-11-19 2000-10-13 2002-04-18
16   7   CC 2005-07-09        NaT 2005-07-09        NaT 2016-01-30
9    3   CC 2008-11-19        NaT 2008-11-19 2000-10-13 2002-04-18
5    2   AC 2009-01-18 2009-01-18 1992-01-14        NaT 2001-07-05
4    1   IC 2012-05-22 1986-01-24        NaT        NaT 1979-02-22
26  10   AC 2014-02-17 2014-02-17        NaT        NaT 2020-11-08
18   7   IC 2016-01-30        NaT 2005-07-09        NaT 2016-01-30
23   9   AC 2018-06-23 2018-06-23        NaT        NaT 2023-01-14
19   8   AC 2019-03-10 2019-03-10 1981-03-23        NaT        NaT
25  10   IC 2020-11-08 2014-02-17        NaT        NaT 2020-11-08
22   9   IC 2023-01-14 2018-06-23        NaT        NaT 2023-01-14
```
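To see what `direction='forward'` and `allow_exact_matches=False` contribute to the `merge_asof` call in this answer, here is a minimal sketch on a made-up three-row frame. The `toy` data and the `next_fact` / `next_stamp` column names are assumptions chosen for illustration; they are not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data: a single ID with three events, already sorted by stamp.
toy = pd.DataFrame({
    "ID": [1, 1, 1],
    "fact": ["AC", "IC", "AC"],
    "stamp": pd.to_datetime(["2000-01-01", "2001-01-01", "2002-01-01"]),
})

# Left side: the row we are "standing on". Right side: candidate later events.
left = toy.rename(columns={"stamp": "stamp1"})
right = toy.rename(columns={"fact": "next_fact", "stamp": "next_stamp"})

# Forward search, exact matches excluded: each row is paired with the
# nearest strictly later event of the same ID (or NaN/NaT if none exists).
nxt = pd.merge_asof(
    left, right,
    left_on="stamp1", right_on="next_stamp",
    by="ID",
    direction="forward",
    allow_exact_matches=False,
)

# Same call with the default allow_exact_matches=True: each row matches itself.
same = pd.merge_asof(
    left, right,
    left_on="stamp1", right_on="next_stamp",
    by="ID",
    direction="forward",
)

print(nxt)
print(same)
```

With exact matches allowed, every row simply finds itself, which is why the answer disables them. Also note that the pivot in the answer is indexed by `ID`, so all rows sharing an `ID` receive the same AC/CC/CD/IC values; if strictly per-row values are needed, the result may have to be filtered further against each row's own stamp.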