英文:
Python - generate columns with values based on a condition
问题
我有一些数据,如图像左侧所示,包含列组、日期、开盘价、收盘价和退出。
我想要生成另外两列(如右侧所示),分别是date_open和date_close。
逻辑是对于每个组,date_open应该是数据中的第一个条目(对于该组),而对于date_close,仅当退出为1时,它才应该是最后一个条目,否则没有退出日期(0或NaN)。
我尝试使用for循环、索引和enumerate,但无法获得所需的输出。有没有办法获得所需的输出?
英文:
I have some data like shown in the left side of the image with columns like group, date, open, close, exit.
What I want to do is generate 2 more columns (like shown on the right) with date_open and date_close.
The logic is that for each group, the date_open should be the first entry in the data (for that group) and, for date_close it should be the last entry only when exit is 1 else there is no exit date (0 or NaN).
I tried it using for loop and index and enumerate but could not acheive the output. Is there any way to get the desired output?
group date open close exit
A Jan-22 2 0 0
A Feb-22 2 0 0
A Mar-22 2 0 0
A Apr-22 1 1 0
A May-22 1 1 0
A Jun-22 0 2 1
B Mar-23 3 0 0
B Apr-23 3 0 0
B May-23 3 0 0
B Jun-23 3 1 0
C Sep-21 1 0 0
C Oct-21 1 0 0
C Nov-21 1 0 0
C Dec-21 1 0 0
C Jan-22 0 1 1
答案1
得分: 2
以下是您要翻译的内容:
In [29]: df = pd.read_csv('data.tsv', delim_whitespace=' ')
In [30]: df
Out[30]:
group date open close exit
0 A Jan-22 2 0 0
1 A Feb-22 2 0 0
2 A Mar-22 2 0 0
3 A Apr-22 1 1 0
4 A May-22 1 1 0
5 A Jun-22 0 2 1
6 B Mar-23 3 0 0
7 B Apr-23 3 0 0
8 B May-23 3 0 0
9 B Jun-23 3 1 0
10 C Sep-21 1 0 0
11 C Oct-21 1 0 0
12 C Nov-21 1 0 0
13 C Dec-21 1 0 0
14 C Jan-22 0 1 1
In [31]: df['date'] = df.date.apply(lambda s: dt.datetime.strptime(s, "%b-%y"))
In [32]: df
Out[32]:
group date open close exit
0 A 2022-01-01 2 0 0
1 A 2022-02-01 2 0 0
2 A 2022-03-01 2 0 0
3 A 2022-04-01 1 1 0
4 A 2022-05-01 1 1 0
5 A 2022-06-01 0 2 1
6 B 2023-03-01 3 0 0
7 B 2023-04-01 3 0 0
8 B 2023-05-01 3 0 0
9 B 2023-06-01 3 1 0
10 C 2021-09-01 1 0 0
11 C 2021-10-01 1 0 0
12 C 2021-11-01 1 0 0
13 C 2021-12-01 1 0 0
14 C 2022-01-01 0 1 1
In [33]: opens = df[['group', 'date']].groupby(by='group').min()
In [34]: opens
Out[34]:
date
group
A 2022-01-01
B 2023-03-01
C 2021-09-01
In [35]: opens.rename(columns={'date': 'date_open'})
Out[35]:
date_open
group
A 2022-01-01
B 2023-03-01
C 2021-09-01
In [36]: opens = opens.rename(columns={'date': 'date_open'})
In [37]: opens
Out[37]:
date_open
group
A 2022-01-01
B 2023-03-01
C 2021-09-01
In [38]: closes = df[df.exit==1][['group', 'date']].groupby(by='group').max()
In [39]: closes = closes.rename(columns={'date': 'date_close'})
In [40]: closes
Out[40]:
date_close
group
A 2022-06-01
C 2022-01-01
In [41]: opens.join(df.set_index('group')).join(closes)
Out[41]:
date_open date open close exit date_close
group
A 2022-01-01 2022-01-01 2 0 0 2022-06-01
A 2022-01-01 2022-02-01 2 0 0 2022-06-01
A 2022-01-01 2022-03-01 2 0 0 2022-06-01
A 2022-01-01 2022-04-01 1 1 0 2022-06-01
A 2022-01-01 2022-05-01 1 1 0 2022-06-01
A 2022-01-01 2022-06-01 0 2 1 2022-06-01
B 2023-03-01 2023-03-01 3 0 0 NaT
B 2023-03-01 2023-04-01 3 0 0 NaT
B 2023-03-01 2023-05-01 3 0 0 NaT
B 2023-03-01 2023-06-01 3 1 0 NaT
C 2021-09-01 2021-09-01 1 0 0 2022-01-01
C 2021-09-01 2021-10-01 1 0 0 2022-01-01
C 2021-09-01 2021-11-01 1 0 0 2022-01-01
C 2021-09-01 2021-12-01 1 0 0 2022-01-01
C 2021-09-01 2022-01-01 0 1 1 2022-01-01
<details>
<summary>英文:</summary>
In [29]: df = pd.read_csv('data.tsv', delim_whitespace=' ')
In [30]: df
Out[30]:
group date open close exit
0 A Jan-22 2 0 0
1 A Feb-22 2 0 0
2 A Mar-22 2 0 0
3 A Apr-22 1 1 0
4 A May-22 1 1 0
5 A Jun-22 0 2 1
6 B Mar-23 3 0 0
7 B Apr-23 3 0 0
8 B May-23 3 0 0
9 B Jun-23 3 1 0
10 C Sep-21 1 0 0
11 C Oct-21 1 0 0
12 C Nov-21 1 0 0
13 C Dec-21 1 0 0
14 C Jan-22 0 1 1
In [31]: df['date'] = df.date.apply(lambda s: dt.datetime.strptime(s, "%b-%y"))
In [32]: df
Out[32]:
group date open close exit
0 A 2022-01-01 2 0 0
1 A 2022-02-01 2 0 0
2 A 2022-03-01 2 0 0
3 A 2022-04-01 1 1 0
4 A 2022-05-01 1 1 0
5 A 2022-06-01 0 2 1
6 B 2023-03-01 3 0 0
7 B 2023-04-01 3 0 0
8 B 2023-05-01 3 0 0
9 B 2023-06-01 3 1 0
10 C 2021-09-01 1 0 0
11 C 2021-10-01 1 0 0
12 C 2021-11-01 1 0 0
13 C 2021-12-01 1 0 0
14 C 2022-01-01 0 1 1
In [33]: opens = df[['group', 'date']].groupby(by='group').min()
In [34]: opens
Out[34]:
date
group
A 2022-01-01
B 2023-03-01
C 2021-09-01
In [35]: opens.rename(columns={'date': 'date_open'})
Out[35]:
date_open
group
A 2022-01-01
B 2023-03-01
C 2021-09-01
In [36]: opens = opens.rename(columns={'date': 'date_open'})
In [37]: opens
Out[37]:
date_open
group
A 2022-01-01
B 2023-03-01
C 2021-09-01
In [38]: closes = df[df.exit==1][['group', 'date']].groupby(by='group').max()
In [39]: closes = closes.rename(columns={'date': 'date_close'})
In [40]: closes
Out[40]:
date_close
group
A 2022-06-01
C 2022-01-01
In [41]: opens.join(df.set_index('group')).join(closes)
Out[41]:
date_open date open close exit date_close
group
A 2022-01-01 2022-01-01 2 0 0 2022-06-01
A 2022-01-01 2022-02-01 2 0 0 2022-06-01
A 2022-01-01 2022-03-01 2 0 0 2022-06-01
A 2022-01-01 2022-04-01 1 1 0 2022-06-01
A 2022-01-01 2022-05-01 1 1 0 2022-06-01
A 2022-01-01 2022-06-01 0 2 1 2022-06-01
B 2023-03-01 2023-03-01 3 0 0 NaT
B 2023-03-01 2023-04-01 3 0 0 NaT
B 2023-03-01 2023-05-01 3 0 0 NaT
B 2023-03-01 2023-06-01 3 1 0 NaT
C 2021-09-01 2021-09-01 1 0 0 2022-01-01
C 2021-09-01 2021-10-01 1 0 0 2022-01-01
C 2021-09-01 2021-11-01 1 0 0 2022-01-01
C 2021-09-01 2021-12-01 1 0 0 2022-01-01
C 2021-09-01 2022-01-01 0 1 1 2022-01-01
</details>
# 答案2
**得分**: 1
使用 [`groupby.transform`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html) 的一种选项:
```python
df['date_open'] = df.groupby('group')['date'].transform('first')
df['date_close'] = (df['date'].where(df['exit'].gt(0), '')
.groupby(df['group']).transform('last')
#.replace('', float('nan')) # 可选项
)
输出:
group date open close exit date_open date_close
0 A Jan-22 2 0 0 Jan-22 Jun-22
1 A Feb-22 2 0 0 Jan-22 Jun-22
2 A Mar-22 2 0 0 Jan-22 Jun-22
3 A Apr-22 1 1 0 Jan-22 Jun-22
4 A May-22 1 1 0 Jan-22 Jun-22
5 A Jun-22 0 2 1 Jan-22 Jun-22
6 B Mar-23 3 0 0 Mar-23
7 B Apr-23 3 0 0 Mar-23
8 B May-23 3 0 0 Mar-23
9 B Jun-23 3 1 0 Mar-23
10 C Sep-21 1 0 0 Sep-21 Jan-22
11 C Oct-21 1 0 0 Sep-21 Jan-22
12 C Nov-21 1 0 0 Sep-21 Jan-22
13 C Dec-21 1 0 0 Sep-21 Jan-22
14 C Jan-22 0 1 1 Sep-21 Jan-22
英文:
One option using groupby.transform
:
df['date_open'] = df.groupby('group')['date'].transform('first')
df['date_close'] = (df['date'].where(df['exit'].gt(0), '')
.groupby(df['group']).transform('last')
#.replace('', float('nan')) # optional
)
Output:
group date open close exit date_open date_close
0 A Jan-22 2 0 0 Jan-22 Jun-22
1 A Feb-22 2 0 0 Jan-22 Jun-22
2 A Mar-22 2 0 0 Jan-22 Jun-22
3 A Apr-22 1 1 0 Jan-22 Jun-22
4 A May-22 1 1 0 Jan-22 Jun-22
5 A Jun-22 0 2 1 Jan-22 Jun-22
6 B Mar-23 3 0 0 Mar-23
7 B Apr-23 3 0 0 Mar-23
8 B May-23 3 0 0 Mar-23
9 B Jun-23 3 1 0 Mar-23
10 C Sep-21 1 0 0 Sep-21 Jan-22
11 C Oct-21 1 0 0 Sep-21 Jan-22
12 C Nov-21 1 0 0 Sep-21 Jan-22
13 C Dec-21 1 0 0 Sep-21 Jan-22
14 C Jan-22 0 1 1 Sep-21 Jan-22
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论