Python – 根据条件生成具有值的列

huangapple go评论88阅读模式
英文:

Python - generate columns with values based on a condition

问题

我有一些数据,如图像左侧所示,包含列组、日期、开盘价、收盘价和退出。

我想要生成另外两列(如右侧所示),分别是date_open和date_close。

逻辑是对于每个组,date_open应该是数据中的第一个条目(对于该组),而对于date_close,仅当退出为1时,它才应该是最后一个条目,否则没有退出日期(0或NaN)。

我尝试使用for循环、索引和enumerate,但无法获得所需的输出。有没有办法获得所需的输出?

英文:

I have some data like shown in the left side of the image with columns like group, date, open, close, exit.

What I want to do is generate 2 more columns (like shown on the right) with date_open and date_close.

The logic is that for each group, the date_open should be the first entry in the data (for that group) and, for date_close it should be the last entry only when exit is 1 else there is no exit date (0 or NaN).

I tried it using for loop and index and enumerate but could not acheive the output. Is there any way to get the desired output?

Python – 根据条件生成具有值的列

group	date	open	close	exit
A	   Jan-22	 2    	 0	     0 
A	   Feb-22	 2	     0	     0
A	   Mar-22	 2	     0	     0
A	   Apr-22	 1	     1	     0
A	   May-22	 1	     1	     0
A	   Jun-22	 0	     2	     1
B	   Mar-23	 3	     0	     0
B	   Apr-23	 3	     0	     0
B	   May-23	 3	     0	     0
B	   Jun-23	 3	     1	     0
C	   Sep-21	 1	     0	     0
C	   Oct-21	 1	     0	     0
C	   Nov-21	 1	     0	     0
C	   Dec-21	 1	     0	     0
C	   Jan-22	 0	     1	     1

答案1

得分: 2

以下是您要翻译的内容:

In [29]: df = pd.read_csv('data.tsv', delim_whitespace=' ')

In [30]: df
Out[30]:
   group    date  open  close  exit
0      A  Jan-22     2      0     0
1      A  Feb-22     2      0     0
2      A  Mar-22     2      0     0
3      A  Apr-22     1      1     0
4      A  May-22     1      1     0
5      A  Jun-22     0      2     1
6      B  Mar-23     3      0     0
7      B  Apr-23     3      0     0
8      B  May-23     3      0     0
9      B  Jun-23     3      1     0
10     C  Sep-21     1      0     0
11     C  Oct-21     1      0     0
12     C  Nov-21     1      0     0
13     C  Dec-21     1      0     0
14     C  Jan-22     0      1     1

In [31]: df['date'] = df.date.apply(lambda s: dt.datetime.strptime(s, "%b-%y"))

In [32]: df
Out[32]:
   group       date  open  close  exit
0      A 2022-01-01     2      0     0
1      A 2022-02-01     2      0     0
2      A 2022-03-01     2      0     0
3      A 2022-04-01     1      1     0
4      A 2022-05-01     1      1     0
5      A 2022-06-01     0      2     1
6      B 2023-03-01     3      0     0
7      B 2023-04-01     3      0     0
8      B 2023-05-01     3      0     0
9      B 2023-06-01     3      1     0
10     C 2021-09-01     1      0     0
11     C 2021-10-01     1      0     0
12     C 2021-11-01     1      0     0
13     C 2021-12-01     1      0     0
14     C 2022-01-01     0      1     1

In [33]: opens = df[['group', 'date']].groupby(by='group').min()

In [34]: opens
Out[34]:
            date
group
A     2022-01-01
B     2023-03-01
C     2021-09-01

In [35]: opens.rename(columns={'date': 'date_open'})
Out[35]:
       date_open
group
A     2022-01-01
B     2023-03-01
C     2021-09-01

In [36]: opens = opens.rename(columns={'date': 'date_open'})

In [37]: opens
Out[37]:
       date_open
group
A     2022-01-01
B     2023-03-01
C     2021-09-01

In [38]: closes = df[df.exit==1][['group', 'date']].groupby(by='group').max()

In [39]: closes = closes.rename(columns={'date': 'date_close'})

In [40]: closes
Out[40]:
      date_close
group
A     2022-06-01
C     2022-01-01

In [41]: opens.join(df.set_index('group')).join(closes)
Out[41]:
       date_open       date  open  close  exit date_close
group
A     2022-01-01 2022-01-01     2      0     0 2022-06-01
A     2022-01-01 2022-02-01     2      0     0 2022-06-01
A     2022-01-01 2022-03-01     2      0     0 2022-06-01
A     2022-01-01 2022-04-01     1      1     0 2022-06-01
A     2022-01-01 2022-05-01     1      1     0 2022-06-01
A     2022-01-01 2022-06-01     0      2     1 2022-06-01
B     2023-03-01 2023-03-01     3      0     0        NaT
B     2023-03-01 2023-04-01     3      0     0        NaT
B     2023-03-01 2023-05-01     3      0     0        NaT
B     2023-03-01 2023-06-01     3      1     0        NaT
C     2021-09-01 2021-09-01     1      0     0 2022-01-01
C     2021-09-01 2021-10-01     1      0     0 2022-01-01
C     2021-09-01 2021-11-01     1      0     0 2022-01-01
C     2021-09-01 2021-12-01     1      0     0 2022-01-01
C     2021-09-01 2022-01-01     0      1     1 2022-01-01

<details>
<summary>英文:</summary>

In [29]: df = pd.read_csv('data.tsv', delim_whitespace=' ')

In [30]: df
Out[30]:
group date open close exit
0 A Jan-22 2 0 0
1 A Feb-22 2 0 0
2 A Mar-22 2 0 0
3 A Apr-22 1 1 0
4 A May-22 1 1 0
5 A Jun-22 0 2 1
6 B Mar-23 3 0 0
7 B Apr-23 3 0 0
8 B May-23 3 0 0
9 B Jun-23 3 1 0
10 C Sep-21 1 0 0
11 C Oct-21 1 0 0
12 C Nov-21 1 0 0
13 C Dec-21 1 0 0
14 C Jan-22 0 1 1

In [31]: df['date'] = df.date.apply(lambda s: dt.datetime.strptime(s, "%b-%y"))

In [32]: df
Out[32]:
group date open close exit
0 A 2022-01-01 2 0 0
1 A 2022-02-01 2 0 0
2 A 2022-03-01 2 0 0
3 A 2022-04-01 1 1 0
4 A 2022-05-01 1 1 0
5 A 2022-06-01 0 2 1
6 B 2023-03-01 3 0 0
7 B 2023-04-01 3 0 0
8 B 2023-05-01 3 0 0
9 B 2023-06-01 3 1 0
10 C 2021-09-01 1 0 0
11 C 2021-10-01 1 0 0
12 C 2021-11-01 1 0 0
13 C 2021-12-01 1 0 0
14 C 2022-01-01 0 1 1

In [33]: opens = df[['group', 'date']].groupby(by='group').min()

In [34]: opens
Out[34]:
date
group
A 2022-01-01
B 2023-03-01
C 2021-09-01

In [35]: opens.rename(columns={'date': 'date_open'})
Out[35]:
date_open
group
A 2022-01-01
B 2023-03-01
C 2021-09-01

In [36]: opens = opens.rename(columns={'date': 'date_open'})

In [37]: opens
Out[37]:
date_open
group
A 2022-01-01
B 2023-03-01
C 2021-09-01

In [38]: closes = df[df.exit==1][['group', 'date']].groupby(by='group').max()

In [39]: closes = closes.rename(columns={'date': 'date_close'})

In [40]: closes
Out[40]:
date_close
group
A 2022-06-01
C 2022-01-01

In [41]: opens.join(df.set_index('group')).join(closes)
Out[41]:
date_open date open close exit date_close
group
A 2022-01-01 2022-01-01 2 0 0 2022-06-01
A 2022-01-01 2022-02-01 2 0 0 2022-06-01
A 2022-01-01 2022-03-01 2 0 0 2022-06-01
A 2022-01-01 2022-04-01 1 1 0 2022-06-01
A 2022-01-01 2022-05-01 1 1 0 2022-06-01
A 2022-01-01 2022-06-01 0 2 1 2022-06-01
B 2023-03-01 2023-03-01 3 0 0 NaT
B 2023-03-01 2023-04-01 3 0 0 NaT
B 2023-03-01 2023-05-01 3 0 0 NaT
B 2023-03-01 2023-06-01 3 1 0 NaT
C 2021-09-01 2021-09-01 1 0 0 2022-01-01
C 2021-09-01 2021-10-01 1 0 0 2022-01-01
C 2021-09-01 2021-11-01 1 0 0 2022-01-01
C 2021-09-01 2021-12-01 1 0 0 2022-01-01
C 2021-09-01 2022-01-01 0 1 1 2022-01-01


</details>



# 答案2
**得分**: 1

使用 [`groupby.transform`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html) 的一种选项:

```python
df['date_open'] = df.groupby('group')['date'].transform('first')
df['date_close'] = (df['date'].where(df['exit'].gt(0), '')
                    .groupby(df['group']).transform('last')
                    #.replace('', float('nan')) # 可选项 
                   )

输出:

   group    date  open  close  exit date_open date_close
0      A  Jan-22     2      0     0    Jan-22     Jun-22
1      A  Feb-22     2      0     0    Jan-22     Jun-22
2      A  Mar-22     2      0     0    Jan-22     Jun-22
3      A  Apr-22     1      1     0    Jan-22     Jun-22
4      A  May-22     1      1     0    Jan-22     Jun-22
5      A  Jun-22     0      2     1    Jan-22     Jun-22
6      B  Mar-23     3      0     0    Mar-23           
7      B  Apr-23     3      0     0    Mar-23           
8      B  May-23     3      0     0    Mar-23           
9      B  Jun-23     3      1     0    Mar-23           
10     C  Sep-21     1      0     0    Sep-21     Jan-22
11     C  Oct-21     1      0     0    Sep-21     Jan-22
12     C  Nov-21     1      0     0    Sep-21     Jan-22
13     C  Dec-21     1      0     0    Sep-21     Jan-22
14     C  Jan-22     0      1     1    Sep-21     Jan-22
英文:

One option using groupby.transform:

df[&#39;date_open&#39;] = df.groupby(&#39;group&#39;)[&#39;date&#39;].transform(&#39;first&#39;)
df[&#39;date_close&#39;] = (df[&#39;date&#39;].where(df[&#39;exit&#39;].gt(0), &#39;&#39;)
                    .groupby(df[&#39;group&#39;]).transform(&#39;last&#39;)
                    #.replace(&#39;&#39;, float(&#39;nan&#39;)) # optional 
                   )

Output:

   group    date  open  close  exit date_open date_close
0      A  Jan-22     2      0     0    Jan-22     Jun-22
1      A  Feb-22     2      0     0    Jan-22     Jun-22
2      A  Mar-22     2      0     0    Jan-22     Jun-22
3      A  Apr-22     1      1     0    Jan-22     Jun-22
4      A  May-22     1      1     0    Jan-22     Jun-22
5      A  Jun-22     0      2     1    Jan-22     Jun-22
6      B  Mar-23     3      0     0    Mar-23           
7      B  Apr-23     3      0     0    Mar-23           
8      B  May-23     3      0     0    Mar-23           
9      B  Jun-23     3      1     0    Mar-23           
10     C  Sep-21     1      0     0    Sep-21     Jan-22
11     C  Oct-21     1      0     0    Sep-21     Jan-22
12     C  Nov-21     1      0     0    Sep-21     Jan-22
13     C  Dec-21     1      0     0    Sep-21     Jan-22
14     C  Jan-22     0      1     1    Sep-21     Jan-22

huangapple
  • 本文由 发表于 2023年6月29日 22:20:46
  • 转载请务必保留本文链接:https://go.coder-hub.com/76581950.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定