Python DataFrame:创建一个列,根据日期和月份进行动态计算/百分比计算。

huangapple go评论82阅读模式
英文:

python dataframe: create a column, with dynamic calculation/percentage based on day and month

问题

这是我的数据框。最后两列是我可以在查询中创建的计算。

第一列用于计算所有“Good Result”的百分比

df['FirstCol'] = np.where(df['Result'].isin(['Good']), df['Vol'] / df[df['Result'] == 'Good']['Vol'].sum(), 0)

第二列用于计算所有“Result”的百分比

df['SecondCol'] = df['Vol'] / df['Vol'].sum()

对于其他列,代码需要更加动态,这是我正在努力解决的问题。

第三列应该基于每个月份获得百分比。因此,行0-8的百分比应为100%,行9-17也是如此。

第四列应该基于每天和月份获得百分比。因此,行0-2的百分比应为100%,行3-5以此类推。我想要一个动态查询,因为我不想每个月都要更改代码。

期望的输出如下:

	Day	Month	Result	Vol	FirstCol	SecondCol	ThirdCol1-9	MonthlyCol1-3
0	 26	5	Good	123	1%	        0%	        1%	                14%
1	 26	5	Bad	    716	0%	        2%	        5%	                82%
2	 26	5	Other	 36	0%	        0%	        0%	                4%
3	 26	6	Good	4721	26%	        11%	        31%	                36%
4	 26	6	Bad	    7148	0%	        16%	        48%	                54%
5	 26	6	Other	1387	0%	        3%	        9%	                10%
6	 27	5	Good	 196	1%	        0%	        1%	                22%
7	 27	5	Bad	     627	0%	        1%	        4%	                71%
8	 27	5	Other	  60	0%	        0%	        0%	                7%
9	 27	6	Good	6188	34%	        14%	        21%	                44%
10	 27	6	Bad	    6688	0%	        15%	        23%	                48%
11	 27	6	Other	1068	0%	        2%	        4%	                8%
12	 28	5	Good  	 339	2%	        1%	        1%	                22%
13	 28	5	Bad	    1114	0%	        3%	        4%	                73%
14	 28	5	Other     72	0%	        0%	        0%	                5%
15	 28	6	Good	6524	36%	        15%	        23%	                49%
16	 28	6	Bad	    6103	0%	        14%	        21%	                45%
17	 28	6	Other	 820	0%	        2%	        3%	                6%
英文:

I am struggling with the following.

This is my data frame. The last two columns, are calculations which I am able to create in my query.

	Day	Month	Result	Vol	    FirstCol	Second Col
0	 26	   5	  Good	 123	1%	        0%
1	 26	   5	  Bad	 716	0%	        2%
2	 26	   5	  Other   36	0%	        0%
3	 26	   6	  Good	4721	26%	        11%
4	 26	   6	  Bad	7148	0%	        16%
5	 26	   6	  Other	1387	0%	        3%
6	 27	   5	  Good	 196	1%	        0%
7	 27	   5	  Bad	 627	0%	        1%
8	 27	   5	  Other	  60	0%	        0%
9	 27	   6	  Good	6188	34%	        14%
10	 27	   6	  Bad	6688	0%	        15%
11	 27	   6	  Other	1068	0%	        2%
12	 28    5	  Good	 339	2%	        1%
13	 28	   5	  Bad	1114	0%	        3%
14	 28	   5	  Other   72	0%	        0%
15	 28	   6	  Good	6524	36%	        15%
16	 28	   6	  Bad	6103	0%	        14%
17	 28	   6	  Other	 820	0%	        2%

First column to calculate the percentage for all Good Result

df['FirstCol'] = np.where(df['Result'].isin(['Good']),df['Vol']/df[df['Result']=='Good']['Vol'].sum(),0) 

Second column to calculate the percentage for all Result

df['SecondCol'] = df['Vol']/df['Vol'].sum()

For the other the code has to be more dynamic, which I am struggling with.
The Third Column should get the % based on each Month. So the percentage for row 0-8 should be 100% and the same for row 9-17.
The Forth Column should get the % based on each Day and Month. So the percentage for row 0-2 should be 100% and the same for row 3-5 and so on. I want to have a dynamic query. Because I don't want to change each month.

Desired Output

	Day	Month	Result	Vol	FirstCol	Second Col 0-17	Third Col 1-9 (Month)	Forth Col 1-3 (Day)
0	 26	5	Good	123	    1%	        0%	                1%	                14%
1	 26	5	Bad	    716	    0%	        2%	                5%                	82%
2	 26	5	Other	 36	    0%	        0%	                0%                	4%
3	 26	6	Good	4721	26%	        11%	                31%                	36%
4	 26	6	Bad	    7148	0%	        16%	                48%                	54%
5	 26	6	Other	1387	0%	        3%        	        9%                	10%
6	 27	5	Good	 196	1%	        0%        	        1%	                22%
7	 27	5	Bad	     627	0%	        1%        	        4%                	71%
8	 27	5	Other	  60	0%	        0%	                0%                	7%
9	 27	6	Good	6188	34%	        14%        	        21%                	44%
10	 27	6	Bad	    6688	0%	        15%        	        23%                	48%
11	 27	6	Other	1068	0%	        2%        	        4%                	8%
12	 28	5	Good  	 339	2%	        1%        	        1%                	22%
13	 28	5	Bad	    1114	0%	        3%        	        4%                	73%
14	 28	5	Other     72	0%	        0%	                0%                	5%
15	 28	6	Good	6524	36%	        15%                	23%                	49%
16	 28	6	Bad	    6103	0%	        14%                	21%                	45%
17	 28	6	Other	 820	0%	        2%                	3%	                6%

答案1

得分: 1

# 对数据框进行排序以获得更好的输出:
df = df.sort_values(by=['月份', '日期'])

df['第三列'] = df.groupby('月份')['容量'].transform(lambda x: (x / x.sum()) * 100)
df['第四列'] = df.groupby(['日期', '月份'])['容量'].transform(lambda x: (x / x.sum()) * 100)

print(df)

输出:

    日期  月份 结果   容量 第一列 第二列      第三列      第四列
0   26   5   好   123    1%     0%  3.746573   14.057143
1   26   5   差   716    0%     2%  21.809321  81.828571
2   26   5   其他   36    0%     0%   1.096558    4.114286
6   27   5   好   196    1%     0%   5.970149   22.197055
7   27   5   差   627    0%     1%  19.098386   71.007928
8   27   5   其他   60    0%     0%   1.827597    6.795017
12  28   5   好   339    2%     1%  10.325921   22.229508
13  28   5   差  1114    0%     3%  33.932379   73.049180
14  28   5   其他   72    0%     0%   2.193116    4.721311
3   26   6   好  4721   26%    11%  11.614633   35.614062
4   26   6   差  7148    0%    16%  17.585554   53.922752
5   26   6   其他  1387    0%     3%   3.412306   10.463186
9   27   6   好  6188   34%    14%  15.223756   44.377510
10  27   6   差  6688    0%    15%  16.453859   47.963282
11  27   6   其他  1068    0%     2%   2.627500    7.659208
15  28   6   好  6524   36%    15%  16.050385   48.516398
16  28   6   差  6103    0%    14%  15.014638   45.385588
17  28   6   其他   820    0%     2%   2.017369    6.098014
英文:

If I understand you correctly, you want to groupby by multiple columns:

# sort the dataframe to have nicer output:
df = df.sort_values(by=['Month', 'Day'])

df['Third Col'] = df.groupby('Month')['Vol'].transform(lambda x: (x / x.sum()) *100)
df['Fourth Col'] = df.groupby(['Day', 'Month'])['Vol'].transform(lambda x: (x / x.sum())*100)

print(df)

Prints:

    Day  Month Result   Vol FirstCol Second Col  Third Col  Fourth Col
0    26      5   Good   123       1%         0%   3.746573   14.057143
1    26      5    Bad   716       0%         2%  21.809321   81.828571
2    26      5  Other    36       0%         0%   1.096558    4.114286
6    27      5   Good   196       1%         0%   5.970149   22.197055
7    27      5    Bad   627       0%         1%  19.098386   71.007928
8    27      5  Other    60       0%         0%   1.827597    6.795017
12   28      5   Good   339       2%         1%  10.325921   22.229508
13   28      5    Bad  1114       0%         3%  33.932379   73.049180
14   28      5  Other    72       0%         0%   2.193116    4.721311
3    26      6   Good  4721      26%        11%  11.614633   35.614062
4    26      6    Bad  7148       0%        16%  17.585554   53.922752
5    26      6  Other  1387       0%         3%   3.412306   10.463186
9    27      6   Good  6188      34%        14%  15.223756   44.377510
10   27      6    Bad  6688       0%        15%  16.453859   47.963282
11   27      6  Other  1068       0%         2%   2.627500    7.659208
15   28      6   Good  6524      36%        15%  16.050385   48.516398
16   28      6    Bad  6103       0%        14%  15.014638   45.385588
17   28      6  Other   820       0%         2%   2.017369    6.098014

huangapple
  • 本文由 发表于 2023年7月17日 23:42:51
  • 转载请务必保留本文链接:https://go.coder-hub.com/76706112.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定