英文:
calculating values in python using formula for a dataframe
问题
我有一个包含以下数据的Excel表格,这里我们需要根据其他列的值计算2列,但我得到了NaN值。
在Excel中,我使用以下公式:
f_organic = D2 = C2*(1+E1)
这意味着第x周的f_organic = 第x周的p_organic * (1 + (x-1)周的pct_diff)
而pct_diff是这样给出的:
pct_diff = D2/C2-1
pct_diff = 第2行的f_organic / 第2行的p_organic - 1
但当我尝试通过Python来实现时,我失败了,得到了空值。
我的代码如下:
# 填充f_organic和pct_diff列中的空值
for i in range(1, len(df)):
df.at[i, 'pct_diff'] = df.at[i, 'f_organic'] / df.at[i, 'p_organic'] - 1
df.at[i, 'f_organic'] = df.at[i, 'p_organic'] * (1 + df.at[i-1, 'pct_diff'])
# 打印最终的数据框
df
欢迎任何帮助或建议
英文:
I have an excel sheet with the data below, here we need to calculate 2 columns based on other column values but I am getting NaN values
import pandas as pd
# Define the initial data
data = {'statMonthName': ['Mar', 'Mar', 'Mar', 'Mar', 'Apr', 'Apr', 'Apr', 'Apr', 'Apr'],
'statWeek': [1, 2, 3, 4, 5, 6, 7, 8, 9],
'p_organic': [3646049.56, 3867696.284, 4051128.056, 4095508.5, 2778538.164, 2789640.51, 2736064.373, 3105200.772, 3112694.166],
'f_organic': [2289567, None, None, None, None, None, None, None, None],
'pct_diff': [None, None, None, None, None, None, None, None, None]}
# Create a Pandas dataframe to store the data
df = pd.DataFrame(data)
# Print the dataframe
print(df)
in excel I am using the formula
f_organic = D2 =C2*(1+E1)
which means f_organic of x week = p_organic of x week * (1+ pct_diff of (x-1) week)
and pct_diff is given as
pct_diff = =D2/C2-1
pct_diff = f_organic row2/ p_organic row 2 - 1
when I am trying to achieve this via python, I am failing and getting only null values
my code is below
# fill null values in f_organic and pct_diff columns
for i in range(1, len(df)):
df.at[i, 'pct_diff'] = df.at[i, 'f_organic'] / df.at[i, 'p_organic'] - 1
df.at[i, 'f_organic'] = df.at[i, 'p_organic'] * (1 + df.at[i-1, 'pct_diff'])
# print final dataframe
df
any help or suggestion is appreciated
答案1
得分: 1
你的DataFrame:
statMonthName statWeek p_organic f_organic pct_diff
0 Mar 1 3646049.560 2289567.0 None
1 Mar 2 3867696.284 NaN None
2 Mar 3 4051128.056 NaN None
3 Mar 4 4095508.500 NaN None
4 Apr 5 2778538.164 NaN None
5 Apr 6 2789640.510 NaN None
6 Apr 7 2736064.373 NaN None
7 Apr 8 3105200.772 NaN None
8 Apr 9 3112694.166 NaN None
你的公式(数组表示法):
f_organic[i] = p_organic[i] * (1 + pct_diff[i-1])
pct_diff[i] = (f_organic[i] / p_organic[i]) - 1
正确填充缺失值
从你的代码尝试填充缺失值来看,问题不在于公式,而在于for
循环的构建方式。
因为它从索引i=1
(第二行)开始,它会错过pct_diff[0]
(第一行)的值。因此,所有计算都涉及到空值,结果只会是NaN
。
解决方案是首先在循环之外添加一行,用于填充pct_diff[0]
的值,然后重新排列循环中的代码以反映这一点:
# 填充第一行的缺失值
df.at[0, 'pct_diff'] = df.at[0, 'f_organic'] / df.at[0, 'p_organic'] - 1
# 填充剩余的f_organic和pct_diff值
for i in range(1, len(df)):
# f_organic[i] = p_organic[i] * (1 + pct_diff[i-1])
df.at[i, 'f_organic'] = df.at[i, 'p_organic'] * (1 + df.at[i-1, 'pct_diff'])
# pct_diff[i] = (f_organic[i] / p_organic[i]) - 1
df.at[i, 'pct_diff'] = df.at[i, 'f_organic'] / df.at[i, 'p_organic'] - 1
print(df)
输出:
statMonthName statWeek p_organic f_organic pct_diff
0 Mar 1 3646049.560 2.289567e+06 -0.372042
1 Mar 2 3867696.284 2.428752e+06 -0.372042
2 Mar 3 4051128.056 2.543939e+06 -0.372042
3 Mar 4 4095508.500 2.571808e+06 -0.372042
4 Apr 5 2778538.164 1.744806e+06 -0.372042
5 Apr 6 2789640.510 1.751778e+06 -0.372042
6 Apr 7 2736064.373 1.718134e+06 -0.372042
7 Apr 8 3105200.772 1.949937e+06 -0.372042
8 Apr 9 3112694.166 1.954642e+06 -0.372042
英文:
Your DataFrame:
statMonthName statWeek p_organic f_organic pct_diff
0 Mar 1 3646049.560 2289567.0 None
1 Mar 2 3867696.284 NaN None
2 Mar 3 4051128.056 NaN None
3 Mar 4 4095508.500 NaN None
4 Apr 5 2778538.164 NaN None
5 Apr 6 2789640.510 NaN None
6 Apr 7 2736064.373 NaN None
7 Apr 8 3105200.772 NaN None
8 Apr 9 3112694.166 NaN None
Your formula (in array notation):
f_organic[i] = p_organic[i] * (1 + pct_diff[i-1])
pct_diff[i] = (f_organic[i] / p_organic[i]) - 1
Correctly filling in the missing values
From your code that tries to fill in the missing values, the problem lies not in the formula, but in the way the for
loop was constructed.
Since it starts with index i=1
(the second row), it misses the value of pct_diff[0]
(the first row). So all the calculations will involve null values, resulting only in NaN
s.
A solution would be to first add a line outside the loop that fills in pct_diff[0]
, and then reorder the code in the loop to reflect this:
# Fill in the missing value of pct_diff on the first row.
df.at[0, 'pct_diff'] = df.at[0, 'f_organic'] / df.at[0, 'p_organic'] - 1
# Fill in the remaining values of f_organic and pct_diff.
for i in range(1, len(df)):
# f_organic[i] = p_organic[i] * (1 + pct_diff[i-1])
df.at[i, 'f_organic'] = df.at[i, 'p_organic'] * (1 + df.at[i-1, 'pct_diff'])
# pct_diff[i] = (f_organic[i] / p_organic[i]) - 1
df.at[i, 'pct_diff'] = df.at[i, 'f_organic'] / df.at[i, 'p_organic'] - 1
print(df)
Output:
statMonthName statWeek p_organic f_organic pct_diff
0 Mar 1 3646049.560 2.289567e+06 -0.372042
1 Mar 2 3867696.284 2.428752e+06 -0.372042
2 Mar 3 4051128.056 2.543939e+06 -0.372042
3 Mar 4 4095508.500 2.571808e+06 -0.372042
4 Apr 5 2778538.164 1.744806e+06 -0.372042
5 Apr 6 2789640.510 1.751778e+06 -0.372042
6 Apr 7 2736064.373 1.718134e+06 -0.372042
7 Apr 8 3105200.772 1.949937e+06 -0.372042
8 Apr 9 3112694.166 1.954642e+06 -0.372042
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论