Setting cell values based on computation of all the rows but one
Question
During the preprocessing of some data, I need to remove some outliers. Due to the nature of the application, I cannot remove the data points themselves, so I want to replace them with the maximum of the other data points within some range. For instance, assume the following toy example:
import pandas as pd
from scipy import stats
df = pd.DataFrame({
    "Name": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Value": [1, 2, 30, 4, 10, 200, 30, 40],
    "Class": ["S", "S", "S", "S", "X", "X", "X", "X"]
})
Now, let's modify the points that lie more than one standard deviation from their group mean (usually we would use 3x the standard deviation, or the 99.8% percentile; one standard deviation is used here only as an example):
df[["zscore"]] = (
    df.groupby(["Name"])
    [["Value"]]
    .transform(lambda x: stats.zscore(x, ddof=1))
)
That gives us something like:
  Name  Value Class    zscore
0    A      1     S -0.593976
1    A      2     S -0.521979
2    A     30     S  1.493940
3    A      4     S -0.377985
4    B     10     X -0.685248
5    B    200     X  1.484705
6    B     30     X -0.456832
7    B     40     X -0.342624
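As a side note, the same z-scores can probably be computed without a Python-level lambda, which may matter on a large frame. The sketch below assumes the sample standard deviation (ddof=1, which is the pandas default for std) is what is wanted, matching the scipy call above. The names grouped, mean and std are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Value": [1, 2, 30, 4, 10, 200, 30, 40],
    "Class": ["S", "S", "S", "S", "X", "X", "X", "X"],
})

# Per-group mean and sample standard deviation, broadcast back to every row.
grouped = df.groupby("Name")["Value"]
mean = grouped.transform("mean")
std = grouped.transform("std")  # pandas uses ddof=1 by default

# Plain vectorized arithmetic on aligned Series; no per-group Python callback.
df["zscore"] = (df["Value"] - mean) / std
print(df)
```

Whether this is actually faster on a given dataset is worth measuring rather than assuming.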
Now, I want to replace every Value whose zscore is >= 1.0 with the maximum of the remaining values in its Name group, to obtain the following table:
  Name  Value Class    zscore
0    A      1     S -0.593976
1    A      2     S -0.521979
2    A      4     S  1.493940
3    A      4     S -0.377985
4    B     10     X -0.685248
5    B     40     X  1.484705
6    B     30     X -0.456832
7    B     40     X -0.342624
Note that at index 2, Value changes from 30 to 4, and at index 5, Value changes from 200 to 40.
My data frame is big (78M+ rows), so I want to do this with the most efficient code that is still short. I tried this, but it doesn't work:
indices = df["zscore"] > 1.0
df.loc[indices] = (
    df[~indices]
    .groupby("Name")
    .max("Value")
)
which gives me:
  Name  Value Class    zscore
0    A    1.0     S -0.593976
1    A    2.0     S -0.521979
2  NaN    NaN   NaN       NaN
3    A    4.0     S -0.377985
4    B   10.0     X -0.685248
5  NaN    NaN   NaN       NaN
6    B   30.0     X -0.456832
7    B   40.0     X -0.342624
So, what is the right way to do it, keeping it short and fast?
Of course, I can do it more verbosely (I don't know whether this is the fastest way):
for name, group in df.groupby("Name"):
    indices = group["zscore"] > 1.0
    # Replace each outlier with the maximum of the group's non-outlier values.
    df.loc[group[indices].index, "Value"] = group.loc[~indices, "Value"].max()
which produces the results I want:
  Name  Value Class    zscore
0    A      1     S -0.593976
1    A      2     S -0.521979
2    A      4     S  1.493940
3    A      4     S -0.377985
4    B     10     X -0.685248
5    B     40     X  1.484705
6    B     30     X -0.456832
7    B     40     X -0.342624
Thanks for your help.
Answer 1
Score: 1
Assuming you have sufficient memory to handle the dataset, you can first mask the values in the Value column where zscore is > 1, then group the masked column by Name and transform with max to broadcast each group's maximum back to its rows:
m = df['zscore'] > 1
df.loc[m, 'Value'] = df['Value'].mask(m).groupby(df['Name']).transform('max')
  Name  Value Class    zscore
0    A      1     S -0.593976
1    A      2     S -0.521979
2    A      4     S  1.493940
3    A      4     S -0.377985
4    B     10     X -0.685248
5    B     40     X  1.484705
6    B     30     X -0.456832
7    B     40     X -0.342624
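For reference, here is a self-contained sketch of this approach next to a variant that maps per-group maxima back through Name. It assumes the toy frame from the question, computes the z-scores with plain pandas instead of scipy, and assumes every group keeps at least one non-outlier row. The names broadcast_max, group_max, via_transform and via_map are illustrative. The variant needs the map step because an aggregated groupby result is indexed by Name rather than by the original row labels, so assigning it directly aligns on nothing and yields NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Value": [1, 2, 30, 4, 10, 200, 30, 40],
    "Class": ["S", "S", "S", "S", "X", "X", "X", "X"],
})

# z-score per group with the sample standard deviation (ddof=1).
grouped = df.groupby("Name")["Value"]
df["zscore"] = (df["Value"] - grouped.transform("mean")) / grouped.transform("std")

m = df["zscore"] > 1  # True for the rows treated as outliers

# Approach from the answer: hide outliers as NaN, then broadcast each
# group's maximum of the remaining values back to every row.
broadcast_max = df["Value"].mask(m).groupby(df["Name"]).transform("max")
# Assumption: no group consists only of outliers, so no NaN survives the cast.
via_transform = df["Value"].where(~m, broadcast_max).astype("int64")

# Variant: aggregate the non-outliers once, then look the result up by Name.
group_max = df.loc[~m].groupby("Name")["Value"].max()  # indexed by Name: A -> 4, B -> 40
via_map = df["Value"].where(~m, df["Name"].map(group_max))

print(via_transform.tolist())
print(via_map.tolist())
```

Both variants build full-length intermediate Series, so the memory caveat above applies to either one.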