Setting cell values based on computation of all the rows but one

Question

During the preprocessing of some data, I need to remove some outliers. Due to the nature of the application, I cannot remove the data points themselves, so I want to replace them with the maximum of the other data points within some range (here, the same Name group). For instance, consider the following toy example:

import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "Name": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Value": [1, 2, 30, 4, 10, 200, 30, 40],
    "Class": ["S", "S", "S", "S", "X", "X", "X", "X"]
})

Now, let's flag the points that lie more than one standard deviation from the group mean (usually we would do this at 3 standard deviations, or the 99.8th percentile; one standard deviation is used here only as an example):

df[["zscore"]] = (
    df.groupby(["Name"])
    [["Value"]]
    .transform(lambda x : stats.zscore(x, ddof=1))
)

That gives us something like:

  Name  Value Class    zscore
0    A      1     S -0.593976
1    A      2     S -0.521979
2    A     30     S  1.493940
3    A      4     S -0.377985
4    B     10     X -0.685248
5    B    200     X  1.484705
6    B     30     X -0.456832
7    B     40     X -0.342624

Now, I want to replace all values with zscore >= 1.0 to obtain the following table:

  Name  Value Class    zscore
0    A      1     S -0.593976
1    A      2     S -0.521979
2    A      4     S  1.493940
3    A      4     S -0.377985
4    B     10     X -0.685248
5    B     40     X  1.484705
6    B     30     X -0.456832
7    B     40     X -0.342624

Note that at index 2, Value changes from 30 to 4, and at index 5 it changes from 200 to 40.

Now, my data frame is big (78M+ rows), and I want to do this with code that is as efficient as possible while staying short. I tried this, but it doesn't work:

indices = df["zscore"] > 1.0

df.loc[indices] = (
    df[~indices]
    .groupby("Name")
    .max("Value")
)

which gives me:

  Name  Value Class    zscore
0    A    1.0     S -0.593976
1    A    2.0     S -0.521979
2  NaN    NaN   NaN       NaN
3    A    4.0     S -0.377985
4    B   10.0     X -0.685248
5  NaN    NaN   NaN       NaN
6    B   30.0     X -0.456832
7    B   40.0     X -0.342624

So, what is the right way to do it, keeping it short and fast?

Of course, I can do it in a slightly more verbose way (though I don't know whether it is the fastest):

for name, group in df.groupby("Name"):
    indices = group["zscore"] > 1.0
    df.loc[group[indices].index, "Value"] = group.loc[~indices, "Value"].max()

which produces the results I want:

  Name  Value Class    zscore
0    A      1     S -0.593976
1    A      2     S -0.521979
2    A      4     S  1.493940
3    A      4     S -0.377985
4    B     10     X -0.685248
5    B     40     X  1.484705
6    B     30     X -0.456832
7    B     40     X -0.342624

Thanks for your help.

Answer 1

Score: 1

Assuming you have sufficient memory to handle the dataset, you can first mask the values in the Value column where zscore is > 1, then group the masked column by Name and transform with max to broadcast each group's maximum back onto its rows:

m = df['zscore'] > 1  # rows to replace
# mask() hides the outliers so they do not affect the per-group max
df.loc[m, 'Value'] = df['Value'].mask(m).groupby(df['Name']).transform('max')

  Name  Value Class    zscore
0    A      1     S -0.593976
1    A      2     S -0.521979
2    A      4     S  1.493940
3    A      4     S -0.377985
4    B     10     X -0.685248
5    B     40     X  1.484705
6    B     30     X -0.456832
7    B     40     X -0.342624
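
For completeness, here is a minimal, self-contained sketch that puts the question's setup and this answer together, with the cutoff pulled out into a threshold variable (set to 1.0 to match the example; the question mentions 3 standard deviations as the usual choice in practice). The threshold name is purely illustrative.

import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "Name": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Value": [1, 2, 30, 4, 10, 200, 30, 40],
    "Class": ["S", "S", "S", "S", "X", "X", "X", "X"],
})

threshold = 1.0  # illustrative cutoff; 3.0 would match the usual 3-sigma rule

# Per-group z-scores, as in the question.
df["zscore"] = (
    df.groupby("Name")["Value"]
    .transform(lambda x: stats.zscore(x, ddof=1))
)

# Replace each outlier with the max of the non-outlier values in its group.
m = df["zscore"] > threshold
df.loc[m, "Value"] = df["Value"].mask(m).groupby(df["Name"]).transform("max")

print(df)

Two notes. First, the attempt in the question that assigns df[~indices].groupby("Name").max(...) back via df.loc[indices] produces NaN because the right-hand side is indexed by Name ("A", "B"), so pandas cannot align it with the integer row labels being assigned; transform avoids this by returning a result that keeps the original index. Second, compared with the explicit loop over groups, this stays within vectorized groupby operations, which generally matters at the 78M-row scale mentioned in the question.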
