
Setting cell values based on computation of all the rows but one

Question
During the preprocessing of some data, I need to remove some outliers. Due to the nature of the application, I cannot remove the data points themselves, so I want to replace them with the maximum of the other data points within some range. For instance, assume the following toy example:

    import pandas as pd
    from scipy import stats

    df = pd.DataFrame({
        "Name": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "Value": [1, 2, 30, 4, 10, 200, 30, 40],
        "Class": ["S", "S", "S", "S", "X", "X", "X", "X"]
    })

Now, let's modify the points that are more than one standard deviation away (usually this is done at 3× the standard deviation, or at the 99.8th percentile; one standard deviation is used here just as an example):

    df["zscore"] = (
        df.groupby("Name")["Value"]
        .transform(lambda x: stats.zscore(x, ddof=1))
    )

That gives us something like:

      Name  Value Class    zscore
    0    A      1     S -0.593976
    1    A      2     S -0.521979
    2    A     30     S  1.493940
    3    A      4     S -0.377985
    4    B     10     X -0.685248
    5    B    200     X  1.484705
    6    B     30     X -0.456832
    7    B     40     X -0.342624
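As an aside, the percentile-based cutoff mentioned above can be computed the same way with `transform`. A minimal sketch of that variant (same toy data; the `0.998` threshold is illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Value": [1, 2, 30, 4, 10, 200, 30, 40],
})

# Per-group 99.8th-percentile cutoff instead of a z-score cutoff.
cutoff = df.groupby("Name")["Value"].transform(lambda x: x.quantile(0.998))
outliers = df["Value"] > cutoff
```

On this toy data the percentile rule flags the same two points (indices 2 and 5) as the z-score rule.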

Now, I want to replace all values with zscore >= 1.0 to obtain the following table:

      Name  Value Class    zscore
    0    A      1     S -0.593976
    1    A      2     S -0.521979
    2    A      4     S  1.493940
    3    A      4     S -0.377985
    4    B     10     X -0.685248
    5    B     40     X  1.484705
    6    B     30     X -0.456832
    7    B     40     X -0.342624

Note that at index 2, Value changed from 30 to 4, and at index 5 from 200 to 40.

Now, my data frame is big (78M+ rows), and I want to do this with the most efficient, yet still short, code. I tried this, but it doesn't work:

    indices = df["zscore"] > 1.0
    df.loc[indices] = (
        df[~indices]
        .groupby("Name")
        .max("Value")
    )

which gives me (the right-hand side is indexed by Name, so it does not align with the row labels selected on the left):

      Name  Value Class    zscore
    0    A    1.0     S -0.593976
    1    A    2.0     S -0.521979
    2  NaN    NaN   NaN       NaN
    3    A    4.0     S -0.377985
    4    B   10.0     X -0.685248
    5  NaN    NaN   NaN       NaN
    6    B   30.0     X -0.456832
    7    B   40.0     X -0.342624

So, what is the right way to do it, keeping it short and fast?

Of course, I can do it a bit more verbosely (I don't know whether this is the fastest way):

    for name, group in df.groupby("Name"):
        indices = group["zscore"] > 1.0
        df.loc[group[indices].index, ["Value"]] = group[~indices][["Value"]].max()[0]

which produces the results I want:

      Name  Value Class    zscore
    0    A      1     S -0.593976
    1    A      2     S -0.521979
    2    A      4     S  1.493940
    3    A      4     S -0.377985
    4    B     10     X -0.685248
    5    B     40     X  1.484705
    6    B     30     X -0.456832
    7    B     40     X -0.342624

Thanks for your help.

Answer 1

Score: 1

Assuming you have sufficient memory to handle the dataset, you can first mask the values in the Value column where zscore is > 1, then group the masked column by Name and transform with max to broadcast the maximum of the remaining values per group:

    m = df['zscore'] > 1
    df.loc[m, 'Value'] = df['Value'].mask(m).groupby(df['Name']).transform('max')

      Name  Value Class    zscore
    0    A      1     S -0.593976
    1    A      2     S -0.521979
    2    A      4     S  1.493940
    3    A      4     S -0.377985
    4    B     10     X -0.685248
    5    B     40     X  1.484705
    6    B     30     X -0.456832
    7    B     40     X -0.342624
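For reference, these two lines can be run end to end as a self-contained script; this is nothing new, just the question's setup and the answer combined:

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "Name": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Value": [1, 2, 30, 4, 10, 200, 30, 40],
    "Class": ["S", "S", "S", "S", "X", "X", "X", "X"]
})

# Per-group z-scores (sample std, ddof=1), as in the question.
df["zscore"] = df.groupby("Name")["Value"].transform(lambda x: stats.zscore(x, ddof=1))

# Mask the outliers, then broadcast each group's max of the remaining values.
m = df["zscore"] > 1
df.loc[m, "Value"] = df["Value"].mask(m).groupby(df["Name"]).transform("max")
```

Because `mask` turns the flagged values into NaN, they are ignored by `max`, so each outlier is replaced by the maximum of its group's non-outlier values only.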

huangapple
  • Published 2023-02-14 00:06:06
  • Please retain this link when reposting: https://go.coder-hub.com/75438380.html