Setting cell values based on computation of all the rows but one

Question

During the preprocessing of some data, I need to remove some outliers. Due to the nature of the application, I cannot remove the data points themselves, so I want to replace them with the maximum of the other data points within some range. For instance, assume the following toy example:

import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "Name": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Value": [1, 2, 30, 4, 10, 200, 30, 40],
    "Class": ["S", "S", "S", "S", "X", "X", "X", "X"]
})

Now, let's modify the points that are more than one standard deviation away from their group mean (usually this is done at 3x the standard deviation, or at the 99.8th percentile; one standard deviation is used here just as an example):

df[["zscore"]] = (
    df.groupby(["Name"])
    [["Value"]]
    .transform(lambda x : stats.zscore(x, ddof=1))
)

That gives us something like:

  Name  Value Class    zscore
0    A      1     S -0.593976
1    A      2     S -0.521979
2    A     30     S  1.493940
3    A      4     S -0.377985
4    B     10     X -0.685248
5    B    200     X  1.484705
6    B     30     X -0.456832
7    B     40     X -0.342624
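
For reference, the other cut-offs mentioned above can be expressed in the same way; this is only a sketch of those alternatives, and the rest of the example keeps the one-standard-deviation cut:

# 3x-standard-deviation cut on the per-group z-score computed above
flag_3sigma = df["zscore"].abs() > 3.0

# or a per-group 99.8th percentile cut on the raw values
p998 = df.groupby("Name")["Value"].transform(lambda s: s.quantile(0.998))
flag_p998 = df["Value"] > p998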

Now, I want to replace all values with zscore >= 1.0 to obtain the following table:

  Name  Value Class    zscore
0    A      1     S -0.593976
1    A      2     S -0.521979
2    A      4     S  1.493940
3    A      4     S -0.377985
4    B     10     X -0.685248
5    B     40     X  1.484705
6    B     30     X -0.456832
7    B     40     X -0.342624

Note that at index 2, Value changes from 30 to 4, and at index 5, Value changes from 200 to 40.
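
These replacement values are simply the per-group maxima of the rows that were not flagged; as a quick check on the same df:

print(df.loc[df["zscore"] < 1.0].groupby("Name")["Value"].max())
# Name
# A     4
# B    40
# Name: Value, dtype: int64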

Now, my data frame is big (78M+ rows), and I want to do this with code that is as efficient as possible while still staying short. I tried this, but it doesn't work:

indices = df["zscore"] > 1.0

df.loc[indices] = (
    df[~indices]
    .groupby("Name")
    .max("Value")
)

which gives me:

  Name  Value Class    zscore
0    A    1.0     S -0.593976
1    A    2.0     S -0.521979
2  NaN    NaN   NaN       NaN
3    A    4.0     S -0.377985
4    B   10.0     X -0.685248
5  NaN    NaN   NaN       NaN
6    B   30.0     X -0.456832
7    B   40.0     X -0.342624
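
The NaN rows here seem to come from index alignment: the right-hand side of that assignment is indexed by Name ('A', 'B'), not by the original row labels (2 and 5), so .loc has nothing to align with and fills NaN. A quick check on the same variables (just a sketch) makes the mismatch visible:

rhs = df[~indices].groupby("Name").max("Value")   # same right-hand side as above
print(rhs.index)   # Index(['A', 'B'], dtype='object', name='Name')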

So, what is the right way to do it, keeping it short and fast?

Of course, I can do it in a slightly more verbose way (I don't know whether it is the fastest):

for name, group in df.groupby("Name"):
    indices = group["zscore"] > 1.0
    df.loc[group[indices].index, ["Value"]] = group[~indices][["Value"]].max()[0]

which produces the results I want:

  Name  Value Class    zscore
0    A      1     S -0.593976
1    A      2     S -0.521979
2    A      4     S  1.493940
3    A      4     S -0.377985
4    B     10     X -0.685248
5    B     40     X  1.484705
6    B     30     X -0.456832
7    B     40     X -0.342624

Thanks for your help.

Answer 1

Score: 1

Assuming you have sufficient memory to handle the dataset, you can first mask the values in the Value column where zscore is > 1, then group the masked column by Name and transform with max to broadcast the per-group maximum back onto every row:

m = df['zscore'] > 1
df.loc[m, 'Value'] = df['Value'].mask(m).groupby(df['Name']).transform('max')

  Name  Value Class    zscore
0    A      1     S -0.593976
1    A      2     S -0.521979
2    A      4     S  1.493940
3    A      4     S -0.377985
4    B     10     X -0.685248
5    B     40     X  1.484705
6    B     30     X -0.456832
7    B     40     X -0.342624
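
If memory is a concern at 78M+ rows, the same idea can also be written without the Python-level lambda and without keeping the zscore column around; a minimal sketch, assuming a df with the Name and Value columns from the question:

g = df.groupby("Name")["Value"]
# per-group z-score via vectorized group transforms (GroupBy.std uses ddof=1 by default)
z = (df["Value"] - g.transform("mean")) / g.transform("std")
m = z > 1.0
# mask the flagged values, then broadcast the per-group max of the remaining ones
df.loc[m, "Value"] = df["Value"].mask(m).groupby(df["Name"]).transform("max")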
