Dropping rows with duplicate indexes, based on criteria

# Question


I'm working on my master's project and ran into a situation where I could only come up with a rather slow and inelegant solution. Thus, I reach out to you for suggestions on how to improve this.

The core is an exploded pandas DataFrame (let us call it `df` for ease) with a new column for the original indexes. It looks as follows:

   originalIndex  A   B
0              0  1  10
1              1  2   5
2              2  4   2
3              2  4   1
4              2  2   1
5              3  5   4


The goal is to drop all rows which have duplicates in `originalIndex`, except the one for which `abs(df['A'] - df['B'])` is smallest, i.e. minimal. Thus, the new DataFrame should look like:

   originalIndex  A   B
0              0  1  10
1              1  2   5
4              2  2   1
5              3  5   4


With the possibility of resetting the indexes (I'm familiar with that part).
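For reference, the example frame above can be reconstructed as a plain DataFrame (a minimal sketch; the column values are taken from the table shown):

```python
import pandas as pd

# hypothetical reconstruction of the example frame from the question
df = pd.DataFrame({
    "originalIndex": [0, 1, 2, 2, 2, 3],
    "A": [1, 2, 4, 4, 2, 5],
    "B": [10, 5, 2, 1, 1, 4],
})
```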

My current method is as follows:

```python
temp = df[df.originalIndex.duplicated(keep=False) == True]
for parInd in temp.originalIndex.unique():
    duplicate_df = temp[temp.originalIndex == parInd]
    duplicate_df["Diff"] = duplicate_df.apply(lambda x: abs(x.A - x.B), axis=1)
    minInd = duplicate_df["Diff"].argmin()
    test = duplicate_df.index.drop(duplicate_df.index[minInd])
    df.drop(test)
```

This takes ages for the large amount of duplicates I have in one file alone, not to mention if I have to run it for more than one. Thus, I ask for something that improves on this method.

Update: I've edited the example so that `df` is correct before and after.




# Answer 1
**Score**: 1

I think your example should have row 2 with values `(3, 1)` for `originalIndex == 2` instead of `(4, 1)` as you show - because this in fact has the largest difference. Your code would return `(3, 1)` and remove the other rows.


You should **always** avoid using `.apply` where possible. There are many options for computing your function that are far faster, some of which include:


```python
%timeit df.apply(lambda x: abs(x.A - x.B), axis=1)
# 2.77 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit abs(df.A - df.B)
# 226 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.A.sub(df.B).abs()
# 193 µs ± 9.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df["A"].sub(df["B"]).abs()
# 173 µs ± 3.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Secondly, you should change `df.drop(test)` to `df.drop(test, inplace=True)`, so that the drop is kept. At the moment, your `df` is not actually changing.

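A minimal illustration of that pitfall (reassignment works just as well as `inplace=True`):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

df.drop([0])       # returns a new DataFrame; df itself is unchanged
print(len(df))     # still 3

df = df.drop([0])  # reassigning (or passing inplace=True) keeps the result
print(len(df))     # now 2
```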

You could simply use the following instead:


```python
# create difference column
df["diff"] = df["A"].sub(df["B"]).abs()
# sort by the difference, and drop duplicates in originalIndex
# (keeping the first, i.e. the row with the lowest absolute difference),
# then sort by index again, so that rows are in the same order as before
df = df.sort_values("diff", ascending=True).drop_duplicates(
    subset="originalIndex", keep="first").sort_index()
# drop the "diff" column
df.drop("diff", axis=1, inplace=True)
```
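An alternative sketch (not from the answer itself) that avoids the temporary column entirely: group the absolute difference by `originalIndex` and keep each group's `idxmin()` row.

```python
import pandas as pd

df = pd.DataFrame({
    "originalIndex": [0, 1, 2, 2, 2, 3],
    "A": [1, 2, 4, 4, 2, 5],
    "B": [10, 5, 2, 1, 1, 4],
})

# index label of the row with the smallest |A - B| within each group
keep = (df["A"] - df["B"]).abs().groupby(df["originalIndex"]).idxmin()
result = df.loc[keep]
```

This keeps rows 0, 1, 4, and 5 for the example frame, matching the expected output in the question.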

huangapple
  • Published on June 15, 2023 at 02:50:15
  • When reposting, please keep this link: https://go.coder-hub.com/76476712.html