2023-06-15 03:23:18

Python Pandas: Select multiple related rows from dataframe using comparisons across rows

Question

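For each ID, extract the test with the most recent date; if two tests share the most recent date, select the one with the most successes. In other words, select the most recent and most successful test and present all of that test's results. You can use the following code: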

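import pandas as pd
# Your initial DataFrame
data = {
    'ID': ["A", "A", "A", "B", "B", "B", "C"],
    'Test': ["e2z", "e2z", "b6r", "p0o", "r5t", "qi4", "x3w"],
    'Date': ["2022", "2022", "2020", "2019", "2019", "2018", "2023"],
    'Success': ['1', '0', '1', '0', '1', '0', '0'],
    'Experiment Parameters': ["awa", "02s", "ksa", "fkd", "efe", "awe", "loa"]
}
df = pd.DataFrame(data)
# Convert 'Date' to datetime and 'Success' to integers so both compare numerically
df['Date'] = pd.to_datetime(df['Date'], format='%Y')
df['Success'] = df['Success'].astype(int)
# Total successes per (ID, Test), aligned to every row
df['Total'] = df.groupby(['ID', 'Test'])['Success'].transform('sum')
# Sort so the most recent, most successful test comes first within each ID
ranked = df.sort_values(by=['ID', 'Date', 'Total'], ascending=[True, False, False])
# Take the first test of each ID, then keep every row belonging to that test
# (groupby('ID').first() would return only one row per ID)
best = ranked.drop_duplicates('ID')[['ID', 'Test']]
result = df.merge(best, on=['ID', 'Test']).drop(columns='Total')
# Print the result DataFrame
print(result)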

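This extracts the test with the most recent date for each ID and, when several tests share that date, picks the one with the most successes, giving you the desired dataframe.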

English original:

I have data like this:

In[1]: pd.DataFrame({'ID':["A", 'A', 'A', 'B', 'B', 'B','C'], 'Test':["e2z", 'e2z', 'b6r', 'p0o', 'r5t', 'qi4','x3w'], 'Date':["2022", '2022', '2020', '2019', '2019', '2018', '2023'], 'Success':['1', '0', '1', '0', '1', '0','0'], 'Experiment Parameters': ["awa", '02s', 'ksa', 'fkd', 'efe', 'awe','loa']})
Out[1]:
  ID Test  Date Success Experiment Parameters
0  A  e2z  2022       1                   awa
1  A  e2z  2022       0                   02s
2  A  b6r  2020       1                   ksa
3  B  p0o  2019       0                   fkd
4  B  r5t  2019       1                   efe
5  B  qi4  2018       0                   awe
6  C  x3w  2023       0                   loa

Each row presents a finding from the corresponding test.

I need code that will, for each ID, extract out the test with most recent date. If there are two tests with the most recent dates, the test with the most total successes should be selected. Therefore, the most recent and most successful test is selected, presenting all the findings from that test.

In this example data, I want the output to be:

In[2]: pd.DataFrame({'ID':["A", 'A', 'B','C'], 'Test':["e2z", 'e2z', 'r5t', 'x3w'], 'Date':["2022", '2022', '2019', '2023'], 'Success':['1', '0', '1', '0'], 'Experiment Parameters': ["awa", '02s', 'efe','loa']})
Out[2]:
  ID Test  Date Success Experiment Parameters
0  A  e2z  2022       1                   awa
1  A  e2z  2022       0                   02s
2  B  r5t  2019       1                   efe
3  C  x3w  2023       0                   loa

I've tried my hand at aggregate and grouping python functions following https://stackoverflow.com/questions/15705630/get-the-rows-which-have-the-max-value-in-groups-using-groupby/15705958#15705958 like this:

aggre = {'Date': 'unique', 'Success': 'sum'}
idx = input_df.groupby(['Test'])['Date'].transform(max) == input_df['Date']
input_df = input_df[idx].groupby(['Test']).aggregate(aggre)

but these solutions force the rows to be combined, and I need to just subselect rows. I can't simply have the Experiment Parameters variable be condensed with the aggregate functions either, since I need each row to serve as an independent data point to a model. I can't use solutions from https://stackoverflow.com/questions/30987055/python-pandas-select-rows-based-on-comparison-across-rows since I may need multiple rows to be preserved. Methods like .apply(helper_function) don't show promise, since my decisions to select rows depend on the values in other rows. I can't find any other tricks or functions to subselect rows in the dependent manner I need.
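The difference between combining rows and subselecting them can be seen on a small frame. This is only a sketch on toy data (the values below are made up for illustration, not taken from the question): `agg` returns one row per group, while `transform` returns a value for every original row, which is what a row mask needs.

```python
import pandas as pd

# Toy data, hypothetical values chosen only to show the shapes involved.
input_df = pd.DataFrame({
    "Test": ["e2z", "e2z", "b6r"],
    "Date": [2022, 2022, 2020],
    "Success": [1, 1, 0],
})

# agg collapses each group to a single row: 2 tests -> 2 rows.
collapsed = input_df.groupby("Test").agg({"Date": "max", "Success": "sum"})

# transform broadcasts the group result back onto every original row,
# so the result has the same length and index as input_df.
total = input_df.groupby("Test")["Success"].transform("sum")

print(len(collapsed), len(total))
```

Because `total` lines up with the original rows, it can be compared against other row-aligned Series to build a boolean mask without merging any rows.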

How can I achieve my desired dataframe?

Answer 1

Score: 1

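The code, with comments:

# Sort within each ID group so the most recent and successful test is at the top of each group. Then use duplicated to select the first row of each ID group, and self-merge to keep only the rows of the best test.
df = df.sort_values(["ID", "Date", "Success"], ascending=[True, False, False])
# Keep the first occurrence of each ID, and only the 'ID' and 'Test' columns
best_test = df.loc[~df["ID"].duplicated()][['ID', 'Test']]
# Merge the dataframe with the best-test dataframe on the 'ID' and 'Test' columns
df2 = df.merge(best_test, on=['ID', 'Test'])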

English original:

I perform a sort so that the most recent and successful test is on top of each group of ID's. Then I use duplicated to grab the first row of each ID group and then perform a self merge to only grab the rows with the best test.

df = df.sort_values(["ID", "Date", "Success"], ascending=[True, False, False])
best_test = df.loc[~df["ID"].duplicated()][['ID', 'Test']]
df2 = df.merge(best_test, on=['ID', 'Test'])
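The sort above compares per-row Success values, which is enough for the example data. If "most total successes" is meant as a sum over all rows of a test, a mask-based variant might look like the sketch below. It assumes each test has a single date and that tests tied on both date and total should all be kept; the `Year` and `Wins` column names are hypothetical helpers, not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B", "C"],
    "Test": ["e2z", "e2z", "b6r", "p0o", "r5t", "qi4", "x3w"],
    "Date": ["2022", "2022", "2020", "2019", "2019", "2018", "2023"],
    "Success": ["1", "0", "1", "0", "1", "0", "0"],
    "Experiment Parameters": ["awa", "02s", "ksa", "fkd", "efe", "awe", "loa"],
})

# Hypothetical helper columns: numeric versions of the string columns.
work = df.assign(Year=df["Date"].astype(int), Wins=df["Success"].astype(int))

# Keep only the rows from each ID's most recent year.
recent = work[work["Year"] == work.groupby("ID")["Year"].transform("max")]

# Total successes per (ID, Test), aligned to each remaining row.
total = recent.groupby(["ID", "Test"])["Wins"].transform("sum")

# Keep the rows whose test has the highest total within its ID.
best = total.groupby(recent["ID"]).transform("max")
df2 = recent.loc[total == best, df.columns].reset_index(drop=True)
print(df2)
```

No rows are merged or aggregated away here: every step is either a row-aligned `transform` or a boolean mask, so each finding of the selected test stays a separate row.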
