2023-05-20 22:11:46

Pandas - filtering on a transition within a chronologically ordered dataframe

Question

I have a dataframe with the format shown below, where each row represents a snapshot in time of a particular specimen.

As time moves forward, a specimen can move from type 1 to type 2, but not from type 2 to type 1. There are other types such as 3, 4, and 5, but I figured if I know how to deal with 1 and 2, I can make it work for the others as well.

The data contains errors in the form of transitions from 2 to 1, and my goal is to find them and produce the set of affected IDs.

For example, the output for the data below should be 456, since that ID went from Type 2 to Type 1 over time, which is an error.

ID (not unique)	Snapshot Month (YYYYMMDD)	Type (1, 2, 3, 4, 5)
123	20210131	1
123	20210521	2
456	20210131	2
456	20210521	1

What I have tried is to sort by ID, then by Snapshot Month. I then thought about slicing the dataframe by Type and using a loop to find, for each ID, the maximum date where Type is 1 and the minimum date where Type is 2, merging the two on ID, and checking whether the maximum is greater than the minimum.

Not only is this unpythonic, it is also inefficient (it relies on loops). Is there a better way?
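The max/min comparison described above does not actually need an explicit loop. The following is a minimal sketch under stated assumptions: the column names `ID`, `Snapshot Month`, and `Type` are taken from the table header, the dates are stored as YYYYMMDD integers, and the variable names are illustrative only.

```python
import pandas as pd

# Sample data from the question (column names assumed from the table header)
df = pd.DataFrame({
    "ID": [123, 123, 456, 456],
    "Snapshot Month": [20210131, 20210521, 20210131, 20210521],
    "Type": [1, 2, 2, 1],
})

# Latest snapshot per ID where Type is 1, earliest snapshot per ID where Type is 2
last_type1 = df.loc[df["Type"].eq(1)].groupby("ID")["Snapshot Month"].max()
first_type2 = df.loc[df["Type"].eq(2)].groupby("ID")["Snapshot Month"].min()

# Line the two results up on ID; IDs that lack either type are dropped
both = pd.concat([last_type1, first_type2], axis=1, keys=["last_1", "first_2"]).dropna()

# An ID is flagged if a Type 1 snapshot occurs after a Type 2 snapshot
error_ids = set(both.index[both["last_1"] > both["first_2"]].tolist())
print(error_ids)
```

Note that this variant flags an ID whenever any Type 1 snapshot comes after any Type 2 snapshot, even if other types appear in between, which is slightly broader than checking consecutive snapshots only.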

Answer 1

Score: 1

Sort the values by month, then shift the Type column within each ID. Compare the previous and current value within each ID to identify the wrong transitions, then use loc to select all such IDs.

prev = df.sort_values('Snapshot Month').groupby('ID')['Type'].shift()
all_ids = df.loc[df['Type'].eq(1) & prev.eq(2), 'ID'].unique()

Result:

array([456])
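The question also mentions types 3, 4, and 5. The same shift-based idea could be extended to an arbitrary list of forbidden (previous, current) pairs. This is only a sketch: the function name `ids_with_forbidden_transitions`, its parameters, the extra ID 789, and the rule `(3, 2)` are hypothetical and not part of the original thread.

```python
import pandas as pd


def ids_with_forbidden_transitions(df, forbidden, id_col="ID",
                                   time_col="Snapshot Month", type_col="Type"):
    """Return the set of IDs that contain at least one forbidden (previous, current) step."""
    # Order chronologically within each ID without mutating the caller's frame
    ordered = df.sort_values([id_col, time_col])
    # Previous Type within the same ID (NaN for each ID's first snapshot)
    prev = ordered.groupby(id_col)[type_col].shift()
    # Start with nothing flagged, then OR in one vectorized test per rule
    mask = pd.Series(False, index=ordered.index)
    for before, after in forbidden:
        mask = mask | (prev.eq(before) & ordered[type_col].eq(after))
    return set(ordered.loc[mask, id_col].tolist())


# Sample data from the question plus one made-up ID (789) for illustration
df = pd.DataFrame({
    "ID": [123, 123, 456, 456, 789, 789],
    "Snapshot Month": [20210131, 20210521, 20210131, 20210521, 20210131, 20210521],
    "Type": [1, 2, 2, 1, 3, 2],
})

only_2_to_1 = ids_with_forbidden_transitions(df, [(2, 1)])
# (3, 2) is a made-up rule, used here only to show several rules at once
with_extra_rule = ids_with_forbidden_transitions(df, [(2, 1), (3, 2)])
print(only_2_to_1, with_extra_rule)
```

The loop runs once per rule rather than once per row, so each test stays vectorized over the whole frame.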

Answer 2

Score: 1

Sorting your data by ID and snapshot, grouping by ID, and then doing a shift might help:

# Sort the dataframe by ID, then chronologically within each ID
# ('Snapshot Month' is the column name used in the question's table)
df.sort_values(['ID', 'Snapshot Month'], inplace=True)

# Group by ID
grouped = df.groupby('ID')

# Add a column holding each row's previous Type within the same ID
df['Previous Type'] = grouped['Type'].shift(1)

# Keep the rows where the Type went from 2 to 1
errors = df[(df['Type'] == 1) & (df['Previous Type'] == 2)]

# Keep only the unique IDs
error_ids = errors['ID'].unique()

Answer 3

Score: 1

Another possible solution:

(df['ID'][
    df.sort_values(['ID', 'Snapshot Month'])
    .pipe(lambda x:
        x['ID'].eq(x['ID'].shift()) & x['Type'].eq(1) & x['Type'].shift().eq(2))]
 .unique())

Output:

array([456])
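Answers 1 and 3 both compute their result on a sorted copy and then apply it to the original, unsorted `df`. This relies on pandas matching rows by index label rather than by position. A small sketch to check that, using the question's rows in a deliberately shuffled order (the shuffled order and the name `prev_in_row_order` are illustrative assumptions):

```python
import pandas as pd

# The question's four rows, deliberately out of chronological order
df = pd.DataFrame({
    "ID": [456, 123, 456, 123],
    "Snapshot Month": [20210521, 20210521, 20210131, 20210131],
    "Type": [1, 2, 2, 1],
})

# The shifted values keep the index labels of the rows they belong to
prev = df.sort_values("Snapshot Month").groupby("ID")["Type"].shift()

# Put the shifted values back in df's row order to make the alignment explicit
prev_in_row_order = prev.reindex(df.index)

# Row 0 (ID 456, Type 1) has previous Type 2, so it is the only flagged row
mask = df["Type"].eq(1) & prev_in_row_order.eq(2)
ids = df.loc[mask, "ID"].unique().tolist()
print(ids)
```

The explicit `reindex` is not required for the answers to work; it is shown here only to make the label-based alignment visible.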
