Pandas - filtering on a transition within a chronologically ordered dataframe

Question

I have a dataframe with the format shown below, where each row represents a snapshot in time of a particular specimen.

As time moves forward, a specimen can move from type 1 to type 2, but not from type 2 to type 1. There are other types such as 3, 4, and 5, but I figured if I know how to deal with 1 and 2, I can make it work for the others as well.

The data contains errors where there are transitions from 2 to 1, and my goal is to find them and produce a set of IDs.

For example, the output should be 456, since it went from Type 2 to Type 1 as time passes, which is an error.

ID (not unique)   Snapshot Month (YYYYMMDD)   Type (1, 2, 3, 4, 5)
123               20210131                    1
123               20210521                    2
456               20210131                    2
456               20210521                    1

What I have tried is to sort by ID, then by Snapshot Month. I thought about slicing the dataframe by Type and using a loop to find, for each ID, the maximum date where Type is 1 and the minimum date where Type is 2, then merging the two on ID and checking whether the maximum is greater than the minimum.

Not only is this unpythonic, it is also inefficient (loops). Are there better ways?
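
For reference, the max/min comparison described above can be written without an explicit loop. This is only a sketch, assuming the column names are exactly ID, Snapshot Month and Type and that the dates are stored as YYYYMMDD integers; the names last_type1, first_type2 and suspect_ids are made up for illustration. Note that it flags any Type 1 snapshot dated after a Type 2 snapshot for the same ID, even when the two rows are not adjacent in time.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ID": [123, 123, 456, 456],
    "Snapshot Month": [20210131, 20210521, 20210131, 20210521],
    "Type": [1, 2, 2, 1],
})

# Latest Type 1 snapshot and earliest Type 2 snapshot, per ID.
last_type1 = df[df["Type"] == 1].groupby("ID")["Snapshot Month"].max()
first_type2 = df[df["Type"] == 2].groupby("ID")["Snapshot Month"].min()

# Align the two Series on ID; IDs that lack either type are dropped.
dates = pd.concat([last_type1, first_type2], axis=1,
                  keys=["last_type1", "first_type2"]).dropna()

# An ID is suspect if it was still Type 1 after it had already been Type 2.
suspect_ids = set(dates.index[dates["last_type1"] > dates["first_type2"]])
print(suspect_ids)
```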

Answer 1

Score: 1

Sort the values by month, then shift the Type column within each ID. Compare each row's previous and current Type to identify the wrong transitions, then use loc to select all such IDs.

prev = df.sort_values('Snapshot Month').groupby('ID')['Type'].shift()
all_ids = df.loc[df['Type'].eq(1) & prev.eq(2), 'ID'].unique()

Result

array([456])
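
To try this end to end, here is a self-contained sketch that rebuilds the question's sample table and runs the same two lines. It assumes the columns are named exactly ID, Snapshot Month and Type, and that the dates are YYYYMMDD integers, so numeric order matches chronological order. Because prev is computed on a sorted copy, the comparison relies on pandas aligning the two Series by index label, which in turn assumes the dataframe's index is unique.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ID": [123, 123, 456, 456],
    "Snapshot Month": [20210131, 20210521, 20210131, 20210521],
    "Type": [1, 2, 2, 1],
})

# Previous Type within each ID, taken in chronological order.
prev = df.sort_values("Snapshot Month").groupby("ID")["Type"].shift()

# Rows whose current Type is 1 while the previous Type for that ID was 2.
all_ids = df.loc[df["Type"].eq(1) & prev.eq(2), "ID"].unique()
print(all_ids)
```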

Answer 2

Score: 1

Sorting your data by ID and snapshot, grouping by ID, and then doing a shift might help you:

# sort the dataframe by ID and Snapshot Month (column name as in the question)
df.sort_values(['ID', 'Snapshot Month'], inplace=True)

# then group by ID
grouped = df.groupby('ID')

# create a previous type column
df['Previous Type'] = grouped['Type'].shift(1)

# get the rows that contain an error by comparing the value and the previous one
errors = df[(df['Type'] == 1) & (df['Previous Type'] == 2)]

# keep only the unique IDs
error_ids = errors['ID'].unique()
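
Since the question mentions other types (3, 4, 5), the same sort, group and shift steps could be wrapped in a small helper that takes the forbidden pair as arguments. This is a sketch only: ids_with_transition and its parameters are hypothetical names, the ID 789 rows are invented to show a bad transition in the middle of a longer history, and the helper detects only direct (adjacent) transitions.

```python
import pandas as pd

def ids_with_transition(df, from_type, to_type,
                        id_col="ID", time_col="Snapshot Month", type_col="Type"):
    """Return the unique IDs that go directly from from_type to to_type."""
    # Work on a sorted copy so the caller's dataframe is left untouched.
    ordered = df.sort_values([id_col, time_col])
    previous = ordered.groupby(id_col)[type_col].shift()
    mask = previous.eq(from_type) & ordered[type_col].eq(to_type)
    return ordered.loc[mask, id_col].unique()

# Question's sample rows plus an invented ID (789) with three snapshots.
df = pd.DataFrame({
    "ID": [123, 123, 456, 456, 789, 789, 789],
    "Snapshot Month": [20210131, 20210521, 20210131, 20210521,
                       20210131, 20210521, 20210930],
    "Type": [1, 2, 2, 1, 1, 2, 1],
})

bad_ids = ids_with_transition(df, from_type=2, to_type=1)
print(bad_ids)
```

Calling the helper once per forbidden pair, for example (3, 2) or (5, 4), would cover the other types, assuming those pairs are in fact invalid in your data.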

Answer 3

Score: 1

Another possible solution:

# build a boolean mask on the sorted frame; it is aligned back to df by index
(df['ID'][
    df.sort_values(['ID', 'Snapshot Month'])
    .pipe(lambda x:
        x['ID'].eq(x['ID'].shift()) & x['Type'].eq(1) & x['Type'].shift().eq(2))]
 .unique())

Output:

array([456])

huangapple
  • Posted on May 20, 2023, 22:11:46
  • When reposting, please retain the link to this article: https://go.coder-hub.com/76295653.html