Python or SQL solution- Creating effective date and expiration date table

Question

I have a table that tracks bank account information. Each bank account can have someone added as an interested party and removed at any time (even on the same day). A person can be added once and never removed, or added and removed once or multiple times (even several times a day).

Example:

Date       BankId  AccountID  Type      PersonId
2/9/2022   0001    0004       Addition  0015
2/10/2022  0004    0005       Addition  0038
3/2/2022   0001    0004       Deletion  0015

As seen above, person 0015 was added as an interested party to account 0004, which is housed at bank 0001. I want to reshape the data into an effective date / expiration date format.

So instead we would see:

EffectiveDate  ExpirationDate  BankId  AccountID  PersonId
2/9/2022       3/2/2022        0001    0004       0015
2/10/2022      null            0004    0005       0038

My data is over 250 million rows, so I am looking for an efficient way to do this. SQL has been extremely slow, so I am trying Python. But the only way I can think of is to loop through each unique bank/account/person combination and, for each addition, find the closest deletion (if any) dated after the addition.

Loops are rarely efficient in pandas, so can anyone suggest a clearer, more efficient way to build this data? It could be in SQL too, but this data will never be stored as a new table; it is being used for analytics purposes.
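
For anyone who wants to experiment, the sample rows in the question can be loaded into pandas roughly as follows. This is only a sketch: the names `raw` and `df` are illustrative, and reading the IDs as strings and parsing the dates as month/day/year are assumptions about the real data, not something the post states.

```python
import io

import pandas as pd

# The sample rows from the question, whitespace-separated.
raw = """Date BankId AccountID Type PersonId
2/9/2022 0001 0004 Addition 0015
2/10/2022 0004 0005 Addition 0038
3/2/2022 0001 0004 Deletion 0015
"""

# dtype=str keeps the leading zeros in the ID columns (assumed to be text IDs).
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", dtype=str)

# Parse the dates explicitly (assumed month/day/year) so they sort chronologically
# rather than alphabetically.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
```

Without `dtype=str`, pandas would infer integers for the ID columns and drop the leading zeros.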

Answer 1

Score: 1

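I'm assuming that the data is sorted by Date. Then you can do:

import pandas as pd

# df is the events table from the question, already sorted by Date.
# Number each Addition within its bank/account/person group; a Deletion
# inherits the number of the Addition that precedes it.
df["tmp"] = df["Type"].eq("Addition")
df["tmp"] = df.groupby(["BankId", "AccountID", "PersonId"])["tmp"].cumsum()

# Spread each Addition/Deletion pair into one row, with one date column per Type.
out = (
    df.pivot(
        index=["BankId", "AccountID", "PersonId", "tmp"], columns="Type", values="Date"
    )
    .reset_index()
    .rename_axis(None, axis=1)
    .drop(columns="tmp")
)

print(out)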

Prints:

   BankId  AccountID  PersonId   Addition  Deletion
0       1          4        15   2/9/2022  3/2/2022
1       4          5        38  2/10/2022       NaN
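
The answer relies on the rows already being in date order. A variant that sorts explicitly and pairs each Addition with the event that follows it in the same group might look like the sketch below. It is not the answerer's method: it swaps the pivot for `groupby(...).shift(-1)`, and the extra 3/5/2022 row is hypothetical data added only to show a second add cycle for the same person.

```python
import pandas as pd

# Sample rows from the question plus one hypothetical re-addition of person 0015.
df = pd.DataFrame(
    {
        "Date": ["2/9/2022", "2/10/2022", "3/2/2022", "3/5/2022"],
        "BankId": ["0001", "0004", "0001", "0001"],
        "AccountID": ["0004", "0005", "0004", "0004"],
        "Type": ["Addition", "Addition", "Deletion", "Addition"],
        "PersonId": ["0015", "0038", "0015", "0015"],
    }
)
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

keys = ["BankId", "AccountID", "PersonId"]

# Stable sort by date, so events sharing a date keep their original order.
df = df.sort_values("Date", kind="stable")

# Look one event ahead within each bank/account/person group.
grouped = df.groupby(keys, sort=False)
next_type = grouped["Type"].shift(-1)
next_date = grouped["Date"].shift(-1)

# Keep the Additions; the expiration is the next event's date only if that
# event is a Deletion, otherwise it stays missing (NaT).
out = df.loc[df["Type"].eq("Addition"), keys + ["Date"]].rename(
    columns={"Date": "EffectiveDate"}
)
out["ExpirationDate"] = next_date.where(next_type.eq("Deletion"))
out = out[["EffectiveDate", "ExpirationDate"] + keys].reset_index(drop=True)

print(out)
```

With this approach an Addition that is followed by another Addition simply gets a missing expiration date, and a Deletion with no preceding Addition is dropped. The pivot version, by contrast, raises an error on duplicate entries, for example when two Deletions follow a single Addition in the same group. Whether silently dropping such rows is acceptable depends on the data.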

huangapple
  • Posted on April 4, 2023 at 05:39:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75923968.html