Python or SQL solution: creating an effective date and expiration date table
Question
I have a table that tracks bank account information. Each bank account can have someone added as an interested party and removed at any time (even on the same day). A person can be added once and never removed, or added and removed one or more times (even several times a day).
Example:
Date | BankId | AccountID | Type | PersonId |
---|---|---|---|---|
2/9/2022 | 0001 | 0004 | Addition | 0015 |
2/10/2022 | 0004 | 0005 | Addition | 0038 |
3/2/2022 | 0001 | 0004 | Deletion | 0015 |
As seen above, person 0015 was added as an interested party to account 0004, which is held at bank 0001. I want to reshape the data into an effective date / expiration date format.
So instead we would see:
EffectiveDate | ExpirationDate | BankId | AccountID | PersonId |
---|---|---|---|---|
2/9/2022 | 3/2/2022 | 0001 | 0004 | 0015 |
2/10/2022 | null | 0004 | 0005 | 0038 |
My data is over 250 million rows, so I am looking for an efficient way to do this. SQL has been extremely slow, so I am trying Python. The only approach I can think of is to loop through each unique bank/account/person combination and, for each addition, find the closest deletion (if any) that comes after the addition.
Loops are rarely efficient in pandas, so can anyone suggest a cleaner, more efficient way to build this data? It could be in SQL too, but the result will never be stored as a new table; it is only used for analytics.
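For reference, the loop described above might look like the following sketch. It is not the original code: it assumes a pandas DataFrame holding the columns from the example, stores the IDs as strings to keep the leading zeros, and breaks same-day ties by row order because the example has no time of day. This is the slow baseline that a vectorized solution would replace.

```python
import pandas as pd

# Sample rows copied from the example table (IDs kept as strings: an assumption).
df = pd.DataFrame(
    {
        "Date": ["2/9/2022", "2/10/2022", "3/2/2022"],
        "BankId": ["0001", "0004", "0001"],
        "AccountID": ["0004", "0005", "0004"],
        "Type": ["Addition", "Addition", "Deletion"],
        "PersonId": ["0015", "0038", "0015"],
    }
)
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Stable sort: events on the same date keep their original row order.
df = df.sort_values("Date", kind="stable")

keys = ["BankId", "AccountID", "PersonId"]
rows = []
for key, grp in df.groupby(keys, sort=False):
    types = grp["Type"].tolist()
    dates = grp["Date"].tolist()
    for i, (kind, start) in enumerate(zip(types, dates)):
        if kind != "Addition":
            continue
        # First deletion that comes after this addition within the group, if any.
        end = next(
            (d for k, d in zip(types[i + 1:], dates[i + 1:]) if k == "Deletion"),
            pd.NaT,
        )
        rows.append((start, end, *key))

out = pd.DataFrame(rows, columns=["EffectiveDate", "ExpirationDate", *keys])
print(out)
```

The inner search runs once per addition, so the cost grows with the number of groups and with the number of events inside each group, which is why this does not scale to hundreds of millions of rows.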
Answer 1
Score: 1
I'm assuming that the data is sorted by `Date`. Then you can do:

    # Flag each Addition, then count them per bank/account/person so that an
    # Addition and the Deletion that follows it share the same counter value.
    df["tmp"] = df["Type"].eq("Addition")
    df["tmp"] = df.groupby(["BankId", "AccountID", "PersonId"])["tmp"].cumsum()

    # Spread the two event types into columns, one row per counter value.
    out = (
        df.pivot(
            index=["BankId", "AccountID", "PersonId", "tmp"], columns="Type", values="Date"
        )
        .reset_index()
        .rename_axis(None, axis=1)
        .drop(columns="tmp")
    )
    print(out)

Prints:

       BankId  AccountID  PersonId   Addition  Deletion
    0       1          4        15   2/9/2022  3/2/2022
    1       4          5        38  2/10/2022       NaN
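Since this approach relies on the rows being in chronological order, here is a minimal self-contained sketch that parses and sorts the dates first and renames the result columns to match the requested layout. The sample frame, the string IDs, the `keys` list, and the `spell` column name are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Sample rows from the question, deliberately out of order to show the sort.
df = pd.DataFrame(
    {
        "Date": ["3/2/2022", "2/9/2022", "2/10/2022"],
        "BankId": ["0001", "0001", "0004"],
        "AccountID": ["0004", "0004", "0005"],
        "Type": ["Deletion", "Addition", "Addition"],
        "PersonId": ["0015", "0015", "0038"],
    }
)
keys = ["BankId", "AccountID", "PersonId"]

# Parse the dates so that sorting is chronological rather than lexicographic.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Stable sort: events on the same date keep their original row order.
df = df.sort_values("Date", kind="stable").reset_index(drop=True)

# Each Addition starts a new "spell" within its bank/account/person group.
df["spell"] = df["Type"].eq("Addition").astype(int)
df["spell"] = df.groupby(keys)["spell"].cumsum()

out = (
    df.pivot(index=keys + ["spell"], columns="Type", values="Date")
    .reset_index()
    .rename_axis(None, axis=1)
    .drop(columns="spell")
    .rename(columns={"Addition": "EffectiveDate", "Deletion": "ExpirationDate"})
)
print(out)
```

Note that `pivot` raises a `ValueError` when an index/column pair is duplicated, which would happen here if a group contained two deletions in a row with no addition between them, so real data may need to be checked or deduplicated first.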