如何根据一组列的组合作为主键,从另一个CSV文件更新CSV文件?

huangapple go评论64阅读模式
英文:

How to update a csv from another csv considering a combination of columns as the primary key?

问题

我在我的任务中遇到了一个难题,需要构建一个Python项目来执行以下任务。我很诚恳地请求您的帮助,以使其完成!我有两个具有相同标题的CSV文件,它们是:

[ID,日期,时间,类型,状态,成员,账户,属性,信用数量,借记数量,净数量,信用价值,借记价值,净值,货币]

一个CSV是input.csv,其中包含了一个运行的输入值,另一个CSV是system.csv,它充当一个包含每次运行后所有汇总值的数据库。必须根据input.csv更新system.csv。

在更新systems.csv时,将考虑以下字段的组合作为主键:'日期','成员','账户','属性'

如果在systems.csv中找到主键的值,那么在信用数量、借记数量、净数量、信用价值、借记价值、净值下的每个值都必须被添加到现有值中,如下所示:

input.csv

ID,日期,时间,类型,状态,成员,账户,属性,信用数量,借记数量,净数量,信用价值,借记价值,净值,货币
id01,2023.03.16,21:00:00,visa,active,xyz,acc001,cc,100, 0, 1000, 100, 0, 1000, usd
id02,2023.03.16,22:00:00,visa,active,abc,acc002,cc,0,200, 2000, 0, 200, 2000, usd

system.csv

id101,2023.03.16,08:00:00,visa,active,xyz,acc001,cc,500, 0, 5000, 400, 0, 4000, usd
id102,2023.03.16,09:00:00,visa,active,abc,acc002,cc,0,600, 6000, 0, 200, 2000, usd

运行后的system.csv

id101,2023.03.16,21:00:00,visa,active,xyz,acc001,cc,600, 0, 6000, 500, 0, 5000, usd
id102,2023.03.16,22:00:00,visa,active,abc,acc002,cc,0,800, 8000, 0, 400, 4000, usd

目前我已经将两个csv文件作为数据帧,并尝试进行处理。但由于我知识的不足,请帮助我完成这个任务。提前感谢您!

英文:

I am having a block in my assignment to build a python project to do the following task. I'm kindly seeking help from you to make it complete!
I have 2 CSVs with same headers, which are:

[ID,Date,time,type,status,member,account,property,credit quantity, debitquantity, Net quantity, credit value, debit value, Net value, currency]

One csv is input.csv which contains input values for a run and other csv is system.csv which act like a database containing all the summed up values after each run. system.csv must be updated based on the input.csv.

When updating the systems.csv the following combination of fields is considered as the primary key: 'Date', 'member', 'account', 'property'

if the values of the primary key is found in the systems.csv each values under credit quantity, debit quantity, Net quantity, credit value, debit value, Net value must be added to the existing values as follows:

input.csv

ID,Date,time,type,status,member,account,property,credit quantity, debit quantity, Net quantity, credit value, debit value, Net value, currency
id01,2023.03.16,21:00:00,visa,active,xyz,acc001,cc,100, 0, 1000, 100, 0, 1000, usd
id02,2023.03.16,22:00:00,visa,active,abc,acc002,cc,0,200, 2000, 0, 200, 2000, usd

system.csv

id101,2023.03.16,08:00:00,visa,active,xyz,acc001,cc,500, 0, 5000, 400, 0, 4000, usd
id102,2023.03.16,09:00:00,visa,active,abc,acc002,cc,0,600, 6000, 0, 200, 2000, usd

system.csv after run

id101,2023.03.16,21:00:00,visa,active,xyz,acc001,cc,600, 0, 6000, 500, 0, 5000, usd
id102,2023.03.16,22:00:00,visa,active,abc,acc002,cc,0,800, 8000, 0, 400, 4000, usd

Currently I've taken two csvs as dataframes and trying to do the process. but since my lack of knowledge please help me to complete this. Thanks in advance!!

答案1

得分: 1

IIUC,您可以使用:

df_in = pd.read_csv("input.csv")
df_sys = pd.read_csv("system.csv")  # 可选的 `header=None, names=df_in.columns`

pkeys = ["Date", "member", "account", "property"]
scols = df_in.select_dtypes("object").columns.difference(pkeys)
ncols = df_in.columns.difference(scols.union(pkeys))

df_run = (
    pd.concat([df_in, df_sys])
    .groupby(pkeys, as_index=False, sort=False)
    .agg(
        {**{col: "sum" for col in ncols},
         **{col: "first" for col in scols}})
    [df_in.columns]
)

# df_run.to_csv("system.csv", index=False)  # 取消注释以覆盖旧的 `.csv` 文件

输出(以表格格式显示的 system.csv):

ID Date time type status member account property credit quantity debit quantity Net quantity credit value debit value Net value currency
id101 2023.03.16 21:00:00 visa active xyz acc001 cc 600 0 6000 500 0 5000 usd
id102 2023.03.16 22:00:00 visa active abc acc002 cc 0 800 8000 0 400 4000 usd
英文:

IIUC, you can use :

df_in = pd.read_csv("input.csv")
df_sys = pd.read_csv("system.csv") #with optional `header=None, names=df_in.columns`


pkeys = ["Date", "member", "account", "property"]
scols = df_in.select_dtypes("object").columns.difference(pkeys)
ncols = df_in.columns.difference(scols.union(pkeys))

df_run = (
    pd.concat([df_in, df_sys])
        .groupby(pkeys,  as_index=False, sort=False).agg(
            {**{col: "sum" for col in ncols},
             **{col: "first" for col in scols}})
        [df_in.columns]
)

# df_run.to_csv("system.csv", index=False) #uncomment to overwrite the old `.csv` 

Output (system.csv in a tabular format) :

ID Date time type status member account property credit quantity debit quantity Net quantity credit value debit value Net value currency
id101 2023.03.16 21:00:00 visa active xyz acc001 cc 600 0 6000 500 0 5000 usd
id102 2023.03.16 22:00:00 visa active abc acc002 cc 0 800 8000 0 400 4000 usd

huangapple
  • 本文由 发表于 2023年6月19日 16:10:16
  • 转载请务必保留本文链接:https://go.coder-hub.com/76504762.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定