Performance increase in pandas in a liquidity analysis
Question
I have a big CSV table with about 90k records. I use Python + pandas for data analysis (sadly, due to a restriction from the program, I can only use pandas v0.2). Every record has a date, an account number and two columns with float values.
For the liquidity analysis I need the sum of the two columns up to a given date for a specific account number. The given dates are each day of a year.
So I need the sum of the two columns for the 1st Jan, 2nd Jan, 3rd Jan, ..., 31st Dec for each account.
If I take the 5th July and account number 3000 as an example, I need the sum of both columns over all entries with account number = 3000 and date <= 5th July.
I did that with the following code:
    analyseRange = pd.date_range(settings.analyseStartDatum, settings.analyseEndDatum)
    for datum in analyseRange:
        globals()[str(datum) + "-vbTable"] = pd.DataFrame(columns=["Konto", "Kontobezeichnung", "Datum", "Saldo", "nicht Fällig", "Fällige VB"])
        for i in range(settings.accountNummernStart, settings.accountNummernEnde):
            startTimer = time.perf_counter()
            # all bookings for account i up to the analysis date
            buchungen = df.loc[(df[settings.kontoNrCol] == i) & (df[settings.belegDatumCol] <= datum),
                               [settings.kontoNrCol, settings.belegDatumCol, settings.sollCol,
                                settings.habenCol, settings.faelligkeitsCol, "FaelligBisClean"]]
            verbindlichkeiten = buchungen[settings.habenCol].sum() - buchungen[settings.sollCol].sum()
            # bookings up to the analysis date that are not yet due on that date
            nichtFaelligeBuchungen = df.loc[(df[settings.kontoNrCol] == i) & (df[settings.belegDatumCol] <= datum)
                                            & (df["FaelligBisClean"] > datum), [settings.sollCol, settings.habenCol]]
            nichtFaelligeVerbindlichkeiten = nichtFaelligeBuchungen[settings.habenCol].sum()
            if verbindlichkeiten != 0:
                # append returns a new DataFrame, so the result has to be assigned back
                row = {"Konto": i, "Kontobezeichnung": "tbd.", "Datum": datum, "Saldo": verbindlichkeiten,
                       "nicht Fällig": nichtFaelligeVerbindlichkeiten,
                       "Fällige VB": verbindlichkeiten - nichtFaelligeVerbindlichkeiten}
                globals()[str(datum) + "-vbTable"] = globals()[str(datum) + "-vbTable"].append(row, ignore_index=True)
                print("append: " + str(i) + "-tbd.- - " + str(datum) + " - " + str(verbindlichkeiten) + " - "
                      + str(nichtFaelligeVerbindlichkeiten) + " - "
                      + str(verbindlichkeiten - nichtFaelligeVerbindlichkeiten)
                      + ",time: " + str(time.perf_counter() - startTimer), flush=True)
            else:
                print("Konto is even: " + str(i) + ",time: " + str(time.perf_counter() - startTimer), flush=True)
    return ""
analyseStartDatum is the first day of the year, analyseEndDatum is the last day of the year. accountNummernStart is 10000 and accountNummernEnde is 90000.
I store every calculated day in a separate dataframe (globals()[str(datum) + "-vbTable"]) to save it to the drive later.
So I have to check 90k accounts for each of 365 days. That's a total of about 32 million calculations. One calculation is pretty fast, only 0.002s, but since it needs to be executed 32 million times, it takes around 18h to finish. That's way too slow.
Does anyone have suggestions for a performance boost? I am mostly looking for pandas-based improvements. I already know a few ways to reduce the number of records I have to check, so I can get a boost there, but maybe someone knows some smart pandas commands that calculate this faster. This is my first time working with pandas, so I'm pretty new to this.
My first idea was a "counter" for the sum: instead of calculating the sum over the complete set each time, I would just add each new entry to a running sum. But I will have to add more logic later that decides whether an entry really has to be added, and that depends on the entries before it, so I would have to check the "old" entries anyway. I don't see a boost there.
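Sketched very roughly, with made-up data and without that extra decision logic, the counter idea would look like this:

    from collections import defaultdict

    # hypothetical entries, already sorted by date: (date, account, value)
    entries = [(1, 3000, 10.0), (1, 3001, 5.0), (2, 3000, -2.0)]

    running = defaultdict(float)   # account number -> running sum so far
    snapshots = {}                 # (date, account) -> sum up to that date

    for date, account, value in entries:
        running[account] += value
        snapshots[(date, account)] = running[account]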
Answer 1
Score: 0
If I understand your problem correctly, you can do something like this:
    import pandas as pd
    import numpy as np

    # Simple example dataframe
    df = pd.DataFrame({'date': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                       'account': [10, 20, 30, 10, 20, 30, 10, 20, 30, 10, 20, 30],
                       'number1': np.linspace(100, 110, 12),
                       'number2': np.linspace(110, 130, 12)})

    # Calculate the sum of the two columns and store it
    df['total'] = df['number1'] + df['number2']

    # Sort beforehand; in particular on 'date', for the cumulative sum
    df = df.sort_values(['account', 'date'])

    for account, group in df.groupby('account'):
        group['cum-total'] = group.total.cumsum()
        print(account, ":")
        print(group)
        print()
There are probably niftier ways to do this in a one-liner, but this should get you a long way, and it should be fast.
Multiple dataframes are not needed; it can all be done within a single dataframe.
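For example, something along these lines should compute the cumulative totals directly in the dataframe, without the explicit loop (untested against your data, so treat it as a sketch):

    # Sort once, then take the cumulative sum of 'total' per account;
    # row order within each group is preserved, so the cumsum runs by date.
    df = df.sort_values(['account', 'date'])
    df['cum-total'] = df.groupby('account')['total'].cumsum()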
One caveat is that this is written for pandas 1.x and later, so it may need some adaptation. For example, .sort_values is probably just .sort in version 0.2.
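If both old and new pandas versions have to be supported, a small guard like this might help (I have not tested it against 0.x, so take it as a sketch):

    def sort_by(frame, cols):
        # Newer pandas (>= 0.17) has sort_values; older releases used sort.
        if hasattr(frame, 'sort_values'):
            return frame.sort_values(cols)
        return frame.sort(columns=cols)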
Comments