Fast method to create nested list with different types: numpy, pandas or list concatenation?
Question
I am trying to speed up the code below, which produces a list of lists with a different type in each column. I originally built a pandas DataFrame and then converted it to a list, but this seems fairly slow. How can I create this list faster, say by an order of magnitude? All columns are constant except one.
import pandas as pd
import numpy as np
import time
import datetime


def overflow_check(x):
    # In the SQL code the column is decimal(13, 2)
    p = 13
    s = 3
    max_limit = float("9" * (p - s) + "." + "9" * s)
    if not isinstance(x, np.ndarray) or len(x) < 1:
        raise Exception("Non-numeric or empty array.")
    # Replace values whose magnitude reaches max_limit with +/- max_limit
    return x * (np.abs(x) < max_limit) + np.sign(x) * max_limit * (np.abs(x) >= max_limit)


def list_creation(y_forc):
    backcast_length = len(y_forc)
    backcast = pd.DataFrame(data=np.full(backcast_length, 2),
                            columns=['TypeId'])
    backcast['id2'] = None
    backcast['Daily'] = 1
    backcast['ForecastDate'] = y_forc.index.strftime('%Y-%m-%d')
    backcast['ReportDate'] = pd.to_datetime('today').strftime('%Y-%m-%d')
    backcast['ForecastMethodId'] = 1
    backcast['ForecastVolume'] = overflow_check(y_forc.values)
    backcast['CreatedBy'] = 'test'
    backcast['CreatedDt'] = pd.to_datetime('today')
    return backcast.values.tolist()


i = pd.date_range('05-01-2010', '21-05-2018', freq='D')
x = pd.DataFrame(index=i, data=np.random.randint(0, 100, len(i)))
t = time.perf_counter()
y = list_creation(x)
print(time.perf_counter() - t)
Answer 1
Score: 1
This should be a bit faster, since it builds the list directly:
def list_creation1(y_forc):
    zipped = zip(y_forc.index.strftime('%Y-%m-%d'), overflow_check(y_forc.values)[:, 0])
    t = pd.to_datetime('today').strftime('%Y-%m-%d')
    t1 = pd.to_datetime('today')
    return [
        [2, None, 1, i, t,
         1, v, 'test', t1]
        for i, v in zipped
    ]
%%timeit
list_creation(x)
> 29.3 ms ± 468 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
list_creation1(x)
> 17.1 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
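The `%%timeit` cells above only run in IPython or Jupyter. Outside a notebook, a rough equivalent can be put together with the standard-library `timeit` module. The following is a minimal sketch under some assumptions: `bench`, `clamp`, `via_dataframe` and `via_comprehension` are hypothetical names, the two functions are compact restatements of `list_creation` and `list_creation1`, `np.clip` stands in for `overflow_check` (assuming finite inputs), the date range is an illustrative ISO-formatted one, and the repeat counts are arbitrary. Absolute timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Same limit as overflow_check in the question (p=13, s=3).
MAX_LIMIT = float("9" * 10 + "." + "9" * 3)


def clamp(values):
    # Assumed stand-in for overflow_check: limit magnitudes to MAX_LIMIT.
    return np.clip(values.astype(float), -MAX_LIMIT, MAX_LIMIT)


def via_dataframe(y_forc):
    # Build a DataFrame column by column, then convert it to a list of lists.
    now = pd.Timestamp.today()
    backcast = pd.DataFrame(data=np.full(len(y_forc), 2), columns=['TypeId'])
    backcast['id2'] = None
    backcast['Daily'] = 1
    backcast['ForecastDate'] = y_forc.index.strftime('%Y-%m-%d')
    backcast['ReportDate'] = now.strftime('%Y-%m-%d')
    backcast['ForecastMethodId'] = 1
    backcast['ForecastVolume'] = clamp(y_forc.iloc[:, 0].to_numpy())
    backcast['CreatedBy'] = 'test'
    backcast['CreatedDt'] = now
    return backcast.values.tolist()


def via_comprehension(y_forc):
    # Build the rows directly, skipping the intermediate DataFrame.
    now = pd.Timestamp.today()
    report_date = now.strftime('%Y-%m-%d')
    dates = y_forc.index.strftime('%Y-%m-%d')
    volumes = clamp(y_forc.iloc[:, 0].to_numpy())
    return [[2, None, 1, d, report_date, 1, v, 'test', now]
            for d, v in zip(dates, volumes)]


def bench(func, arg, repeat=3, number=5):
    # Best average time per call, in seconds, over `repeat` rounds.
    timer = timeit.Timer(lambda: func(arg))
    return min(timer.repeat(repeat=repeat, number=number)) / number


idx = pd.date_range('2010-01-05', '2018-05-21', freq='D')
frame = pd.DataFrame(index=idx, data=np.random.randint(0, 100, len(idx)))
for func in (via_dataframe, via_comprehension):
    print(f"{func.__name__}: {bench(func, frame) * 1e3:.2f} ms per call")
```

Taking the minimum over several rounds is the usual way to reduce noise from other processes; `%%timeit` reports a mean and standard deviation instead, so the numbers are not directly comparable.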
Edit: a large part of the slowness is the time it takes to convert the datetimes to the specified string format. That cost can be moved out of the function by passing in pre-formatted date strings, as follows:
def list_creation1(i, v):
    # i: pre-formatted date strings, v: 1-D array of volumes
    zipped = zip(i, overflow_check(np.array([[_x] for _x in v]))[:, 0])
    t = pd.to_datetime('today').strftime('%Y-%m-%d')
    t1 = pd.to_datetime('today')
    return [
        [2, None, 1, d, t,
         1, val, 'test', t1]
        for d, val in zipped
    ]


start = datetime.datetime.strptime("05-01-2010", "%d-%m-%Y")
end = datetime.datetime.strptime("21-05-2018", "%d-%m-%Y")
# Format the dates as '%Y-%m-%d' so the output matches list_creation
i = [(start + datetime.timedelta(days=x)).strftime("%Y-%m-%d") for x in range(0, (end - start).days)]
x = np.random.randint(0, 100, len(i))
This version is now a lot faster:
%%timeit
list_creation1(i, x)
> 1.87 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
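If the input has to remain a DataFrame with a `DatetimeIndex`, the date formatting might also be done without building the strings by hand. The following is a minimal sketch, assuming a timezone-naive index with daily values: NumPy's `np.datetime_as_string` with `unit='D'` produces ISO `YYYY-MM-DD` strings in a single vectorized call. `list_creation2` is a hypothetical name, `np.clip` is an assumed stand-in for `overflow_check` (for finite inputs), and no timing is claimed here, so it is worth measuring on your own data.

```python
import numpy as np
import pandas as pd

# Same limit as overflow_check in the question (p=13, s=3).
MAX_LIMIT = float("9" * 10 + "." + "9" * 3)


def list_creation2(y_forc):
    # Hypothetical variant: format the index with NumPy instead of strftime.
    dates = np.datetime_as_string(y_forc.index.values, unit='D').tolist()
    # Clamp the single data column; .tolist() yields plain Python floats.
    volumes = np.clip(y_forc.iloc[:, 0].to_numpy(dtype=float),
                      -MAX_LIMIT, MAX_LIMIT).tolist()
    now = pd.Timestamp.today()
    report_date = now.strftime('%Y-%m-%d')
    return [[2, None, 1, d, report_date, 1, v, 'test', now]
            for d, v in zip(dates, volumes)]


idx = pd.date_range('2010-01-05', periods=5, freq='D')
frame = pd.DataFrame(index=idx, data=[1, 2, 3, 4, 5])
rows = list_creation2(frame)
print(rows[0][3], rows[0][6])
```

This only covers the ISO date layout; a different output format would still need `strftime` or manual string handling.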