Fast method to create nested list with different types: numpy, pandas or list concatenation?

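Question

I am trying to speed up the code below, which produces a list of lists with a different type in each column. I originally created a pandas DataFrame and then converted it to a list, but this seems fairly slow. How can I create this list faster, say by an order of magnitude? All columns are constant except one.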

import pandas as pd
import numpy as np
import time
import datetime

def overflow_check(x):
    # In the SQL code the column is decimal(13, 2)
    p=13
    s=3
    max_limit = float("9"*(p-s) + "." + "9"*s)
    if np.logical_not(isinstance(x, np.ndarray)) or len(x) < 1:
        raise Exception("Non-numeric or empty array.")
    else:
        return x * (np.abs(x) < max_limit) + np.sign(x)* max_limit * (np.abs(x) >= max_limit)

def list_creation(y_forc):
    backcast_length = len(y_forc)
    backcast = pd.DataFrame(data=np.full(backcast_length, 2),
                            columns=['TypeId'])

    backcast['id2'] = None
    backcast['Daily'] = 1
    backcast['ForecastDate'] = y_forc.index.strftime('%Y-%m-%d')
    backcast['ReportDate'] = pd.to_datetime('today').strftime('%Y-%m-%d')
    backcast['ForecastMethodId'] = 1
    backcast['ForecastVolume'] = overflow_check(y_forc.values)
    backcast['CreatedBy'] = 'test'
    backcast['CreatedDt'] = pd.to_datetime('today')

    return backcast.values.tolist()

i = pd.date_range('05-01-2010', '21-05-2018', freq='D')
x = pd.DataFrame(index=i, data=np.random.randint(0, 100, len(i)))

t = time.perf_counter()
y = list_creation(x)
print(time.perf_counter()-t)


Answer 1

Score: 1

This should be a bit faster, since it builds the list directly:

    def list_creation1(y_forc):
        zipped = zip(y_forc.index.strftime('%Y-%m-%d'), overflow_check(y_forc.values)[:,0])
        t = pd.to_datetime('today').strftime('%Y-%m-%d')
        t1 = pd.to_datetime('today')
        return [
            [2, None, 1, i, t,
            1, v, 'test', t1] 
            for i,v in zipped
        ]

    %%timeit
    list_creation(x)
    > 29.3 ms ± 468 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit
    list_creation1(x)
    > 17.1 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
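
The `%%timeit` cells above only run in IPython or Jupyter. As a rough sketch, a similar measurement can be taken in a plain script with the standard-library `timeit` module. The helper `best_ms` and the stand-in function `build_rows` are hypothetical names introduced only for illustration, and this reports the fastest trial rather than the mean ± std. dev. that `%%timeit` prints, so the numbers will not match those quoted above.

```python
import timeit


def build_rows(dates, values):
    # Hypothetical stand-in for list_creation1: one row per (date, value) pair.
    return [[2, None, 1, d, v] for d, v in zip(dates, values)]


def best_ms(func, *args, repeat=7, number=100):
    # Call func(*args) `number` times per trial, over `repeat` trials,
    # and return the fastest per-call time in milliseconds.
    trials = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(trials) / number * 1e3


dates = ["2010-01-05", "2010-01-06", "2010-01-07"]
values = [10.0, 20.0, 30.0]
print(f"{best_ms(build_rows, dates, values):.4f} ms per call")
```

To time the functions from this thread instead, pass them in place of `build_rows`, for example `best_ms(list_creation1, x)`.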

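Edit: a large part of the slowness is the time it takes to convert the datetimes to the specified string format. We can get rid of that by building the date strings up front and passing them in: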

    def list_creation1(i, v):
        # i: precomputed date strings; v: 1-D NumPy array of forecast values
        zipped = zip(i, overflow_check(np.array([[_x] for _x in v]))[:,0])
        t = pd.to_datetime('today').strftime('%Y-%m-%d')
        t1 = pd.to_datetime('today')
        return [
            [2, None, 1, i, t,
            1, v, 'test', t1] 
            for i,v in zipped
        ]
    
    start = datetime.datetime.strptime("05-01-2010", "%d-%m-%Y")
    end = datetime.datetime.strptime("21-05-2018", "%d-%m-%Y")
    # Same '%Y-%m-%d' output format as in the question; + 1 keeps the end date, as pd.date_range does
    i = [(start + datetime.timedelta(days=x)).strftime("%Y-%m-%d") for x in range(0, (end-start).days + 1)]
    x = np.random.randint(0, 100, len(i))

With the date strings built outside the timed function, this is now a lot faster:

    %%timeit
    list_creation1(i, x)
    > 1.87 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
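
If the dates have to stay in a pandas `DatetimeIndex`, one possible middle ground is to let NumPy do the string conversion in bulk. The sketch below is not from the original answer and has not been timed here: `list_creation2` is a hypothetical name, `np.datetime_as_string` with `unit="D"` produces `YYYY-MM-DD` strings, and `np.clip` is assumed to be an acceptable replacement for the masking arithmetic in `overflow_check`, since both saturate values at ±`max_limit`.

```python
import numpy as np
import pandas as pd


def list_creation2(y_forc, p=13, s=3):
    # Hypothetical variant; p and s mirror the constants in overflow_check.
    max_limit = float("9" * (p - s) + "." + "9" * s)
    # Convert the whole DatetimeIndex to 'YYYY-MM-DD' strings in one call.
    dates = np.datetime_as_string(y_forc.index.values, unit="D").tolist()
    # Saturate the single value column at +/- max_limit.
    volumes = np.clip(y_forc.values[:, 0], -max_limit, max_limit).tolist()
    t1 = pd.to_datetime("today")
    t = t1.strftime("%Y-%m-%d")
    return [
        [2, None, 1, d, t, 1, v, "test", t1]
        for d, v in zip(dates, volumes)
    ]


idx = pd.date_range("2010-01-05", periods=5, freq="D")
frame = pd.DataFrame(index=idx, data=np.random.randint(0, 100, len(idx)))
rows = list_creation2(frame)
print(rows[0][:4])
```

Calling `.tolist()` on the NumPy arrays before the comprehension means the loop only handles plain Python strings and numbers, which is the same idea the edit above relies on.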


huangapple
  • Posted on January 6, 2020, 23:57:45
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59615191.html