January 6, 2020 23:57:45

Fast method to create nested list with different types: numpy, pandas or list concatenation?

Question

I am trying to speed up the code below, which produces a list of lists with a different type in each column. I originally created a pandas DataFrame and then converted it to a list, but this seems to be fairly slow. How can I create this list faster, say by an order of magnitude? All columns are constant except one.

import pandas as pd
import numpy as np
import time
import datetime


def overflow_check(x):
    # in SQL code the column is decimal(13, 2)
    p = 13
    s = 3
    max_limit = float("9" * (p - s) + "." + "9" * s)
    if not isinstance(x, np.ndarray) or len(x) < 1:
        raise Exception("Non-numeric or empty array.")
    else:
        return x * (np.abs(x) < max_limit) + np.sign(x) * max_limit * (np.abs(x) >= max_limit)


def list_creation(y_forc):
    backcast_length = len(y_forc)

    backcast = pd.DataFrame(data=np.full(backcast_length, 2),
                            columns=['TypeId'])
    backcast['id2'] = None
    backcast['Daily'] = 1
    backcast['ForecastDate'] = y_forc.index.strftime('%Y-%m-%d')
    backcast['ReportDate'] = pd.to_datetime('today').strftime('%Y-%m-%d')
    backcast['ForecastMethodId'] = 1
    backcast['ForecastVolume'] = overflow_check(y_forc.values)
    backcast['CreatedBy'] = 'test'
    backcast['CreatedDt'] = pd.to_datetime('today')

    return backcast.values.tolist()


i = pd.date_range('05-01-2010', '21-05-2018', freq='D')
x = pd.DataFrame(index=i, data=np.random.randint(0, 100, len(i)))
t = time.perf_counter()
y = list_creation(x)
print(time.perf_counter() - t)

Answer 1

Score: 1

This should be a bit faster, since it creates the list directly instead of going through a DataFrame:

def list_creation1(y_forc):
    zipped = zip(y_forc.index.strftime('%Y-%m-%d'), overflow_check(y_forc.values)[:, 0])
    t = pd.to_datetime('today').strftime('%Y-%m-%d')
    t1 = pd.to_datetime('today')
    return [
        [2, None, 1, i, t,
         1, v, 'test', t1]
        for i, v in zipped
    ]

%%timeit
list_creation(x)
> 29.3 ms ± 468 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
list_creation1(x)
> 17.1 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
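
To see where the time goes in a function like this, the date formatting and the row construction can be timed separately. The following is only a rough sketch using the standard library: it formats plain `datetime.date` objects rather than a pandas index, the sample size of 3000 is arbitrary, and the report date is a placeholder string, so the absolute numbers will differ from the pandas timings above.

```python
import datetime
import timeit

start = datetime.date(2010, 1, 5)
# Arbitrary sample of 3000 consecutive days (an assumption for illustration).
days = [start + datetime.timedelta(days=n) for n in range(3000)]
values = list(range(3000))


def format_dates():
    # Step 1: convert every date to a 'YYYY-MM-DD' string.
    return [d.strftime("%Y-%m-%d") for d in days]


date_strs = format_dates()


def build_rows():
    # Step 2: assemble the rows from strings that were formatted beforehand.
    # "2020-01-06" and None are placeholders for the report date and timestamp.
    return [
        [2, None, 1, d, "2020-01-06", 1, v, "test", None]
        for d, v in zip(date_strs, values)
    ]


t_format = timeit.timeit(format_dates, number=20)
t_rows = timeit.timeit(build_rows, number=20)
print(f"formatting: {t_format:.4f}s, row building: {t_rows:.4f}s")
```

Comparing the two printed durations gives a quick indication of which step dominates on a given machine.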

Edit: a large part of the slowness comes from converting the datetimes to the specified string format. We can remove that cost from the function by building the date strings beforehand and passing them in, as follows:

def list_creation1(i, v):
    # i: pre-formatted date strings, v: 1-D array of values
    zipped = zip(i, overflow_check(np.array([[_x] for _x in v]))[:, 0])
    t = pd.to_datetime('today').strftime('%Y-%m-%d')
    t1 = pd.to_datetime('today')
    return [
        [2, None, 1, i, t,
         1, v, 'test', t1]
        for i, v in zipped
    ]

start = datetime.datetime.strptime("05-01-2010", "%d-%m-%Y")
end = datetime.datetime.strptime("21-05-2018", "%d-%m-%Y")
# Format as '%Y-%m-%d' to match the ForecastDate format used in the question.
# Note that range() stops one day before `end`.
i = [(start + datetime.timedelta(days=x)).strftime("%Y-%m-%d") for x in range(0, (end - start).days)]
x = np.random.randint(0, 100, len(i))

This is now a lot faster:

%%timeit
list_creation1(i, x)
> 1.87 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
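
If the dates are available as a NumPy `datetime64` array, the formatting can probably stay inside the function and still be cheap. The sketch below is not part of the original answer: `list_creation_np` is a hypothetical name, it uses `np.datetime_as_string` for the dates, swaps `overflow_check` for `np.clip` (which should give the same result for finite values), and uses `datetime.datetime.now()` in place of `pd.to_datetime('today')`. It has not been benchmarked against the timings above.

```python
import datetime

import numpy as np

# Same bound as overflow_check in the question (p=13, s=3).
MAX_LIMIT = float("9" * 10 + "." + "9" * 3)


def list_creation_np(dates, values):
    """Hypothetical variant: dates is a datetime64 array, values is numeric."""
    # One vectorized call yields 'YYYY-MM-DD' strings for every date.
    date_strs = np.datetime_as_string(dates, unit="D").tolist()
    # Clipping saturates at +/-MAX_LIMIT, as overflow_check does for finite input.
    volumes = np.clip(np.asarray(values, dtype=float), -MAX_LIMIT, MAX_LIMIT).tolist()
    now = datetime.datetime.now()
    today = now.strftime("%Y-%m-%d")
    return [
        [2, None, 1, d, today, 1, v, "test", now]
        for d, v in zip(date_strs, volumes)
    ]


# Daily dates from 2010-01-05 through 2018-05-21 (the end bound is exclusive).
dates = np.arange(np.datetime64("2010-01-05"), np.datetime64("2018-05-22"))
values = np.random.randint(0, 100, len(dates))
rows = list_creation_np(dates, values)
print(len(rows), rows[0][:4])
```

Calling `.tolist()` on both arrays before the comprehension means the rows hold plain Python `str` and `float` objects rather than NumPy scalars, which is usually what a database driver expects.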
