August 11, 2023, 04:41:51

Efficient mapping of a large pandas DataFrame (by index)

Question

I'm currently optimising my code and have found a bottleneck.
I have a DataFrame df with a column 'Numbers' containing integers from 1 to 100. I would like to map those numbers using a dictionary. I know that I can use .map() or .replace(), but both seem slow, and neither takes advantage of the fact that the values in 'Numbers' are the index of my dictionary (which is a Series). That is, I would like to perform the following:

dict_simple = []
for i in range(100):
    dict_simple.append('a' + str(i))

df['Numbers_with_a'] = df['Numbers'].apply(lambda x: dict_simple[x])

Unfortunately, apply is also very slow. Is there a faster way to do this? The DataFrame has 50M+ records.

I have tried the pandas .map(), .replace() and .apply() functions, but performance is very poor. I would like to reduce the computation time.

Answer 1

Score: 1

A pandas.Series has an index that can be used to map one value to another natively in pandas, without the extra expense of calling apply for each row or converting values to Python int objects. Since the numbers you want to map start from zero and a Series is indexed from 0 by default, you can do the following:

import pandas as pd

df = pd.DataFrame({"numbers": [1, 4, 22, 7, 99]})
# Lookup Series: position i holds the string 'a<i>'
str_map = pd.Series([f'a{i}' for i in range(100)])
# Positional lookup for every row at once, then drop the lookup's index
df['numbers_with_a'] = str_map.iloc[df.numbers].reset_index(drop=True)
print(df)

str_map is a Series created from your "a0", "a1", ... strings. str_map.iloc[df.numbers] uses your numbers as positional indices, giving you a new Series of the mapped values. That Series is indexed by your numbers, so you drop that index and assign the result back to the original DataFrame. Because column assignment aligns on the index, this relies on df having the default 0..n-1 index.
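
If df does not carry the default 0..n-1 index (for example, after filtering rows), the reset index no longer lines up with df, and the assignment can yield NaN or misplaced values. A minimal sketch of one way around this, assuming the same kind of str_map lookup, is to take the result as a plain NumPy array so that no index alignment takes place. The helper name map_by_position is hypothetical and is not part of pandas:

```python
import numpy as np
import pandas as pd


def map_by_position(codes: pd.Series, lookup: pd.Series) -> np.ndarray:
    """Return the lookup values at the integer positions given by codes.

    Hypothetical helper: the result is a plain array, so assigning it to a
    DataFrame column is positional and ignores the DataFrame's index.
    """
    return lookup.to_numpy()[codes.to_numpy()]


# A frame whose index is NOT 0..n-1, e.g. what remains after a filter.
df = pd.DataFrame({"numbers": [1, 4, 22, 7, 99]}, index=[10, 20, 30, 40, 50])
str_map = pd.Series([f"a{i}" for i in range(100)])

df["numbers_with_a"] = map_by_position(df["numbers"], str_map)
print(df)
```

The trade-off is that the result carries no index of its own, so it should only be assigned to the frame the codes came from.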

Answer 2

Score: 1

Convert your list to a NumPy array and index it with the column values, as below:

import numpy as np

dict_simple = []
for i in range(100):
    dict_simple.append('a' + str(i))

# Integer-array indexing: picks dict_array[n] for every n in df['Numbers'] in one step
dict_array = np.array(dict_simple)
df['Numbers_with_a'] = dict_array[df['Numbers'].values]
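
One detail that may matter at 50M rows: np.array on a list of strings produces a fixed-width Unicode array, which pandas has to convert when it is stored as a column. A hedged variant is to build the lookup with dtype=object, so that indexing only copies references to the existing strings. The sketch below also sizes the table to 101 entries so that the question's stated range of 1 to 100 stays in bounds. The sample data and the names lookup and codes are assumptions for illustration, and whether this is actually faster depends on your pandas and NumPy versions, so measure before relying on it:

```python
import numpy as np
import pandas as pd

# Lookup table with 101 slots so that the values 1..100 are all valid
# positions (slot 0 is simply unused). dtype=object keeps references to
# Python strings instead of a fixed-width Unicode buffer.
lookup = np.array([f"a{i}" for i in range(101)], dtype=object)

# Small stand-in for the real 50M-row frame (assumed sample data).
df = pd.DataFrame({"Numbers": [1, 50, 100, 7]})

codes = df["Numbers"].to_numpy()
# Guard against out-of-range values before indexing: negative values would
# otherwise silently wrap around to the end of the array in NumPy.
assert codes.min() >= 0 and codes.max() < len(lookup)

df["Numbers_with_a"] = lookup[codes]
print(df["Numbers_with_a"].tolist())
```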

Answer 3

Score: 0

Thanks for all the answers. I ran a comparison:

import pandas as pd
import time
import numpy as np

# 10M random integers in 1..9 (randint's upper bound is exclusive)
df = pd.DataFrame(np.random.randint(1, 10, size=(10000000, 1)), columns=list('N'))

# Build the same lookup as a dict, a list, a NumPy array and a Series
dict_dictionary = {}
dict_list = []
for i in range(10):
    dict_dictionary[i] = 'a' + str(i)
    dict_list.append('a' + str(i))
dict_array = np.array(dict_list)
dict_series = pd.Series([f'a{i}' for i in range(10)])

print('map')
start_time = time.time()
df['Numbers_map'] = df['N'].map(dict_dictionary)
print(time.time() - start_time)

print('replace')
start_time = time.time()
# Assign the result instead of calling replace(inplace=True) on df[...],
# which may act on a temporary copy in newer pandas versions
df['Numbers_replace'] = df['N'].replace(dict_dictionary)
print(time.time() - start_time)

print('array')
start_time = time.time()
df['Numbers_array'] = dict_array[df['N'].values]
print(time.time() - start_time)

print('series')
start_time = time.time()
df['Numbers_series'] = dict_series.iloc[df.N].reset_index(drop=True)
print(time.time() - start_time)

print('end')

The results (in seconds) are as follows:

map
1.424480676651001
replace
3.657830238342285
array
1.4687621593475342
series
0.4687619209289551
end

"replace" performs somewhat better with small dictionaries, but overall the Series approach is the fastest.

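Single time.time() measurements like the ones above can be noisy, and they do not confirm that the approaches produce the same output. A minimal sketch of a more repeatable harness follows, assuming a smaller frame and the standard-library timeit module. The names approaches, lookup_dict, lookup_array and lookup_series are illustrative, and no timings are claimed here:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"N": rng.integers(0, 10, size=100_000)})

labels = [f"a{i}" for i in range(10)]
lookup_dict = dict(enumerate(labels))
lookup_array = np.array(labels, dtype=object)
lookup_series = pd.Series(labels)

# Each approach returns a plain NumPy array so the results are comparable.
approaches = {
    "map": lambda: df["N"].map(lookup_dict).to_numpy(),
    "array": lambda: lookup_array[df["N"].to_numpy()],
    "series": lambda: lookup_series.iloc[df["N"].to_numpy()].to_numpy(),
}

# First check that all approaches produce identical output.
reference = approaches["map"]()
for name, func in approaches.items():
    assert list(func()) == list(reference), name

# Then time each one: best of 3 repeats, 5 calls per repeat.
for name, func in approaches.items():
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f"{name:>6}: {best:.4f} s per call")
```

Taking the minimum over several repeats reduces the influence of background load; scale size up toward the real row count once the results agree.
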
Answer 4

Score: 0

I have updated NumPy and pandas to the newest versions, and "map" is now very close to the "series" approach. Computing time depends on the size of the dictionary: sometimes "map" is faster and sometimes "series".
I also tried parallel-pandas to use all CPUs, but performance was worse than "map". Most likely p_map is slower because I'm using only 2 CPUs.

import pandas as pd
from parallel_pandas import ParallelPandas
import time
import numpy as np
ParallelPandas.initialize(n_cpu=2, split_factor=2, disable_pr_bar=True)

df = pd.DataFrame(np.random.randint(1, 99, size=(1000000, 1)), columns=list('N'))

dict_dictionary = {}
dict_list = []
for i in range(100):
    dict_dictionary[i] = 'a' + str(i)
    dict_list.append('a' + str(i))
dict_array = np.array(dict_list)
dict_series = pd.Series([f'a{i}' for i in range(100)])
print('p_map')
start_time = time.time()
df['Numbers_p_map'] = df['N'].p_map(dict_dictionary)
print(time.time() - start_time)
