Efficient mapping of large pandas dataframe (by index)

Question

I'm currently optimising my code and have found a bottleneck.
I have a dataframe df with a column 'Numbers' containing integers from 1 to 100. I would like to map those numbers using a dictionary. I know that I can use .map() or .replace(), but both seem slow, and neither takes into account that the values in 'Numbers' are indexes into my dictionary (which is a series). That is, I would like to perform the following:

# Lookup table: position i holds the label for number i
dict_simple = [f'a{i}' for i in range(100)]

# Row-by-row lookup (this is the slow part)
df['Numbers_with_a'] = df['Numbers'].apply(lambda x: dict_simple[x])

Unfortunately, apply is also very slow. Is there a faster way to do this? The dataframe has 50M+ records.

I have tried the .map(), .replace() and .apply() functions from pandas, but performance is very poor. I would like to reduce the computation time.

Answer 1

Score: 1

A pandas.Series has an index that can be used to map one value to another natively in pandas, without the extra expense of calling apply for each row or converting values to Python int objects. Since the numbers you want to map start from zero and a Series is indexed from 0 by default, you can do the following:

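import pandas as pd

df = pd.DataFrame({"numbers": [1, 4, 22, 7, 99]})
# Lookup Series: default index 0..99, values "a0".."a99"
str_map = pd.Series([f'a{i}' for i in range(100)])
# Positional lookup, then drop the looked-up index before assigning
df['numbers_with_a'] = str_map.iloc[df.numbers].reset_index(drop=True)
print(df)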

str_map is a Series created from your "a0"... strings. str_map.iloc[df.numbers] uses your numbers as positions, giving you a new Series of the mapped values. That series is indexed by your numbers, so you drop that index and assign the result back to the original dataframe. Note that the assignment aligns on the index, so this works as written when df has the default 0..n-1 index.
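
The index alignment mentioned above can matter in practice. The following is a minimal sketch, assuming a dataframe whose index is not the default one (the index labels 10, 20, 30 are made up for illustration). It converts the lookup result to a NumPy array so that the assignment is positional rather than index-aligned:

```python
import pandas as pd

# Hypothetical frame whose index is NOT the default 0..n-1
df = pd.DataFrame({"numbers": [1, 4, 22]}, index=[10, 20, 30])
str_map = pd.Series([f"a{i}" for i in range(100)])

# Positional lookup; to_numpy() drops the index entirely, so the
# assignment below cannot be misaligned by index labels.
mapped = str_map.iloc[df["numbers"].to_numpy()].to_numpy()
df["numbers_with_a"] = mapped

print(df)
```

With the default index both variants should give the same column; the array form simply avoids depending on it.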

Answer 2

Score: 1

Convert your list to a NumPy array and index it with the column values, as below:

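import numpy as np

# Lookup list: position i holds the label for number i
dict_simple = []
for i in range(100):
    dict_simple.append('a' + str(i))

# Index the array with the whole column at once (no per-row Python call)
dict_array = np.array(dict_simple)
df['Numbers_with_a'] = dict_array[df['Numbers'].values]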
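
The array lookup above relies on the keys being exactly the positions 0..99. As a rough sketch of how the same idea might be adapted when the mapping is a real dictionary with 1-based keys (as the question text describes), one could build the lookup array from the dictionary first. The names `mapping` and `lookup` are hypothetical, and the sketch assumes non-negative integer keys:

```python
import numpy as np
import pandas as pd

# Hypothetical mapping with 1-based keys (1..100), as in the question text
mapping = {i: f"a{i}" for i in range(1, 101)}

# Lookup table of size max_key + 1; slots without a key stay None
lookup = np.full(max(mapping) + 1, None, dtype=object)
for key, value in mapping.items():
    lookup[key] = value

df = pd.DataFrame({"Numbers": [1, 50, 100, 7]})
# Fancy indexing: one vectorised gather instead of a per-row Python call
df["Numbers_with_a"] = lookup[df["Numbers"].to_numpy()]

print(df)
```

The table costs one slot per integer up to the largest key, so this only makes sense when the keys are small and reasonably dense.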

Answer 3

Score: 0

Thanks for all the answers. I ran a comparison:

import pandas as pd
import time
import numpy as np

# 10M random integers in 1..9 (the upper bound of randint is exclusive)
df = pd.DataFrame(np.random.randint(1, 10, size=(10000000, 1)), columns=list('N'))

# The same lookup table as a dict, a list, a NumPy array and a Series
dict_dictionary = {}
dict_list = []
for i in range(10):
    dict_dictionary[i] = 'a' + str(i)
    dict_list.append('a' + str(i))
dict_array = np.array(dict_list)
dict_series = pd.Series([f'a{i}' for i in range(10)])

print('map')
start_time = time.time()
df['Numbers_map'] = df['N'].map(dict_dictionary)
print(time.time() - start_time)

print('replace')
start_time = time.time()
# Assign the result rather than calling replace(..., inplace=True) on a
# selected column, which may not update df under copy-on-write
df['Numbers_replace'] = df['N'].replace(dict_dictionary)
print(time.time() - start_time)

print('array')
start_time = time.time()
df['Numbers_array'] = dict_array[df['N'].values]
print(time.time() - start_time)

print('series')
start_time = time.time()
df['Numbers_series'] = dict_series.iloc[df.N].reset_index(drop=True)
print(time.time() - start_time)

print('end')

Results (elapsed time in seconds) are as follows:

map
1.424480676651001
replace
3.657830238342285
array
1.4687621593475342
series
0.4687619209289551
end

"replace" gains some performance for small dictionaries, but overall the Series approach is the fastest.
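
Single time.time() measurements like the ones above can be noisy. A minimal sketch of a somewhat more robust comparison follows, assuming a much smaller frame (200,000 rows) so it runs quickly. The names `candidates`, `results` and `timings` are made up for illustration. It first checks that the approaches agree, then takes the best of three runs with the standard-library timeit module:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Smaller than the 10M-row frame above so the sketch runs quickly
df = pd.DataFrame({"N": rng.integers(1, 10, size=200_000)})

labels = [f"a{i}" for i in range(10)]
as_dict = dict(enumerate(labels))
as_array = np.array(labels)
as_series = pd.Series(labels)

candidates = {
    "map": lambda: df["N"].map(as_dict).to_numpy(),
    "array": lambda: as_array[df["N"].to_numpy()],
    "series": lambda: as_series.iloc[df["N"].to_numpy()].to_numpy(),
}

# All approaches should produce identical values before timing them
results = {name: fn() for name, fn in candidates.items()}
reference = list(results["map"])
all_equal = all(list(r) == reference for r in results.values())

# Best of 3 runs per approach; min is less sensitive to background noise
timings = {
    name: min(timeit.repeat(fn, number=1, repeat=3))
    for name, fn in candidates.items()
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

The absolute numbers will differ by machine and library version, so only the relative ordering on your own setup is meaningful.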

Answer 4

Score: 0

I have updated numpy and pandas to the newest versions, and now "map" is very close to the "series" approach. Computing time depends on the size of the dictionary: sometimes "map" is faster and sometimes "series" is.
I have also tried parallel-pandas to use all CPUs, but performance was worse than "map". Quite probably p_map is slower because I'm using only 2 CPUs.

import pandas as pd
from parallel_pandas import ParallelPandas
import time
import numpy as np

# Set up parallel_pandas with 2 worker processes
ParallelPandas.initialize(n_cpu=2, split_factor=2, disable_pr_bar=True)

df = pd.DataFrame(np.random.randint(1, 99, size=(1000000, 1)), columns=list('N'))

dict_dictionary = {}
dict_list = []
for i in range(100):
    dict_dictionary[i] = 'a' + str(i)
    dict_list.append('a' + str(i))
dict_array = np.array(dict_list)
dict_series = pd.Series([f'a{i}' for i in range(100)])

print('p_map')
start_time = time.time()
df['Numbers_p_map'] = df['N'].p_map(dict_dictionary)
print(time.time() - start_time)
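
To see how the dictionary size affects the "map" versus "series" comparison described above, one might time both for several sizes. This is only a sketch under assumed settings: the sizes 10, 100 and 1000, the 200,000-row column and the helper `time_once` are all illustrative choices, not taken from the measurements above:

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 200_000  # kept small so the sketch runs quickly


def time_once(fn):
    """Return elapsed wall-clock seconds for a single call of fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


timings = {}
for size in (10, 100, 1000):
    # Random keys in 0..size-1 and the same lookup as a dict and a Series
    values = pd.Series(rng.integers(0, size, size=n_rows))
    as_dict = {i: f"a{i}" for i in range(size)}
    as_series = pd.Series([f"a{i}" for i in range(size)])
    timings[size] = {
        "map": time_once(lambda: values.map(as_dict)),
        "series": time_once(lambda: as_series.iloc[values.to_numpy()]),
    }

for size, result in timings.items():
    print(size, {name: round(sec, 4) for name, sec in result.items()})
```

Each configuration is timed only once here, so repeated runs are advisable before drawing conclusions.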

huangapple
  • Posted on August 11, 2023, 04:41:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76879189.html