Why does numpy.in1d() run much faster in my example than in my project?
Question
The significant difference in speed between the two uses of `np.in1d()` can be attributed to the size of the input arrays and the nature of the data they contain. The key factors are:

- Array size: In your first example (the project code) you are working with relatively small data frames (TABLE1 and TABLE2), while in the second example you are dealing with much larger random arrays (arr1 and arr2).
- Data type: In the project code you are using structured data frames with mixed data types (e.g., strings and floats). In the second example you are working with homogeneously typed arrays of integers converted to strings (see the timing sketch after this list).
- Operation complexity: The second example uses simple, uniformly typed arrays, and `np.in1d()` runs faster on such homogeneous data than on the mixed data coming out of a structured data frame.
- Search space: In the second example you are searching for matches in random arrays, which can result in more efficient search operations than searching in a structured data frame with a specific pattern.
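To make the data-type point concrete, here is a minimal, self-contained timing sketch (not from the original post) that runs `np.in1d()` on the same values stored as int64, as fixed-width NumPy strings, and as object-dtype Python strings; the sizes 5 and 1170 mirror the question's example:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a_int = rng.integers(1, 3_000_000_000, size=5, dtype=np.int64)
b_int = rng.integers(1, 3_000_000_000, size=1170, dtype=np.int64)

a_str, b_str = a_int.astype(str), b_int.astype(str)        # fixed-width NumPy unicode strings
a_obj, b_obj = a_str.astype(object), b_str.astype(object)  # pure-Python str objects

for label, x, y in [("int64 ", a_int, b_int),
                    ("str   ", a_str, b_str),
                    ("object", a_obj, b_obj)]:
    t = timeit.timeit(lambda: np.in1d(x, y), number=1000)
    print(label, t / 1000)  # average seconds per np.in1d() call
```

On most machines the int64 version is the fastest and the object-dtype version the slowest, which matches the "data type" factor above.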
To optimize the code in your project to achieve a similar speed, consider the following suggestions:

- Array size: If your project consistently deals with small data frames, further optimization may not be necessary. If you expect larger data frames in the future, optimize with those sizes in mind.
- Data type: If possible, use homogeneous data types within your data frames. For example, convert all relevant columns to a single string type, as you did in the second example.
- Operation complexity: Simplify your data structures and operations whenever possible. If you can represent the data without mixing types in one structured table, performance should improve (a sketch follows below).
- Parallelism: Depending on your hardware and the nature of your data, you might explore parallel computing options (e.g., the `multiprocessing` module) to distribute the workload and potentially speed up the computation.

Remember that performance optimization often involves a trade-off between code complexity and speed. Strike a balance that suits your specific use case and data characteristics.
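As a rough illustration of the "homogeneous data" suggestions above, here is one possible way to restructure the lookup. This is a sketch only: the column names are taken from the question, and the `'-'` handling and the IN/OUT filtering are assumptions about the intent, not the OP's exact logic.

```python
import numpy as np
import pandas as pd

def build_lookup_arrays(table2: pd.DataFrame):
    """Convert the three columns used in the search into homogeneous NumPy arrays once."""
    manuf = table2["code_material_manufacturing"].to_numpy(dtype="U11")  # 10 digits + optional '-'
    material = table2["code_material"].to_numpy(dtype="U11")
    in_out = table2["input/output"].to_numpy(dtype="U3")
    return manuf, material, in_out

def input_codes_for(production: str, manuf, material, in_out):
    """Material codes fed IN to one production order, excluding '-' (half-finished) codes."""
    rows = (manuf == production) & (in_out == "IN")
    codes = material[rows]
    return codes[np.char.count(codes, "-") == 0]

# usage sketch:
#   manuf, material, in_out = build_lookup_arrays(TABLE2)
#   mask = np.in1d(input_codes_for("1000100001", manuf, material, in_out), manuf)
```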
The two tables I have to deal with in my project look like this:
TABLE1:
| code_material | yield |
|---|---|
| 1000100001 | 1000 |
| 1000100002 | 500 |
| 1000100003 | 1024 |
where code_material is a ten-digit number whose value does not exceed 4e+9
TABLE2:
| code_material_manufacturing | code_material | input/output | qty |
|---|---|---|---|
| 1000100001 | 1000154210 | IN | 100 |
| 1000100001 | 1123484257 | IN | 100 |
| 1000100001 | 1000100001 | OUT | 50 |
What I want to do is take each `code_material` from TABLE1 and then search the first column of TABLE2 to find its index.
Some values of `code_material` in TABLE1, and some of the material codes in TABLE2, may include the character '-', like '-1000100001', indicating that they are half-finished products.
I use `str` as the dtype for both columns, and `np.in1d()`, like this:
```python
import numpy as np
import pandas as pd

# read data from excel
TABLE1 = pd.read_excel(path1,
                       dtype={'code_material': str, 'yield': float})
TABLE2 = pd.read_excel(path2,
                       dtype={'code_material_manufacturing': str, 'code_material': str,
                              'input/output': str, 'qty': float})
# convert to numpy arrays
_output = TABLE1['code_material'].to_numpy()
output = _output[np.char.count(_output, '-') == 0]  # to remove half-finished codes
table2 = TABLE2.to_numpy()
# some operations to help me find those outputs, not casts
_idx_notproduction = np.argwhere(table2[:, 2] == 'IN')
idx_notproduction = np.argwhere(np.in1d(output, table2[_idx_notproduction, 1]))

# operating segment
j = 0
output = output.tolist()
while j < len(output):
    production = output[j]
    idx_in_table2 = np.argwhere(table2[:, 0] == production)
    # find those input casts
    idx_input = idx_in_table2[:-1]  # sliced to prevent production from counting itself in
    input = table2[idx_input, 1][np.char.count(table2[idx_input, 1], '-') == 0]
    idx = np.in1d(input, table2[:, 0])  # here's the in1d that confuses me
    j += 1
```
It takes about 0.00423s each time.
But when I tried a similar test case, I found that `np.in1d()` ran almost an order of magnitude faster than it does in my project (about 0.000563 s each time). Here is my example:
```python
import time
import numpy as np

arr1 = np.random.randint(1, int(3e9), (1, 5), dtype=np.int64)     # average of 5 codes per search
arr2 = np.random.randint(1, int(3e9), (1, 1170), dtype=np.int64)  # len(TABLE2) = 1170 in my project
arr1, arr2 = arr1.astype(str), arr2.astype(str)

cost = 0
a = range(1000)  # repetition count (not defined in the original snippet; assumed here)
for i in a:
    s = time.perf_counter()  # for the purpose of timing
    idx = np.in1d(arr1, arr2)
    cost += time.perf_counter() - s
print(cost / len(a))
```
I would like to ask what causes such a big difference in speed between the two `in1d()` calls. Is it possible to use this knowledge to optimize the code in my project to reach that speed?
Answer 1
Score: 2
Here is an answer built from previously posted comments, as requested by the OP:

The problem is certainly that `TABLE2.to_numpy()` results in a Numpy array containing pure-Python objects. Such objects are very inefficient (in both time and memory). You need to select one specific column and then convert it to a Numpy array. The operation you use will only be reasonably fast if all the dataframe columns are of the same type. Besides, note that comparing strings is expensive, as indicated by @hpaulj: to compare "1000100001" with another string, Numpy compares each of the 10 characters in a basic loop, while comparing one integer with another takes only about one instruction.
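To illustrate the difference between converting the whole frame and converting a single column, here is a small, hypothetical example (the DataFrame is built inline rather than read from Excel, and the column names simply follow the question):

```python
import numpy as np
import pandas as pd

TABLE2 = pd.DataFrame({
    "code_material_manufacturing": ["1000100001", "1000100001", "1000100001"],
    "code_material": ["1000154210", "1123484257", "1000100001"],
    "input/output": ["IN", "IN", "OUT"],
    "qty": [100.0, 100.0, 50.0],
})

whole = TABLE2.to_numpy()
print(whole.dtype)   # object: every cell is a boxed Python object

col = TABLE2["code_material"].to_numpy(dtype="U10")
print(col.dtype)     # <U10: one homogeneous fixed-width string array
```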
Also note that Pandas always stores strings as pure-Python objects. AFAIK, Numpy needs to update the reference count of each object and take care of locking/releasing the GIL, not to mention the memory indirections required, and strings are generally stored as Unicode, which tends to be more expensive (due to additional checks). All of this is far more expensive than comparing integers. Please reconsider the need to use strings. You can use a sentinel if needed (e.g. negative integers) and even map them to a set of predefined strings.
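Here is a sketch of the "negative integers as a sentinel" idea (my own illustration, not the answerer's code): the `'-'` prefix that marks half-finished products survives a plain integer conversion as a negative value, so the membership test runs on int64 instead of strings.

```python
import numpy as np
import pandas as pd

codes = pd.Series(["1000100001", "-1000100002", "1000100003"])
as_int = codes.astype(np.int64).to_numpy()   # '-1000100002' simply becomes a negative int

finished = as_int[as_int > 0]                # half-finished codes are the negative ones
lookup = np.array([1000100003, 1000154210], dtype=np.int64)
print(np.in1d(finished, lookup))             # integer comparisons, no per-character loop
```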
Last but not least, note that Pandas supports a type called `category`. It is usually significantly faster than plain strings when the number of unique strings is significantly smaller than the number of rows.
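A minimal sketch of the `category` suggestion (illustrative data only): the string values are stored once as categories and each row carries a small integer code, so a membership test avoids repeated string comparisons per row.

```python
import pandas as pd

s = pd.Series(["1000100001", "1000100002", "1000100001"] * 1000, dtype="category")
targets = ["1000100001", "9999999999"]

mask = s.isin(targets)   # the strings are compared once against the (few) categories
print(mask.sum())        # number of matching rows
```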
Comments