Why does numpy.in1d() run faster in my standalone example than in my project?

Overview

The significant speed difference between the two uses of np.in1d() comes down to the size of the input arrays and the nature of the data they contain. The key contributing factors are:

  1. Array Size: In your first example (the project code), you are working with small data frames (TABLE1 and TABLE2), while in the second example you are dealing with plain random arrays (arr1 and arr2) of comparable size.

  2. Data Type: In the project code, you are using structured data frames with mixed data types (e.g., strings and floats). In the second example, you are working with homogeneously typed arrays (integers converted to fixed-width strings).

  3. Operation Complexity: The second example feeds np.in1d() plain homogeneous arrays, and it runs faster on those than on columns sliced out of a mixed-type data frame, where each element is a boxed Python object.

  4. Search Space: In the second example you search a plain array directly, while in the project code the inputs are built through argwhere() and fancy indexing on an object-dtype table, which adds overhead before np.in1d() even starts.
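
As a rough illustration of factors 2 and 3, here is a minimal, self-contained benchmark sketch (the array sizes and repetition count are arbitrary choices, not taken from the question) comparing np.in1d() on int64, fixed-width string, and object-dtype versions of the same data:

import time
import numpy as np

haystack_int = np.random.randint(1, int(3e9), 1170, dtype=np.int64)
needles_int = haystack_int[:5].copy()

haystack_str = haystack_int.astype(str)      # fixed-width numpy unicode strings
needles_str = needles_int.astype(str)
haystack_obj = haystack_str.astype(object)   # pure-Python string objects
needles_obj = needles_str.astype(object)

def bench(f, n=1000):
    start = time.perf_counter()
    for _ in range(n):
        f()
    return (time.perf_counter() - start) / n

print('int64 :', bench(lambda: np.in1d(needles_int, haystack_int)))
print('str   :', bench(lambda: np.in1d(needles_str, haystack_str)))
print('object:', bench(lambda: np.in1d(needles_obj, haystack_obj)))

On typical hardware the object-dtype version comes out slowest of the three, which mirrors the gap described above.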

To bring the project code closer to that speed, consider the following suggestions:

  1. Array Size: If your project code consistently deals with small data frames, further optimization may not be necessary. However, if you expect larger data frames in the future, it is worth optimizing for larger datasets now.

  2. Data Type: If possible, use homogeneous data types within your data frames. For example, convert all relevant columns to a single type, as you did in the second example (a sketch follows this list).

  3. Operation Complexity: Simplify your data structures and operations whenever possible. If you can represent your data more efficiently without using structured data frames, it might improve performance.

  4. Parallelism: Depending on your hardware and the nature of your data, you might explore parallel computing options (e.g., using the multiprocessing module) to distribute the workload and potentially speed up the computation.
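
Here is a concrete sketch of suggestions 2 and 3 (the frame below is a hypothetical stand-in for TABLE1, reusing the column names from the question): parsing the codes as int64 turns the '-' prefix of half-finished products into an ordinary negative number, so the filter needs no string scan at all:

import numpy as np
import pandas as pd

# Hypothetical stand-in for TABLE1.
TABLE1 = pd.DataFrame({'code_material': ['1000100001', '-1000100002', '1000100003'],
                       'yield': [1000.0, 500.0, 1024.0]})

codes = pd.to_numeric(TABLE1['code_material']).to_numpy()  # one homogeneous int64 array
output = codes[codes > 0]                                  # drop half-finished products
print(output)                                              # [1000100001 1000100003]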

Remember that performance optimization often involves a trade-off between code complexity and speed. It's essential to strike a balance that suits your specific use case and data characteristics.

Question

The two tables I have to deal with in my project look like this:

TABLE1:

code_material  yield
1000100001     1000
1000100002     500
1000100003     1024

where code_material is a ten-digit number whose value does not exceed 4e+9

TABLE2:

code_material_manufacturing  code_material  input/output  qty
1000100001                   1000154210     IN            100
1000100001                   1123484257     IN            100
1000100001                   1000100001     OUT           50

What I want to do is take each code_material from TABLE1 and then search the first column of TABLE2 to find its row indices.
Some values of code_material in TABLE1 and of code_material in TABLE2 may include the character '-', like '-1000100001', indicating that they are half-finished products.

I use str as the dtype for these columns, and use np.in1d() like this:

import numpy as np
import pandas as pd

# read data from excel
TABLE1 = pd.read_excel(path1,
                       dtype={'code_material': str, 'yield': float})
TABLE2 = pd.read_excel(path2,
                       dtype={'code_material_manufacturing': str, 'code_material': str,
                              'input/output': str, 'qty': float})

# convert to numpy arrays
_output = TABLE1['code_material'].to_numpy()
output = _output[np.char.count(_output, '-') == 0]  # remove half-finished products

table2 = TABLE2.to_numpy()

# some operations to help me find those outputs, not casts
_idx_notproduction = np.argwhere(table2[:, 2] == 'IN')
idx_notproduction = np.argwhere(np.in1d(output, table2[_idx_notproduction, 1]))

# operating segment
j = 0
output = output.tolist()

while j < len(output):
  production = output[j]
  idx_in_table2 = np.argwhere(table2[:, 0] == production)
  # find those input casts
  idx_input = idx_in_table2[:-1]  # sliced to prevent production from counting itself in
  input = table2[idx_input, 1][np.char.count(table2[idx_input, 1], '-') == 0]

  idx = np.in1d(input, table2[:, 0])  # here's the in1d() that confuses me
  j += 1

It takes about 0.00423s each time.

But when I tried a similar standalone case, I found that np.in1d() ran almost an order of magnitude faster than it does in my project (about 0.000563s each time). Here is my example:

import time
import numpy as np

arr1 = np.random.randint(1, int(3e9), (1, 5), dtype=np.int64)     # average of 5 codes per search
arr2 = np.random.randint(1, int(3e9), (1, 1170), dtype=np.int64)  # len(TABLE2) = 1170 in my project
arr1, arr2 = arr1.astype(str), arr2.astype(str)

a = range(1000)  # repeat the measurement; the run count is arbitrary
cost = 0
for i in a:
  s = time.perf_counter()  # for the purpose of timing
  idx = np.in1d(arr1, arr2)
  cost += time.perf_counter() - s

print(cost / len(a))

I would like to ask: what causes such a big difference in speed between the two in1d() calls? Is it possible to use this cause to optimize my project code to the same speed?

Answer 1

Score: 2

Here is an answer built from previously posted comments, as requested by the OP:

The problem is certainly that TABLE2.to_numpy() results in a Numpy array containing pure-Python objects. Such objects are very inefficient (in both time and memory). You need to select one specific column and then convert it to a Numpy array; the operation you use will only be reasonably fast if all the dataframe's columns are of the same type. Besides, note that comparing strings is expensive, as indicated by @hpaulj: to compare "1000100001" with another string, Numpy compares each of the 10 characters in a basic loop, while comparing one integer with another takes only about one instruction.
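
A small sketch of that pitfall, using a toy stand-in for TABLE2: converting the whole mixed-type frame yields an object array, while converting one column at a time gives a clean dtype:

import numpy as np
import pandas as pd

# Toy stand-in for TABLE2 with mixed column types.
df = pd.DataFrame({'code_material': ['1000100001', '1000100002'],
                   'qty': [100.0, 50.0]})

print(df.to_numpy().dtype)                        # object: every cell is a Python object
print(df['code_material'].to_numpy().dtype)       # object: pandas keeps strings as objects
print(df['code_material'].to_numpy(dtype='U10').dtype)      # <U10: fixed-width strings
print(pd.to_numeric(df['code_material']).to_numpy().dtype)  # int64: cheapest to compare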

Besides, note that Pandas always stores strings as pure-Python objects. AFAIK, Numpy then needs to update the reference count of each object and to acquire/release the GIL, not to mention the memory indirections involved, and strings are generally stored as Unicode, which tends to be more expensive (due to additional checks). All of this is far more expensive than comparing integers. Please reconsider the need to use strings: you can use a sentinel value if needed (e.g. negative integers) and even map the values to a set of predefined strings.
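
A sketch of the sentinel idea applied to the codes from the question (assuming the '-' prefix is the only marker for half-finished products): parse the column as integers so those rows come out negative, then run in1d() on integers instead of strings:

import numpy as np
import pandas as pd

codes = pd.to_numeric(pd.Series(['1000100001', '-1000100001', '1000154210'])).to_numpy()

half_finished = codes < 0                  # the sign acts as the sentinel
haystack = np.array([1000100001, 1000154210], dtype=np.int64)
print(half_finished)                       # [False  True False]
print(np.in1d(np.abs(codes), haystack))    # integer in1d: [ True  True  True]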

Last but not least, note that Pandas supports a type called category. It is usually significantly faster than plain strings when the number of unique strings is much smaller than the number of rows.
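
A minimal sketch of the category type on toy data: each unique string is stored once, and every row holds a small integer code, which is what comparisons then operate on:

import pandas as pd

s = pd.Series(['1000100001', '1000100001', '1000100002']).astype('category')
print(s.cat.codes.to_numpy())    # [0 0 1]: one small integer per row
print(list(s.cat.categories))    # ['1000100001', '1000100002']: stored once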
