2022年3月22日 02:23:31go评论118阅读模式

英文:

Add unique values in an array as a value in concurrent map golang?

问题

我正在遍历flatProduct.Catalogs切片，并在Go语言中填充我的productCatalog并发映射。我正在使用upsert方法，以便只将唯一的productID添加到我的productCatalog映射中。

这段代码使用线性扫描来检查重复的产品ID，但在我的情况下，我有超过70万个产品ID，所以速度非常慢。我正在寻找提高效率的方法。

下面的代码由多个并行的goroutine调用，所以我在这里使用并发映射来填充数据。

var productRows []ClientProduct
err = json.Unmarshal(byteSlice, &productRows)
if err != nil {
    return err
}
for i := range productRows {
    flatProduct, err := r.Convert(spn, productRows[i])
    if err != nil {
        return err
    }
    if flatProduct.StatusCode == definitions.DONE {
        continue
    }
    r.products.Set(strconv.Itoa(flatProduct.ProductId, 10), flatProduct)
    for _, catalogId := range flatProduct.Catalogs {
        catalogValue := strconv.FormatInt(int64(catalogId), 10)
        // 如何改进下面的Upsert代码，使其在我的情况下运行更快？
        r.productCatalog.Upsert(catalogValue, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
            productID := newValue.(int64)
            if valueInMap == nil {
                return []int64{productID}
            }
            oldIDs := valueInMap.([]int64)
            for _, id := range oldIDs {
                if id == productID {
                    // 已存在，不添加重复项。
                    return oldIDs
                }
            }
            return append(oldIDs, productID)
        })
    }
}

上述的upsert代码对我来说非常慢，它花费了很多时间将唯一的产品ID作为值添加到我的并发映射中。这是productCatalog的定义方式：

productCatalog *cmap.ConcurrentMap

这是我正在使用的upsert方法-https://github.com/orcaman/concurrent-map/blob/master/concurrent_map.go#L56

这是我从这个cmap中读取数据的方式：

catalogProductMap := clientRepo.GetProductCatalogMap()
productIds, ok := catalogProductMap.Get("200")
var data = productIds.([]int64)
for _, pid := range data {
  ...
}

英文:

I am iterating over flatProduct.Catalogs slice and populating my productCatalog concurrent map in golang. I am using upsert method so that I can add only unique productID's into my productCatalog map.

This uses a linear scan to check for duplicate product IDs but in my case I have more than 700k productId's so it is very slow for me. I am looking for ways to make it more efficient.

Below code is called by multiple goroutines in parallel that is why I am using concurrent map here to populate data into it.

var productRows []ClientProduct
err = json.Unmarshal(byteSlice, &amp;productRows)
if err != nil {
    return err
}
for i := range productRows {
    flatProduct, err := r.Convert(spn, productRows[i])
    if err != nil {
        return err
    }
    if flatProduct.StatusCode == definitions.DONE {
        continue
    }
    r.products.Set(strconv.Itoa(flatProduct.ProductId, 10), flatProduct)
    for _, catalogId := range flatProduct.Catalogs {
        catalogValue := strconv.FormatInt(int64(catalogId), 10)
        // how can I improve below Upsert code for `productCatalog` map so that it can runs faster for me?
        r.productCatalog.Upsert(catalogValue, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
            productID := newValue.(int64)
            if valueInMap == nil {
                return []int64{productID}
            }
            oldIDs := valueInMap.([]int64)
            for _, id := range oldIDs {
                if id == productID {
                    // Already exists, don&#39;t add duplicates.
                    return oldIDs
                }
            }
            return append(oldIDs, productID)
        })
    }
}

Above upsert code is very slow for me and it takes lot of time to add unique product id's as a value in my concurrent map. Here is how productCatalog is defined.

productCatalog *cmap.ConcurrentMap

Here is the upsert method which I am using - https://github.com/orcaman/concurrent-map/blob/master/concurrent_map.go#L56

This is how I am reading data from this cmap:

catalogProductMap := clientRepo.GetProductCatalogMap()
productIds, ok := catalogProductMap.Get(&quot;200&quot;)
var data = productIds.([]int64)
for _, pid := range data {
  ...
}

答案1

得分: 1

总结评论中的答案：

upsert函数的时间复杂度是O(n**2)，其中n是切片的长度。

问题正如你所提到的，遍历整个切片以查找重复项是一个问题。可以使用另一个映射来避免这个问题。

示例：

r.productCatalog.Upsert(catalogValue, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
	productID := newValue.(int64)
	if valueInMap == nil {
		return map[int64]struct{}{productID: {}}
	}
	oldIDs := valueInMap.(map[int64]struct{})
    
    // value is irrelevant, no need to check if key exists 
	oldIDs[productID] = struct{}{}
	return oldIDs
})

嵌套映射会增加很多内存分配，导致内存使用量增加，对吗？

不，使用空结构体不会创建新的分配或增加内存使用量。你可以找到很多关于空结构体及其用法的文章和问题（例如：https://stackoverflow.com/questions/47544156/what-uses-a-type-with-empty-struct-has-in-go）。

注意：你可以使用一些优化的数组搜索方法，例如sort.Search中使用的二分搜索，但是它需要有序数组。

英文:

To summarize answers from comments:

> The upsert function is O(n**2) where n is the length of the slice.

The problem as you also mentioned is iterating through whole slice to find duplicate. This can be avoided using another map.

Example:

r.productCatalog.Upsert(catalogValue, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
	productID := newValue.(int64)
	if valueInMap == nil {
		return map[int64]struct{}{productID: {}}
	}
	oldIDs := valueInMap.(map[int64]struct{})
    
    // value is irrelevant, no need to check if key exists 
	oldIDs[productID] = struct{}{}
	return oldIDs
})

> Nested map will add lot of allocation causing lot of memory usage right?

Nope, using empty struct won't create new allocations or increase memory usage. You can find plenty of articles/questions about empty struct and its usage. (e. g. https://stackoverflow.com/questions/47544156/what-uses-a-type-with-empty-struct-has-in-go)

Note: you could use some kind of optimised search for array like binary search used by sort.Search, but it requires sorted array.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Add unique values in an array as a value in concurrent map golang?

问题

答案1

总结评论中的答案：

To summarize answers from comments:

Unmarshalling XML using Go: How to find attributes with the same value?

无法同时使用goroutines来查找最大值，直到上下文被取消。

如何告诉json.Unmarshal使用结构体而不是接口。

在Python中按日期时间对字典按值进行排序

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。