英文:
Add unique values in an array as a value in concurrent map golang?
问题
我正在遍历flatProduct.Catalogs
切片,并在Go语言中填充我的productCatalog
并发映射。我正在使用upsert方法,以便只将唯一的productID
添加到我的productCatalog
映射中。
这段代码使用线性扫描来检查重复的产品ID,但在我的情况下,我有超过70万个产品ID,所以速度非常慢。我正在寻找提高效率的方法。
下面的代码由多个并行的goroutine调用,所以我在这里使用并发映射来填充数据。
var productRows []ClientProduct
err = json.Unmarshal(byteSlice, &productRows)
if err != nil {
return err
}
for i := range productRows {
flatProduct, err := r.Convert(spn, productRows[i])
if err != nil {
return err
}
if flatProduct.StatusCode == definitions.DONE {
continue
}
r.products.Set(strconv.Itoa(flatProduct.ProductId, 10), flatProduct)
for _, catalogId := range flatProduct.Catalogs {
catalogValue := strconv.FormatInt(int64(catalogId), 10)
// 如何改进下面的Upsert代码,使其在我的情况下运行更快?
r.productCatalog.Upsert(catalogValue, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
productID := newValue.(int64)
if valueInMap == nil {
return []int64{productID}
}
oldIDs := valueInMap.([]int64)
for _, id := range oldIDs {
if id == productID {
// 已存在,不添加重复项。
return oldIDs
}
}
return append(oldIDs, productID)
})
}
}
上述的upsert代码对我来说非常慢,它花费了很多时间将唯一的产品ID作为值添加到我的并发映射中。这是productCatalog
的定义方式:
productCatalog *cmap.ConcurrentMap
这是我正在使用的upsert
方法-https://github.com/orcaman/concurrent-map/blob/master/concurrent_map.go#L56
这是我从这个cmap中读取数据的方式:
catalogProductMap := clientRepo.GetProductCatalogMap()
productIds, ok := catalogProductMap.Get("200")
var data = productIds.([]int64)
for _, pid := range data {
...
}
英文:
I am iterating over flatProduct.Catalogs
slice and populating my productCatalog
concurrent map in golang. I am using upsert method so that I can add only unique productID's
into my productCatalog
map.
This uses a linear scan to check for duplicate product IDs but in my case I have more than 700k productId's so it is very slow for me. I am looking for ways to make it more efficient.
Below code is called by multiple goroutines in parallel that is why I am using concurrent map here to populate data into it.
var productRows []ClientProduct
err = json.Unmarshal(byteSlice, &productRows)
if err != nil {
return err
}
for i := range productRows {
flatProduct, err := r.Convert(spn, productRows[i])
if err != nil {
return err
}
if flatProduct.StatusCode == definitions.DONE {
continue
}
r.products.Set(strconv.Itoa(flatProduct.ProductId, 10), flatProduct)
for _, catalogId := range flatProduct.Catalogs {
catalogValue := strconv.FormatInt(int64(catalogId), 10)
// how can I improve below Upsert code for `productCatalog` map so that it can runs faster for me?
r.productCatalog.Upsert(catalogValue, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
productID := newValue.(int64)
if valueInMap == nil {
return []int64{productID}
}
oldIDs := valueInMap.([]int64)
for _, id := range oldIDs {
if id == productID {
// Already exists, don't add duplicates.
return oldIDs
}
}
return append(oldIDs, productID)
})
}
}
Above upsert code is very slow for me and it takes lot of time to add unique product id's as a value in my concurrent map. Here is how productCatalog
is defined.
productCatalog *cmap.ConcurrentMap
Here is the upsert
method which I am using - https://github.com/orcaman/concurrent-map/blob/master/concurrent_map.go#L56
This is how I am reading data from this cmap:
catalogProductMap := clientRepo.GetProductCatalogMap()
productIds, ok := catalogProductMap.Get("200")
var data = productIds.([]int64)
for _, pid := range data {
...
}
答案1
得分: 1
总结评论中的答案:
upsert函数的时间复杂度是O(n**2),其中n是切片的长度。
问题正如你所提到的,遍历整个切片以查找重复项是一个问题。可以使用另一个映射来避免这个问题。
示例:
r.productCatalog.Upsert(catalogValue, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
productID := newValue.(int64)
if valueInMap == nil {
return map[int64]struct{}{productID: {}}
}
oldIDs := valueInMap.(map[int64]struct{})
// value is irrelevant, no need to check if key exists
oldIDs[productID] = struct{}{}
return oldIDs
})
嵌套映射会增加很多内存分配,导致内存使用量增加,对吗?
不,使用空结构体不会创建新的分配或增加内存使用量。你可以找到很多关于空结构体及其用法的文章和问题(例如:https://stackoverflow.com/questions/47544156/what-uses-a-type-with-empty-struct-has-in-go)。
注意:你可以使用一些优化的数组搜索方法,例如sort.Search中使用的二分搜索,但是它需要有序数组。
英文:
To summarize answers from comments:
> The upsert function is O(n**2) where n is the length of the slice.
The problem as you also mentioned is iterating through whole slice to find duplicate. This can be avoided using another map.
Example:
r.productCatalog.Upsert(catalogValue, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
productID := newValue.(int64)
if valueInMap == nil {
return map[int64]struct{}{productID: {}}
}
oldIDs := valueInMap.(map[int64]struct{})
// value is irrelevant, no need to check if key exists
oldIDs[productID] = struct{}{}
return oldIDs
})
> Nested map will add lot of allocation causing lot of memory usage right?
Nope, using empty struct won't create new allocations or increase memory usage. You can find plenty of articles/questions about empty struct and its usage. (e. g. https://stackoverflow.com/questions/47544156/what-uses-a-type-with-empty-struct-has-in-go)
Note: you could use some kind of optimised search for array like binary search used by sort.Search, but it requires sorted array.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论