英文:
string vs integer as map key for memory utilization in golang?
问题
我有一个名为read
的函数,被多个go routines
调用来读取s3
文件,并填充两个concurrent map,如下所示:
- 在服务器启动期间,调用下面的
read
函数来填充两个concurrent map。 - 每隔30秒,定期调用
read
函数来读取新的s3
文件,并再次填充两个concurrent map,以获取一些新数据。
因此,在整个应用程序的整个生命周期中的任何给定时间点,我的两个concurrent map都有一些数据,并且也会定期更新。
func (r *clientRepository) read(file string, bucket string) error {
var err error
//... 读取s3文件
for {
rows, err := pr.ReadByNumber(r.cfg.RowsToRead)
if err != nil {
return errs.Wrap(err)
}
if len(rows) <= 0 {
break
}
byteSlice, err := json.Marshal(rows)
if err != nil {
return errs.Wrap(err)
}
var productRows []ParquetData
err = json.Unmarshal(byteSlice, &productRows)
if err != nil {
return errs.Wrap(err)
}
for i := range productRows {
var flatProduct definitions.CustomerInfo
err = r.ConvertData(spn, &productRows[i], &flatProduct)
if err != nil {
return errs.Wrap(err)
}
// 在这里填充第一个concurrent map
r.products.Set(strconv.FormatInt(flatProduct.ProductId, 10), &flatProduct)
for _, catalogId := range flatProduct.Catalogs {
strCatalogId := strconv.FormatInt(int64(catalogId), 10)
// 在这里更新第二个concurrent map
r.productCatalog.Upsert(strCatalogId, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
productID := newValue.(int64)
if valueInMap == nil {
return map[int64]struct{}{productID: {}}
}
oldIDs := valueInMap.(map[int64]struct{})
// value is irrelevant, no need to check if key exists
oldIDs[productID] = struct{}{}
return oldIDs
})
}
}
}
return nil
}
在上述代码中,flatProduct.ProductId
和strCatalogId
是整数,但我将它们转换为字符串,因为concurrent map只能使用字符串作为键。然后,我有以下三个函数,用于从上述填充的concurrent map中获取数据:
func (r *clientRepository) GetProductMap() *cmap.ConcurrentMap {
return r.products
}
func (r *clientRepository) GetProductCatalogMap() *cmap.ConcurrentMap {
return r.productCatalog
}
func (r *clientRepository) GetProductData(pid string) *definitions.CustomerInfo {
pd, ok := r.products.Get(pid)
if ok {
return pd.(*definitions.CustomerInfo)
}
return nil
}
我有一个使用案例,需要从多个go routines中填充map,然后从一组主应用程序线程中读取这些map中的数据,因此它需要是线程安全的,并且在从主线程读取数据时速度应该足够快,而不需要太多的锁定。
问题陈述
我正在处理大量数据,大约30-40 GB的数据来自我读入内存的所有这些文件。我在这里使用concurrent map解决了大部分并发问题,但是concurrent map的键是string
,它没有实现整数键的功能。在我的情况下,键只是一个可以是int32的产品ID,所以将所有这些产品ID存储为字符串在内存中是否值得?我认为与将所有这些键存储为整数相比,字符串分配需要更多的内存。至少在C/C++
中是这样的,所以我认为在Golang
中也应该是同样的情况。
在使用map时,有什么可以改进的地方,以减少内存使用量,并且在从主线程读取这些map中的数据时不会丧失性能?
我正在使用这个repo中的concurrent map
,它没有整数键的实现。
更新
我正在尝试在我的代码中使用cmap_int
来尝试它。
type clientRepo struct {
customers *cmap.ConcurrentMap
customersCatalog *cmap.ConcurrentMap
}
func NewClientRepository(logger log.Logger) (ClientRepository, error) {
// ....
customers := cmap.New[string]()
customersCatalog := cmap.New[string]()
r := &clientRepo{
customers: &customers,
customersCatalog: &customersCatalog,
}
// ....
return r, nil
}
但是我得到以下错误:
Cannot use '&products' (type *ConcurrentMap[V]) as the type *cmap.ConcurrentMap
我需要更改clientRepo
结构中的什么,以便它可以与使用泛型的新版本的concurrent map一起工作?
英文:
I have a below read
function which is called by multiple go routines
to read s3
files and it populates two concurrent map as shown below.
- During server startup, it calls
read
function below to populate two concurrent map. - And also periodically every 30 seconds, it calls
read
function again to read new s3 files and populate two concurrent map again with some new data.
So basically at a given state of time during the whole lifecycle of this app, both my concurrent map
have some data and also periodically being updated too.
func (r *clientRepository) read(file string, bucket string) error {
var err error
//... read s3 file
for {
rows, err := pr.ReadByNumber(r.cfg.RowsToRead)
if err != nil {
return errs.Wrap(err)
}
if len(rows) <= 0 {
break
}
byteSlice, err := json.Marshal(rows)
if err != nil {
return errs.Wrap(err)
}
var productRows []ParquetData
err = json.Unmarshal(byteSlice, &productRows)
if err != nil {
return errs.Wrap(err)
}
for i := range productRows {
var flatProduct definitions.CustomerInfo
err = r.ConvertData(spn, &productRows[i], &flatProduct)
if err != nil {
return errs.Wrap(err)
}
// populate first concurrent map here
r.products.Set(strconv.FormatInt(flatProduct.ProductId, 10), &flatProduct)
for _, catalogId := range flatProduct.Catalogs {
strCatalogId := strconv.FormatInt(int64(catalogId), 10)
// upsert second concurrent map here
r.productCatalog.Upsert(strCatalogId, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
productID := newValue.(int64)
if valueInMap == nil {
return map[int64]struct{}{productID: {}}
}
oldIDs := valueInMap.(map[int64]struct{})
// value is irrelevant, no need to check if key exists
oldIDs[productID] = struct{}{}
return oldIDs
})
}
}
}
return nil
}
In above code flatProduct.ProductId
and strCatalogId
are integer but I am converting them into string bcoz concurrent map works with string only. And then I have below three functions which is used by my main application threads to get data from the concurrent map populated above.
func (r *clientRepository) GetProductMap() *cmap.ConcurrentMap {
return r.products
}
func (r *clientRepository) GetProductCatalogMap() *cmap.ConcurrentMap {
return r.productCatalog
}
func (r *clientRepository) GetProductData(pid string) *definitions.CustomerInfo {
pd, ok := r.products.Get(pid)
if ok {
return pd.(*definitions.CustomerInfo)
}
return nil
}
I have a use case where I need to populate map from multiple go routines and then read data from those maps from bunch of main application threads so it needs to be thread safe and it should be fast enough as well without much locking.
Problem Statement
I am dealing with lots of data like 30-40 GB worth of data from all these files which I am reading into memory. I am using concurrent map here which solves most of my concurrency issues but the key for the concurrent map is string
and it doesn't have any implementation where key can be integer. In my case key is just a product id which can be int32 so is it worth it storing all those product id's as string in this concurrent map? I think string allocation takes more memory compare to storing all those keys as integer? At least it does in c/c++
so I am assuming it should be same case here in golang
too.
Is there anything I can to improve here w.r.t map usage so that I can reduce memory utilization plus I don't lose performance as well while reading data from these maps from main threads?
I am using concurrent map
from this repo which doesn't have implementation for key as integer.
Update
I am trying to use cmap_int in my code to try it out.
type clientRepo struct {
customers *cmap.ConcurrentMap
customersCatalog *cmap.ConcurrentMap
}
func NewClientRepository(logger log.Logger) (ClientRepository, error) {
// ....
customers := cmap.New[string]()
customersCatalog := cmap.New[string]()
r := &clientRepo{
customers: &customers,
customersCatalog: &customersCatalog,
}
// ....
return r, nil
}
But I am getting error as:
Cannot use '&products' (type *ConcurrentMap[V]) as the type *cmap.ConcurrentMap
What I need to change in my clientRepo
struct so that it can work with new version of concurrent map which uses generics?
答案1
得分: 3
我不知道Go语言中并发映射的具体实现细节,但如果它使用字符串作为键,我猜背后可能会同时存储字符串和字符串的哈希值(用于实际的索引操作)。
这会占用相当多的内存,对此无法做任何处理,因为并发映射只使用字符串作为键。
如果有一种使用整数的映射,它可能也会使用这些整数的哈希值。平滑的哈希分布是实现良好且均匀的查找性能所必需的特性,尤其是在键数据本身不均匀分布的情况下。这就像你需要一个非常简单的映射实现!
我在想,如果你的产品ID适合32位(或者可以修改为32位,或者其他可接受的整数长度),那么一个简单的数组是否足够。是的,这样你会分配大量的内存,可能有大片未使用的区域。然而,索引非常快速,并且操作系统的虚拟内存子系统会确保你不使用的数组区域不会被换入内存。注意 - 我在这里非常注重C语言和固定大小对象的概念,对于Go语言来说可能不适用,所以这可能是一个错误的建议。
为了坚持下去,只要数组没有暗示在分配时进行初始化(例如,在C语言中,数组不会被编译器初始化),分配并不意味着所有的元素都在内存中,而且只有数组中最常用的部分会在RAM中,这得益于操作系统的虚拟内存子系统。
编辑
你可以有一个数组的映射,其中每个数组覆盖一定范围的产品ID。这将近似产生相同的效果,以存储哈希和字符串为代价,换取存储空引用。如果产品ID以某种结构化方式聚集在一起,这可能效果很好。
另外,只是一个想法,我对Go语言的了解非常有限。Go语言是否通过引用存储对象?如果是这样,那么一个对象数组实际上是一个引用数组(因此大小固定),只有在需要时才分配实际的对象(即数组中很多元素是空引用)。这对于我之前提到的大数组的建议听起来不太好...
英文:
I don't know the implementation details of concurrent map in Go, but if it's using a string as a key I'm guessing that behind the scenes it's storing both the string and a hash of the string (which will be used for actual indexing operations).
That is going to be something of a memory hog, and there'll be nothing that can be done about that as concurrent map uses only strings for key.
If there were some sort of map that did use integers, it'd likely be using hashes of those integers anyway. A smooth hash distribution is a necessary feature for good and uniform lookup performance, in the event that key data itself is not uniformly distributed. It's almost like you need a very simple map implementation!
I'm wondering if a simple array would do, if your product ID's fit within 32bits (or can be munged to do so, or down to some other acceptable integer length). Yes, that way you'd have a large amount of memory allocated, possibly with large tracts unused. However, indexing is super-rapid, and the OS's virtual memory subsystem would ensure that areas of the array that you don't index aren't swapped in. Caveat - I'm thinking very much in terms of C and fixed-size objects here - less so Go - so this may be a bogus suggestion.
To persevere, so long as there's nothing about the array that implies initialisation-on-allocation (e.g. in C the array wouldn't get initialised by the compiler), allocation doesn't automatically mean it's all in memory, all at once, and only the most commonly used areas of the array will be in RAM courtesy of the OS's virtual memory subsystem.
EDIT
You could have a map of arrays, where each array covered a range of product Ids. This would be close to the same effect, trading off storage of hashes and strings against storage of null references. If product ids are clumped in some sort of structured way, this could work well.
Also, just a thought, and I'm showing a total lack of knowledge of Go here. Does Go store objects by reference? In which case wouldn't an array of objects actually be an array of references (so, fixed in size) and the actual objects allocated only as needed (ie a lot of the array is null references)? That doesn't sound good for my one big array suggestion...
答案2
得分: 2
你使用的库相对简单,你可以将所有的string
替换为int32
(并修改哈希函数),它仍然可以正常工作。
我对替换后的版本进行了一个小型(不太严格)的基准测试,结果如下:
$ go test -bench=. -benchtime=10x -benchmem
goos: linux
goarch: amd64
pkg: maps
BenchmarkCMapAlloc-4 10 174272711 ns/op 49009948 B/op 33873 allocs/op
BenchmarkCMapAllocSS-4 10 369259624 ns/op 102535456 B/op 1082125 allocs/op
BenchmarkCMapUpdateAlloc-4 10 114794162 ns/op 0 B/op 0 allocs/op
BenchmarkCMapUpdateAllocSS-4 10 192165246 ns/op 16777216 B/op 1048576 allocs/op
BenchmarkCMap-4 10 1193068438 ns/op 5065 B/op 41 allocs/op
BenchmarkCMapSS-4 10 2195078437 ns/op 536874022 B/op 33554471 allocs/op
带有SS
后缀的基准测试是原始的字符串版本。因此,使用整数作为键占用的内存更少,运行速度更快,这是可以预期的。字符串版本每次插入大约多分配了50字节的内存(这不是实际的内存使用情况)。
基本上,Go中的字符串只是一个结构体:
type stringStruct struct {
str unsafe.Pointer
len int
}
因此,在64位机器上,至少需要8字节(指针)+ 8字节(长度)+ len(底层字节)
字节来存储一个字符串。将其转换为int32或int64肯定会节省内存。但是,我认为CustomerInfo和目录集占用的内存最多,我不认为会有很大的改进。
(顺便说一句,调整库中的SHARD_COUNT
可能也会有所帮助。)
英文:
The library you use is relatively simple and you may just replace all string
into int32
(and modify the hashing function) and it will still work fine.
I ran a tiny (and not that rigorous) benchmark against the replaced version:
$ go test -bench=. -benchtime=10x -benchmem
goos: linux
goarch: amd64
pkg: maps
BenchmarkCMapAlloc-4 10 174272711 ns/op 49009948 B/op 33873 allocs/op
BenchmarkCMapAllocSS-4 10 369259624 ns/op 102535456 B/op 1082125 allocs/op
BenchmarkCMapUpdateAlloc-4 10 114794162 ns/op 0 B/op 0 allocs/op
BenchmarkCMapUpdateAllocSS-4 10 192165246 ns/op 16777216 B/op 1048576 allocs/op
BenchmarkCMap-4 10 1193068438 ns/op 5065 B/op 41 allocs/op
BenchmarkCMapSS-4 10 2195078437 ns/op 536874022 B/op 33554471 allocs/op
Benchmarks with a SS
suffix is the original string version. So using integers as keys takes less memory and runs faster, as anyone would expect. The string version allocates about 50 bytes more each insertion. (This is not the actual memory usage though.)
Basically, a string in go is just a struct:
type stringStruct struct {
str unsafe.Pointer
len int
}
So on a 64-bit machine, it takes at least 8 bytes (pointer) + 8 bytes (length) + len(underlying bytes)
bytes to store a string. Turning it into a int32 or int64 will definitely save memory. However, I assume that CustomerInfo and the catalog sets takes the most memory and I don't think there will be a great improvement.
(By the way, tuning the SHARD_COUNT
in the library might also help a bit.)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论