Go: Thread-Safe Concurrency Issue with Sparse Array Read & Write
Question
I'm writing a search engine in Go in which I have an inverted index of words to the corresponding results for each word. There is a set dictionary of words and so the words are already converted into a StemID, which is an integer starting from 0. This allows me to use a slice of pointers (i.e. a sparse array) to map each StemID to the structure which contains the results of that query, e.g. var StemID_to_Index []*resultStruct. If aardvark is 0 then the pointer to the resultStruct for aardvark is located at StemID_to_Index[0], which will be nil if the result for this word is currently not loaded.
There is not enough memory on the server to store all of this in memory, so the structure for each StemID will be saved as a separate file and these can be loaded into the StemID_to_Index slice. If StemID_to_Index is currently nil for this StemID then the result is not cached and needs to be loaded; otherwise it's already loaded (cached) and so can be used directly. Each time a new result is loaded the memory usage is checked, and if it's over the threshold then 2/3 of the loaded results are thrown away (StemID_to_Index is set to nil for these StemIDs and a garbage collection is forced).
My problem is the concurrency. What is the fastest and most efficient way in which I can have multiple threads searching at the same time without having problems with different threads trying to read and write to the same place at the same time? I'm trying to avoid using mutexes on everything as that would slow down every single access attempt.
Do you think I would get away with loading the results from disk in the working thread and then delivering the pointer to this structure to an "updater" thread using channels, which then updates the nil value in the StemID_to_Index slice to the pointer of the loaded result? This would mean that two threads would never attempt to write at the same time, but what would happen if another thread tried to read from that exact index of StemID_to_Index while the "updater" thread was updating the pointer? It doesn't matter if a thread is given a nil pointer for a result which is currently being loaded, because it will just be loaded twice; while that is a waste of resources, it would still deliver the same result, and since that is unlikely to happen very often, it's forgivable.
Additionally, how would the working thread which sends the pointer to the "updater" thread know when the "updater" has finished updating the pointer in the slice? Should it just sleep and keep checking, or is there an easy way for the updater to send a message back to the specific thread which pushed to the channel?
UPDATE
I made a little test script to see what would happen when a pointer is accessed at the same moment it is being modified... it seems to always be OK. No errors. Am I missing something?
package main

import (
    "fmt"
    "sync"
)

type tester struct {
    a uint
}

var things *tester

func updater() {
    var a uint
    for {
        what := new(tester)
        what.a = a
        things = what
        a++
    }
}

func test() {
    var t *tester
    for {
        t = things
        if t != nil {
            if t.a < 0 {
                fmt.Println(`Error1`)
            }
        } else {
            fmt.Println(`Error2`)
        }
    }
}

func main() {
    var wg sync.WaitGroup
    things = new(tester)
    go test()
    go test()
    go test()
    go test()
    go test()
    go test()
    go updater()
    go test()
    go test()
    go test()
    go test()
    go test()
    wg.Add(1)
    wg.Wait()
}
UPDATE 2
Taking this further, even if I read and write the same variable from multiple threads at the same time... it makes no difference, still no errors. Here test() from above is modified to also write:
func test() {
    var a uint
    var t *tester
    for {
        t = things
        if t != nil {
            if t.a < 0 {
                fmt.Println(`Error1`)
            }
        } else {
            fmt.Println(`Error2`)
        }
        what := new(tester)
        what.a = a
        things = what
        a++
    }
}
This implies I don't have to worry about concurrency at all... again: am I missing something here?
Answer 1
Score: 1
This sounds like a perfect use case for a memory mapped file:
package main

import (
    "log"
    "os"
    "unsafe"

    "github.com/edsrzf/mmap-go"
)

func main() {
    // Open the backing file
    f, err := os.OpenFile("example.txt", os.O_RDWR|os.O_CREATE, 0644)
    if err != nil {
        log.Fatalln(err)
    }
    defer f.Close()

    // Set its size
    f.Truncate(1024)

    // Memory map it
    m, err := mmap.Map(f, mmap.RDWR, 0)
    if err != nil {
        log.Fatalln(err)
    }
    defer m.Unmap()

    // m is a byte slice
    copy(m, "Hello World")
    m.Flush()

    // here's how to use it with a pointer
    type Coordinate struct{ X, Y int }

    // first get the memory address as a *byte pointer and convert it to an
    // unsafe pointer
    ptr := unsafe.Pointer(&m[20])
    // next convert it into a different pointer type
    coord := (*Coordinate)(ptr)
    // now you can use it directly
    *coord = Coordinate{1, 2}
    m.Flush()
    // and vice-versa
    log.Println(*(*Coordinate)(unsafe.Pointer(&m[20])))
}
The memory map can be larger than real memory and the operating system will handle all the messy details for you.
You will still need to make sure that separate goroutines never read/write to the same segment of memory at the same time.
Answer 2
Score: 0
My top answer would be to use elasticsearch with a client like elastigo.
If that's not an option, it would really help to know how much you care about race-y behavior. If you don't care, a write could happen right after a read finishes, and the user finishing the read will get stale data. You can just have a queue of write and read operations, have multiple threads feed into that queue, and have one dispatcher issue the operations to the map one at a time as they come in. In all other scenarios, you will need a mutex if there are multiple readers and writers. Maps aren't thread-safe in Go.
Honestly though, I would just add a mutex to make things simple for now and optimize later by analyzing where your bottlenecks actually lie. Checking a threshold and then purging 2/3 of your cache seems a bit arbitrary, and I wouldn't be surprised if you kill performance by doing something like that. Here's one situation where that would break down:
Requesters 1, 2, 3, and 4 are frequently accessing many of the same words on files A & B.
Requesters 5, 6, 7, and 8 are frequently accessing many of the same words stored on files C & D.
Now, when requests from these requesters and files interleave in rapid succession, you may end up purging 2/3 of your cache over and over again, throwing away results that may be requested again shortly after. There are a couple of other approaches:
- Cache words that are frequently accessed at the same time on the same box and have multiple caching boxes.
- Cache on a per-word basis with some sort of ranking of how popular that word is. If a new word is accessed from a file while the cache is full, see if other more popular words live in that file and purge less popular entries in the cache in hopes that those words will have a higher hit rate.
- Both approaches 1 & 2.