How to search a huge slice of maps[string]string concurrently

Question

I need to search a huge slice of maps[string]string. My thought was that this is a good use case for Go's channels and goroutines.

The plan was to divide the slice into parts and search them in parallel. But I was shocked to find that my parallel version timed out, while a search over the whole slice worked fine.

I'm not sure what I'm doing wrong. Below is the code I used to test the concept. The real code would involve more complexity.

// Search looks for a given term.
// It receives the data to be searched and the search term,
// and returns the matching maps.
// The data is simple: each map contains { key: someText }
func Search(data []map[string]string, term string) []map[string]string {

	set := []map[string]string{}

	for _, v := range data {
		if v["key"] == term {

			set = append(set, v)
		}

	}
	return set

}

This works well for searching the slice for a given term.

Now I thought: if my slice had, say, 20,000 entries, I'd like to do the search in parallel.

// All searches all records concurrently.
// It has the same signature as the Search function,
// but its main task is to fan the slice out into 5 parts
// and search them in parallel.
func All(data []map[string]string, term string) []map[string]string {
	countOfSlices := 5

	part := len(data) / countOfSlices

	fmt.Printf("Size of the data: %v\n", len(data))
	fmt.Printf("Fragment size: %v\n", part)

	timeout := time.After(60000 * time.Millisecond)

	c := make(chan []map[string]string)

	for i := 0; i < countOfSlices; i++ {
		// Fragments of the slice are passed to the search method
		go func() { c <- Search(data[(part*i):(part*(i+1))], term) }()

	}

	result := []map[string]string{}

	for i := 0; i < part-1; i++ {
		select {
		case records := <-c:
			result = append(result, records...)
		case <-timeout:
			fmt.Println("timed out!")
			return result
		}
	}
	return result
}

Here are my tests:

I have a function that generates my test data, and two tests.

func GenerateTestData(search string) ([]map[string]string, int) {
	rand.Seed(time.Now().UTC().UnixNano())
	strin := []string{"String One", "This", "String Two", "String Three", "String Four", "String Five"}
	var matchCount int
	numOfRecords := 20000
	set := []map[string]string{}
	for i := 0; i < numOfRecords; i++ {
		p := rand.Intn(len(strin))
		s := strin[p]
		if s == search {
			matchCount++
		}
		set = append(set, map[string]string{"key": s})
	}
	return set, matchCount
}

The two tests: the first simply traverses the slice, and the second searches in parallel.

func TestSearchItem(t *testing.T) {

	tests := []struct {
		InSearchTerm string
		Fn           func(data []map[string]string, term string) []map[string]string
	}{
		{
			InSearchTerm: "This",
			Fn:           Search,
		},
		{
			InSearchTerm: "This",
			Fn:           All,
		},
	}

	for i, test := range tests {

		startTime := time.Now()
		data, expectedMatchCount := GenerateTestData(test.InSearchTerm)
		result := test.Fn(data, test.InSearchTerm)

		fmt.Printf("Test [%v]:\nTime: %v \n\n", i+1, time.Since(startTime))
		assert.Equal(t, len(result), expectedMatchCount, "expected: %v to be: %v", len(result), expectedMatchCount)

	}
}

It would be great if someone could explain why my parallel code is so slow, what is wrong with it and what I am missing, as well as what the recommended way is to search large in-memory slices (50K+ entries).


Answer 1

Score: 4

This looks like just a simple typo. The problem is that you divide your original big slice into 5 pieces (countOfSlices), and you properly launch 5 goroutines to search each part:

for i := 0; i < countOfSlices; i++ {
    // Fragments of the slice are passed to the search method
    go func() { c <- Search(data[(part*i):(part*(i+1))], term) }()
}

This means you should expect 5 results, but that is not what you wait for. You expect 4000-1 results:

for i := 0; i < part-1; i++ {
    select {
    case records := <-c:
        result = append(result, records...)
    case <-timeout:
        fmt.Println("timed out!")
        return result
    }
}

Obviously, if you only launch 5 goroutines, each of which delivers a single result, you can only expect that many (5) results. And since your loop waits for many more (which will never come), it times out as expected.

Change the loop condition to this:

for i := 0; i < countOfSlices; i++ {
    // ...
}
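Combining this fix with the loop-variable capture issue raised in the other answers, a corrected All could look like the sketch below. This is an illustration, not the original code: it passes i into each goroutine explicitly (which matters on Go versions before 1.22, where the loop variable is shared), and note that it still silently drops the tail of the slice when len(data) is not evenly divisible by countOfSlices.

```go
package main

import (
	"fmt"
	"time"
)

// Search scans one fragment linearly, as in the question.
func Search(data []map[string]string, term string) []map[string]string {
	set := []map[string]string{}
	for _, v := range data {
		if v["key"] == term {
			set = append(set, v)
		}
	}
	return set
}

// All fans the slice out into 5 fragments and collects exactly 5 results.
func All(data []map[string]string, term string) []map[string]string {
	countOfSlices := 5
	part := len(data) / countOfSlices

	timeout := time.After(60 * time.Second)
	c := make(chan []map[string]string)

	for i := 0; i < countOfSlices; i++ {
		// Pass i as a parameter: before Go 1.22 the loop variable
		// is shared across iterations, so capturing it is a race.
		go func(i int) {
			c <- Search(data[part*i:part*(i+1)], term)
		}(i)
	}

	result := []map[string]string{}
	// Receive exactly one result per goroutine: countOfSlices, not part-1.
	for i := 0; i < countOfSlices; i++ {
		select {
		case records := <-c:
			result = append(result, records...)
		case <-timeout:
			fmt.Println("timed out!")
			return result
		}
	}
	return result
}

func main() {
	// 200 alternating records, 100 of which match "This".
	data := []map[string]string{}
	for i := 0; i < 100; i++ {
		data = append(data, map[string]string{"key": "This"}, map[string]string{"key": "Other"})
	}
	fmt.Println(len(All(data, "This"))) // 100
}
```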

Answer 2

Score: 1

Concurrency is not parallelism. Go is a massively concurrent language, not a parallel one. Even on a multicore machine, you pay for data exchange between CPUs when the computation threads access the shared slice. You can take advantage of concurrency by, for example, searching for just the first match, or by doing something with the results (printing them, writing them to some Writer, or sorting them) while the search continues.

func Search(data []map[string]string, term string, ch chan map[string]string) {
    for _, v := range data {
        if v["key"] == term {
            ch <- v
        }
    }
}

func main() {
    ...
    go Search(datapart1, term, ch)
    go Search(datapart2, term, ch)
    go Search(datapart3, term, ch)
    ...
    for vv := range ch {
        fmt.Println(vv) // do something with each match concurrently
    }
    ...
}

The recommended way to search a huge slice is to keep it sorted, or to build a binary tree. There is no magic.


Answer 3

Score: 0

There are two problems. As icza notes, you never finish the select loop, because you need to use countOfSlices. Also, your call to Search will not get the data you want, because you need to allocate it before calling go func(), so allocate the slice outside and pass it in.

You might find that this particular work still isn't faster in parallel with such simple data (perhaps with more complex data, on a machine with many cores, it would be worthwhile).

When testing, be sure to try swapping the order of your test runs; you might be surprised by the results! Also try the benchmarking tools available in the testing package, which run your code many times and average the results; this may give you a better idea of whether the fan-out actually speeds things up.
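The benchmarking suggestion can be sketched as follows. Normally you would write BenchmarkXxx functions in a _test.go file and run `go test -bench .`, but testing.Benchmark also works from a plain program. The Search copy below stands in for the question's version; the timing figures will vary by machine, so none are shown.

```go
package main

import (
	"fmt"
	"testing"
)

// Search is copied from the question: a linear scan over the slice.
func Search(data []map[string]string, term string) []map[string]string {
	set := []map[string]string{}
	for _, v := range data {
		if v["key"] == term {
			set = append(set, v)
		}
	}
	return set
}

func main() {
	// 20K records, none of which match, as a worst case for the scan.
	data := make([]map[string]string, 0, 20000)
	for i := 0; i < 20000; i++ {
		data = append(data, map[string]string{"key": "String One"})
	}

	// testing.Benchmark picks b.N automatically and returns the result.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			Search(data, "This")
		}
	})
	fmt.Println(res.N > 0) // true; per-op time is available via res.NsPerOp()
}
```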


huangapple
  • Published on January 22, 2017
  • Please retain this link when reposting: https://go.coder-hub.com/41783180.html