简单的字符串MapReduce操作

huangapple go评论92阅读模式
英文:

Simple mapReduce operation on strings

问题

我有一个字符串列表:

elems := [n]string{...}

我想执行一个简单的mapReduce操作,使得:

  1. 将每个字符串映射到一个不同的字符串,比如 string -> $string

  2. 将所有字符串缩减为一个带有分隔符的字符串,例如 {s1, s2, s3} -> s1@s2@s3

总之:{s1, s2, s3} -> $s1@$s2@$s3

最佳的方法是什么?

我希望既高效又易读。

如果这个方法足够通用,不仅适用于字符串,那就更好了。

英文:

I have a list of strings

elems := [n]string{...}

I want to perform a simple mapReduce operation, such that I

  1. Map every string to a different string, let's say string -> $string

  2. Reduce all the strings to one string with a separator, e.g. {s1, s2, s3} -> s1@s2@s3

all in all: {s1, s2, s3} -> $s1@$s2@$s3

What's the best way to do this?

I'm looking for efficiency and readability

Bonus points if it's generic enough to work not only on strings

答案1

得分: 2

对于仅映射一个列表,你没有太多选择,只能遍历每个字符串。如果转换算法耗时且你需要速度,可以考虑拆分任务并使用go routine。最后,你可以使用strings.Join函数,该函数有一个选项可以指定分隔符,通常可以高效地执行reduce操作。数据集的大小也是一个考虑因素,对于较大的列表,你可能希望将strings.Join和自定义算法的性能进行比较,并决定是否使用多个go routine/通道来实现你想要的结果。

英文:

For mapping just a list, you won't have much choice other than to go over each string. If the transform algo is time-consuming and you need speed, you can consider splitting the job and use a go routine. Finally you can use the strings.Join function which has an option to specify a separator, this normally performs the reduce part efficiently. The size of the dataset can also be a consideration, and for larger sized lists you may want to compare performance with strings.Join and your own customized algo and see if you want to use multiple go routines/channels to achieve what you want to.

答案2

得分: 1

如果您不需要分别执行这两个操作,只需使用strings.Join()函数即可实现最终结果:

package main

import (
	"fmt"
	"strings"
)

func main() {
	a := []string{"a", "b", "c"}
	p := "$"
	fmt.Println(p + strings.Join(a[:], "@"+p))
}

输出结果为$a@$b@$c

playground

英文:

If you don't need to do the 2 things separately, the end result can be achieved simply by using strings.Join():

package main

import (
	"fmt"
	"strings"
)

func main() {
	a := []string{"a", "b", "c"}
	p := "$"
	fmt.Println(p + strings.Join(a[:], "@"+p))
}

prints $a@$b@$c

playground

答案3

得分: 0

Go不是一种显式的函数式编程语言。

你可以使用for循环来进行映射和归约。

a := []string{"a", "b", "c"}
result := "initvalue"
for n, i := range a {
  result += i + string(n)
}

请注意,以上是翻译的内容,不包括代码部分。

英文:

Go is explicitly NOT a functional programming language.

You map and reduce using a for loop.

a := []string{"a", "b", "c"}
result := "initvalue"
for n, i := range a {
  result += i + string(n)
}

答案4

得分: 0

如果您的map函数不执行任何IO操作(即它们只进行一些计算),将其并发化肯定会使其变慢,即使您正在执行一些IO操作,您也应该进行基准测试。并发不一定会使事情变得更快,有时会增加不必要的复杂性。在许多情况下,一个简单的for循环就足够了。

如果这里的map函数是IO绑定的,或者正在执行一些计算密集型的计算,并且从并发中受益,解决方案可能会有所不同。例如,可以使用NATS来超越一台机器并分发工作负载。

这是一个相对简单的示例。减少阶段不是多阶段的,是阻塞的:

import (
	"fmt"
	"strings"
	"sync"
	"testing"

	"github.com/stretchr/testify/assert"
)

type elem struct {
	index int
	value interface{}
}

func feed(elems []interface{}) <-chan elem {
	result := make(chan elem)
	go func() {
		for k, v := range elems {
			e := elem{
				index: k,
				value: v,
			}
			result <- e
		}
		close(result)
	}()
	return result
}

func mapf(
	input <-chan elem,
	mapFunc func(elem) elem) <-chan elem {
	result := make(chan elem)
	go func() {
		for e := range input {
			eres := mapFunc(e)
			result <- eres
		}
		close(result)
	}()
	return result
}

// is blocking
func reducef(
	input <-chan elem,
	reduceFunc func([]interface{}) interface{}) interface{} {
	buffer := make(map[int]interface{})
	l := 0
	for v := range input {
		buffer[v.index] = v.value
		if v.index > l {
			l = v.index
		}
	}
	data := make([]interface{}, l+1)
	for k, v := range buffer {
		data[k] = v
	}

	return reduceFunc(data)
}

func fanOutIn(
	elemFeed <-chan elem,
	mapFunc func(elem) elem, mapCount int,
	reduceFunc func([]interface{}) interface{}) interface{} {
	MR := make(chan elem)
	wg := &sync.WaitGroup{}
	for i := 0; i < mapCount; i++ {
		mapResult := mapf(elemFeed, mapFunc)

		wg.Add(1)
		go func() {
			defer wg.Done()
			for v := range mapResult {
				MR <- v
			}
		}()
	}
	go func() {
		wg.Wait()
		close(MR)
	}()
	return reducef(MR, reduceFunc)
}

func Test01(t *testing.T) {
	elemFeed := feed([]interface{}{1, 2, 3})
	finalResult := fanOutIn(
		elemFeed,
		func(e elem) elem {
			return elem{
				index: e.index,
				value: fmt.Sprintf("[%v]", e.value),
			}
		},
		3,
		func(sl []interface{}) interface{} {
			strRes := make([]string, len(sl))
			for k, v := range sl {
				strRes[k] = v.(string)
			}
			return strings.Join(strRes, ":")
		})
	assert.Equal(t, "[1]:[2]:[3]", finalResult)
}

由于它使用interface{}作为元素类型,因此可以进行泛化。

英文:

If you are not going to perform any sort of IO operations inside your map functions (means they are doing just some computations), making it concurrent would make it slower for sure and even if you are doing some IO, you should benchmark. Concurrency would not make things faster necessarily and some times add unnecessary complications. In many cases just a simple for loop is sufficient.

If the map functions here are IO bound or are doing some sort of computation heavy calculations that do benefit from going concurrent, solutions can vary. For example NATS can be used to go beyond one machine and distribute the workload.

This is a relatively simple sample. Reduce phase is not multistage and is blocking:

import (
&quot;fmt&quot;
&quot;strings&quot;
&quot;sync&quot;
&quot;testing&quot;
&quot;github.com/stretchr/testify/assert&quot;
)
type elem struct {
index int
value interface{}
}
func feed(elems []interface{}) &lt;-chan elem {
result := make(chan elem)
go func() {
for k, v := range elems {
e := elem{
index: k,
value: v,
}
result &lt;- e
}
close(result)
}()
return result
}
func mapf(
input &lt;-chan elem,
mapFunc func(elem) elem) &lt;-chan elem {
result := make(chan elem)
go func() {
for e := range input {
eres := mapFunc(e)
result &lt;- eres
}
close(result)
}()
return result
}
// is blocking
func reducef(
input &lt;-chan elem,
reduceFunc func([]interface{}) interface{}) interface{} {
buffer := make(map[int]interface{})
l := 0
for v := range input {
buffer[v.index] = v.value
if v.index &gt; l {
l = v.index
}
}
data := make([]interface{}, l+1)
for k, v := range buffer {
data[k] = v
}
return reduceFunc(data)
}
func fanOutIn(
elemFeed &lt;-chan elem,
mapFunc func(elem) elem, mapCount int,
reduceFunc func([]interface{}) interface{}) interface{} {
MR := make(chan elem)
wg := &amp;sync.WaitGroup{}
for i := 0; i &lt; mapCount; i++ {
mapResult := mapf(elemFeed, mapFunc)
wg.Add(1)
go func() {
defer wg.Done()
for v := range mapResult {
MR &lt;- v
}
}()
}
go func() {
wg.Wait()
close(MR)
}()
return reducef(MR, reduceFunc)
}
func Test01(t *testing.T) {
elemFeed := feed([]interface{}{1, 2, 3})
finalResult := fanOutIn(
elemFeed,
func(e elem) elem {
return elem{
index: e.index,
value: fmt.Sprintf(&quot;[%v]&quot;, e.value),
}
},
3,
func(sl []interface{}) interface{} {
strRes := make([]string, len(sl))
for k, v := range sl {
strRes[k] = v.(string)
}
return strings.Join(strRes, &quot;:&quot;)
})
assert.Equal(t, &quot;[1]:[2]:[3]&quot;, finalResult)
}

And since it uses interface{} as the element type, it can get generalized.

huangapple
  • 本文由 发表于 2017年7月20日 14:35:29
  • 转载请务必保留本文链接:https://go.coder-hub.com/45206968.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定