Are map() and reduce() appropriate for concurrent processing in Go?

Question
Coming from a Python background, and just starting with Go, I found myself looking for the equivalent of the map() and reduce() functions in Go. I didn't find them, so I fell back on for loops. For example, this is what I used instead of a map(), where mapFunction is defined elsewhere:
data := make([]byte, 1024)
count, err := input.Read(data) // error handling removed from this snippet
for i := 0; i < count; i++ {
    data[i] = mapFunction(data[i])
}
and this is what I used instead of a reduce(), where there are two state variables that I use to keep track of the quoting of fields in a CSV as the code moves through each item in the slice:
data := make([]byte, 1024)
count, err := input.Read(data) // error handling removed from this snippet
for i := 0; i < count; i++ {
    data[i], stateVariable1, stateVariable2 =
        reduceFunction(data[i], stateVariable1, stateVariable2)
}
Here are my questions:
- Are there built-in capabilities for this that I missed?
- Is it appropriate to use mutable slices for each of these?
- Would it be a good idea to use goroutines for the map()? Would that allow decoupling of the IO operation to read the file and the process to run the mapping function on each item, and therefore allow parallelization?
- Is it correct to say that goroutines would not be appropriate for the reduce() function, because the two state variables are defined by all of the preceding data and processing must therefore proceed sequentially? In other words, this sequential process cannot benefit from a concurrent architecture?
Thanks!
ps - the full code is here: https://github.com/dbro/csvquote/blob/go/csvquote.go
Answer 1 (score: 4)
In short:
- No, there is no builtin map or reduce.
- Yes. What else?
- No. Do not even think about such stuff without prior measuring or some proven real need.
- Yes.
A bit longer.
- Go is not a functional language: there are no map/reduce builtins, either in the language or in the standard library.
- There are arrays and slices in Go. Both are mutable. Slices are the natural choice most of the time.
- Premature optimization..., but of course: reading and processing could go into one loop, and wrapping input in a bufio.Reader could be a good idea.
- Goroutines are nice, they allow a different type of program construction, but that does not mean that they are to be used for everything. There is no need to complicate a perfectly clear for loop by introducing goroutines.
Answer 2 (score: 1)
Volker has given a good answer, but it doesn't play to one of Go's main strengths, which is its concurrency. A map/reduce type of operation may be parallelized (premature optimization aside) through the use of a 'server farm' strategy. This involves dividing the work to be done into work packets that are sent to separate workers (i.e. goroutines). Map/Reduce is a generic way of doing this and requires higher order functions and immutable data structures.
Go is flexible enough to allow a bespoke parallel decomposition even though it isn't a functional language. Although there's no immutability, it allows aliasing to be avoided through the use of copy semantics, thereby eliminating race conditions when values are exchanged between goroutines, which is effectively as good. Put simply: use structs directly instead of pointers to structs when sharing. (And to help, there's a new race detector in Go 1.1.)
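To illustrate the copy-semantics point, this sketch (with a hypothetical packet struct, not from the question's code) sends a struct value over a channel; the receiver gets its own copy, so the sender mutating its variable afterwards cannot race with the receiver:

```go
package main

import "fmt"

// packet is a hypothetical work item; the fields are illustrative.
type packet struct {
	data  [8]byte
	count int
}

// sendCopy shows that a struct value is copied when sent on a channel.
func sendCopy() packet {
	ch := make(chan packet, 1)
	p := packet{count: 3}
	copy(p.data[:], "abc")
	ch <- p     // the whole struct is copied into the channel
	p.count = 0 // mutating the sender's copy afterwards...
	return <-ch // ...does not affect what the receiver sees
}

func main() {
	fmt.Println(sendCopy().count) // 3
}
```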
The server farm pattern is a good way of achieving high parallelization efficiency because it is self-balancing. This contrasts with geometric decomposition (i.e. dividing a grid of data into zones and allocating them to processors) and with algorithmic decomposition (i.e. allocating different stages in a pipeline to different processors), both of which can suffer from unbalanced load. Go is capable of expressing all three kinds.
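A minimal sketch of the server-farm idea, assuming independent work packets (here just ints to square, an illustrative task): workers pull jobs from a shared channel, so a faster worker naturally takes more packets and the load self-balances. Results arrive in arbitrary order, which is fine for map-like work but not for the question's stateful reduce:

```go
package main

import (
	"fmt"
	"sync"
)

// square is a stand-in for the per-packet work.
func square(n int) int { return n * n }

// farm distributes jobs to nWorkers goroutines and collects results.
func farm(jobs []int, nWorkers int) []int {
	in := make(chan int)
	out := make(chan int)
	var wg sync.WaitGroup

	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range in { // each worker pulls as fast as it can
				out <- square(n)
			}
		}()
	}
	go func() { // feed the farm, then signal no more work
		for _, n := range jobs {
			in <- n
		}
		close(in)
	}()
	go func() { // close out once every worker has finished
		wg.Wait()
		close(out)
	}()

	var results []int
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	fmt.Println(farm([]int{1, 2, 3, 4}, 2)) // squares of 1..4, arbitrary order
}
```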