2013年3月17日 00:04:18go评论132阅读模式

英文:

Is there a distributed data processing pipeline framework, or a good way to organize one?

问题

我正在设计一个应用程序，需要一个分布式的处理工作集，这些工作集需要以特定的流程异步地消费和产生数据。例如：

组件A获取页面。
组件B分析来自A的页面。
组件C存储来自B的分析结果。

显然，涉及的组件不止三个。

进一步的要求：

每个组件需要是一个独立的进程（或一组进程）。
生产者不知道消费者的任何信息。换句话说，组件A只是产生数据，不知道哪些组件消费了这些数据。

这是一种由面向拓扑的系统（如Storm）解决的数据流问题。虽然Storm看起来不错，但我对它持怀疑态度；它是一个Java系统，基于Thrift，而我对这两者都不是很喜欢。

我目前倾向于采用发布/订阅式的方法，使用AMQP作为数据传输，使用HTTP作为数据共享/存储的协议。这意味着AMQP队列模型成为了公共API — 换句话说，消费者需要知道生产者使用的AMQP主机和队列 — 对此我并不特别满意，但这可能是值得妥协的。

AMQP方法的另一个问题是每个组件都必须具有非常相似的逻辑来处理以下事项：

连接到队列
处理连接错误
将数据序列化/反序列化为通用格式
运行实际的工作程序（goroutines或分叉子进程）
动态扩展工作程序
容错性
节点注册
处理指标
队列限流
队列优先级（某些工作程序比其他工作程序不重要）

...以及其他每个组件都需要的许多其他细节。

即使消费者在逻辑上非常简单（考虑MapReduce作业，类似将文本拆分为标记），也有很多样板代码。当然，我可以自己做所有这些 — 我对AMQP和队列以及其他一切都非常熟悉 — 并将所有这些封装在所有组件共享的通用包中，但那样我已经在自己发明一个框架的道路上了。

是否存在一个适用于这种情况的好框架？

请注意，我特别询问的是Go语言。我想避免使用Hadoop和整个Java堆栈。

编辑：为了清晰起见，添加了一些要点。

英文:

I am designing an application that requires of a distributed set of processing workers that need to asynchronously consume and produce data in a specific flow. For example:

Component A fetches pages.
Component B analyzes pages from A.
Component C stores analyzed bits and pieces from B.

There are obviously more than just three components involved.

Further requirements:

Each component needs to be a separate process (or set of processes).
Producers don't know anything about their consumers. In other words, component A just produces data, not knowing which components consume that data.

This is a kind of data flow solved by topology-oriented systems like Storm. While Storm looks good, I'm skeptical; it's a Java system and it's based on Thrift, neither of which I am a fan of.

I am currently leaning towards a pub/sub-style approach which uses AMQP as the data transport, with HTTP as the protocol for data sharing/storage. This means the AMQP queue model becomes a public API — in other words, a consumer needs to know which AMQP host and queue that the producer uses — which I'm not particularly happy about, but it might be worth the compromise.

Another issue with the AMQP approach is that each component will have to have very similar logic for:

Connecting to the queue
Handling connection errors
Serializing/deserializing data into a common format
Running the actual workers (goroutines or forking subprocesses)
Dynamic scaling of workers
Fault tolerance
Node registration
Processing metrics
Queue throttling
Queue prioritization (some workers are less important than others)

…and many other little details that each component will need.

Even if a consumer is logically very simple (think MapReduce jobs, something like splitting text into tokens), there is a lot of boilerplate. Certainly I can do all this myself — I am very familiar with AMQP and queues and everything else — and wrap all this up in a common package shared by all the components, but then I am already on my way to inventing a framework.

Does a good framework exist for this kind of stuff?

Note that I am asking specifically about Go. I want to avoid Hadoop and the whole Java stack.

Edit: Added some points for clarity.

答案1

得分: 1

因为Go语言具有CSP通道，我建议Go提供一个特殊的机会来实现一个简单、简洁、完全通用的并行框架。使用相对较少的代码，应该能够比大多数现有的框架做得更好。Java和JVM无法做到这一点。

它只需要使用可配置的TCP传输实现通道。这将包括：

写通道端API，包括对读取端预期服务器的一些通用规范
读通道端API，包括监听端口配置和对select的支持
用于传输数据的编组/解组粘合剂 - 可能是encoding/gob

这样一个框架的成功验收测试应该是，使用通道的程序应该能够在多个处理器上分割，并且保持相同的功能行为（即使性能不同）。

Go语言中已经有相当多的现有传输层网络项目。其中值得注意的是ZeroMQ (0MQ)（gozmq，zmq2，zmq3）。

英文:

Because Go has CSP channels, I suggest that Go provides a special opportunity to implement a framework for parallelism that is simple, concise, and yet completely general. It should be possible to do rather better than most existing frameworks with rather less code. Java and the JVM can have nothing like this.

It requires just the implementation of channels using configurable TCP transports. This would consist of

a writing channel-end API, including some general specification of the intended server for the reading end
a reading channel-end API, including listening port configuration and support for select
marshalling/unmarshalling glue to transfer data - probably encoding/gob

A success acceptance test of such a framework should be that a program using channels should be divisible across multiple processors and yet retain the same functional behaviour (even if the performance is different).

There are quite a few existing transport-layer networking projects in Go. Notable is ZeroMQ (0MQ) (gozmq, zmq2, zmq3).

答案2

得分: 0

我猜你正在寻找一个消息队列，比如beanstalkd、RabbitMQ或者ØMQ（发音为zero-MQ）。所有这些工具的本质都是它们提供了先进先出（或非先进先出）队列的推送/接收方法，有些甚至还具有发布/订阅功能。

因此，一个组件将数据放入队列，另一个组件读取。这种方法非常灵活，可以添加或删除组件，并对每个组件进行扩展或缩小。

大多数这些工具已经有了Go语言的库（ØMQ在Gophers中非常受欢迎）和其他语言的库，因此你的额外代码非常少。只需导入一个库，就可以开始接收和推送消息。

为了减少这种开销并避免对特定API的依赖，你可以编写一个简单的封装包，使用其中一个消息队列系统提供非常简单的推送/接收调用，并在所有工具中使用这个包。

英文:

I guess you are looking for a message queue, like beanstalkd, RabbitMQ, or ØMQ (pronounced zero-MQ). The essence of all of these tools is that they provide push/receive methods for FIFO (or non-FIFO) queues and some even have pub/sub.

So, one component puts data in a queue and another one reads. This approach is very flexible in adding or removing components and in scaling each of them up or down.

Most of these tools already have libraries for Go (ØMQ is very popular among Gophers) and other languages, so your overhead code is very little. Just import a library and start receiving and pushing messages.

And to decrease this overhead and avoid dependency on a particular API, you can write a thin package of yours which uses one of these message queue systems to provide very simple push/receive calls and use this package in all of your tools.

答案3

得分: 0

我了解你想避免使用Hadoop+Java，但是除了花时间开发自己的框架之外，你可能想看看Cascading。它在底层MapReduce作业上提供了一层抽象。

在维基百科上最好的总结，它[Cascading]遵循“源-管道-汇”的范例，数据从源头捕获，经过可重用的“管道”执行数据分析过程，结果存储在输出文件或“汇”中。管道是独立于它们将处理的数据创建的。一旦与数据源和汇绑定，它被称为“流”。这些流可以分组成“级联”，进程调度器将确保在满足所有依赖关系之前不执行给定的流。管道和流可以重复使用和重新排序以支持不同的业务需求。

你也可以看看他们的一些示例，日志解析器，日志分析，TF-IDF（尤其是这个流程图）。

英文:

I understand that you want to avoid Hadoop+Java, but instead of spending time developing your own framework, you may want to have a look at Cascading. It provides a layer of abstraction over underlying MapReduce jobs.

Best summarized on Wikipedia, It [Cascading] follows a ‘source-pipe-sink’ paradigm, where data is captured from sources, follows reusable ‘pipes’ that perform data analysis processes, where the results are stored in output files or ‘sinks’. Pipes are created independent from the data they will process. Once tied to data sources and sinks, it is called a ‘flow’. These flows can be grouped into a ‘cascade’, and the process scheduler will ensure a given flow does not execute until all its dependencies are satisfied. Pipes and flows can be reused and reordered to support different business needs.

You may also want to have a look at some of their examples, Log Parser, Log Analysis, TF-IDF (especially this flow diagram).

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

有没有一个分布式数据处理流水线框架，或者一个好的组织方式？

问题

答案1

答案2

答案3

POST参数中有空格的情况下，应该如何处理？

Go mod replace（补丁）只替换一个依赖包的方法是：

谷歌应用引擎Go-Python/Java混合应用程序

Hugo（静态网站生成器）特定URL的列表

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。