What difficulties should I expect if I write a NoSQL database in Go but want to run Hadoop MapReduce on it?
Question
I would like to build a distributed NoSQL database or key-value store in Go, both to learn the language and to practice the distributed systems knowledge I learned in school. The target use case I can think of is running MapReduce on top of it and implementing an HDFS-compatible "filesystem" to expose the data to Hadoop, similar to running Hadoop on Ceph or Amazon S3.
My question is: what difficulties should I expect when integrating such a NoSQL database with Hadoop, or when integrating with other languages (e.g., providing Ruby/Python/Node.js/C++ APIs), if I build the system in Go?
Answer 1
Score: 2
OK, I'm not much of a Hadoop user, so I'll give you some more general lessons learned about the issues you'll face:
- Protocol. If you're going with REST, Go will be fine, but expect some gotchas in the default HTTP library's defaults (idle keepalive connections that never expire, not necessarily knowing when a reader has closed a stream). If you want something more compact, know that: (a) the Thrift implementation for Go, last I checked, was lacking and relatively slow; (b) Go has great support for RPC, but it might not play well with other languages. So you might want to check out protobuf, or work on top of the redis protocol or something like that.
- GC. Go's GC is very simplistic (STW, not generational, etc.). If you plan on heavy in-memory caching on the order of multiple gigabytes, expect GC pauses all over the place. There are techniques to reduce GC pressure, but the straightforward Go idioms aren't usually optimized for that.
- mmap'ing in Go is not straightforward, so it will be a bit of a struggle if you want to leverage that.
- Besides slices, lists and maps, you won't have a lot of built-in data structures to work with; there is no Set type, for example. There are tons of good implementations out there, but you'll have to do some digging.
- Take the time to learn concurrency patterns and interface patterns in Go. They're a bit different from other languages, and as a rule of thumb, if you find yourself struggling with a pattern from another language, you're probably doing it wrong. A good talk about Go concurrency, IMHO, is this one: http://www.youtube.com/watch?v=QDDwwePbDtw
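To make the GC point a bit more concrete, here is a minimal sketch of one commonly used way to reduce GC pressure in a large in-memory cache: keep all values in a single byte slab and index them with a map whose keys and values contain no pointers, so the collector has very little to trace. This is my own illustration, not something from the answer; the names `Cache`, `span`, `Put` and `Get` are hypothetical, keys are assumed to be pre-hashed to `uint64` by the caller (collisions are not handled), and the `uint32` offsets limit the slab to 4 GiB.

```go
package main

import "fmt"

// span locates one value inside the shared slab.
// It contains no pointers, so a map of spans gives the collector
// nothing to follow (an assumption about the runtime, see lead-in).
type span struct {
	off, n uint32
}

// Cache is a hypothetical append-only store: every value lives in one
// []byte, and the index maps a pre-hashed key to an offset and length.
type Cache struct {
	slab  []byte
	index map[uint64]span
}

// NewCache allocates the slab up front to avoid repeated regrowth.
func NewCache(capacity int) *Cache {
	return &Cache{
		slab:  make([]byte, 0, capacity),
		index: make(map[uint64]span),
	}
}

// Put appends val to the slab and records where it landed.
// Space used by overwritten values is not reclaimed in this sketch.
func (c *Cache) Put(key uint64, val []byte) {
	c.index[key] = span{off: uint32(len(c.slab)), n: uint32(len(val))}
	c.slab = append(c.slab, val...)
}

// Get returns a copy of the value so callers cannot alias the slab.
func (c *Cache) Get(key uint64) ([]byte, bool) {
	s, ok := c.index[key]
	if !ok {
		return nil, false
	}
	out := make([]byte, s.n)
	copy(out, c.slab[s.off:s.off+s.n])
	return out, true
}

func main() {
	c := NewCache(1 << 20)
	c.Put(42, []byte("hello"))
	v, ok := c.Get(42)
	fmt.Println(string(v), ok)
}
```

The trade-off is that you give up the straightforward idiom (`map[string][]byte`) and take on bookkeeping such as compaction and eviction yourself, which is roughly what the answer means by the usual Go idioms not being optimized for this.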
A few projects you might want to have a look at:
- Groupcache - a distributed key/value cache written in Go by Brad Fitzpatrick for Google's own use. It's a great implementation of a simple yet super robust distributed system in Go: https://github.com/golang/groupcache. Also check out Brad's presentation about it: http://talks.golang.org/2013/oscon-dl.slide
- InfluxDB, which includes a Go-based implementation of the great Raft algorithm: https://github.com/influxdb/influxdb
- My own humble (pretty dead) project, a redis-compatible database based on a plugin architecture. My Go has improved since, but it has some nice parts, and it includes a pretty fast server for the redis protocol. https://bitbucket.org/dvirsky/boilerdb
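Since the redis protocol comes up twice above as a language-neutral wire format, here is a rough sketch of what speaking it from Go involves. It only covers the request shape (an array of bulk strings, `*<count>\r\n` followed by `$<length>\r\n<bytes>\r\n` per argument). The function names `EncodeCommand`, `ParseCommand` and `ParseBytes` are hypothetical and are not taken from boilerdb or any other project, and a real server would also need limits on counts and lengths read from untrusted clients.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strconv"
)

// EncodeCommand renders args as a RESP array of bulk strings.
func EncodeCommand(args ...string) []byte {
	var buf bytes.Buffer
	fmt.Fprintf(&buf, "*%d\r\n", len(args))
	for _, a := range args {
		fmt.Fprintf(&buf, "$%d\r\n%s\r\n", len(a), a)
	}
	return buf.Bytes()
}

// readCount reads one header line such as "*3" or "$5" and returns the number.
func readCount(r *bufio.Reader, prefix byte) (int, error) {
	line, err := r.ReadString('\n')
	if err != nil {
		return 0, err
	}
	if len(line) < 4 || line[0] != prefix || line[len(line)-2] != '\r' {
		return 0, fmt.Errorf("resp: malformed %q header: %q", prefix, line)
	}
	n, err := strconv.Atoi(line[1 : len(line)-2])
	if err != nil || n < 0 {
		return 0, fmt.Errorf("resp: bad length in %q", line)
	}
	return n, nil
}

// ParseCommand decodes one RESP array of bulk strings from r.
// No upper bounds are enforced here; that is left out of the sketch.
func ParseCommand(r *bufio.Reader) ([]string, error) {
	n, err := readCount(r, '*')
	if err != nil {
		return nil, err
	}
	args := make([]string, 0, n)
	for i := 0; i < n; i++ {
		size, err := readCount(r, '$')
		if err != nil {
			return nil, err
		}
		buf := make([]byte, size+2) // payload plus trailing CRLF
		if _, err := io.ReadFull(r, buf); err != nil {
			return nil, err
		}
		if buf[size] != '\r' || buf[size+1] != '\n' {
			return nil, fmt.Errorf("resp: bulk string not terminated by CRLF")
		}
		args = append(args, string(buf[:size]))
	}
	return args, nil
}

// ParseBytes is a convenience wrapper for parsing an in-memory message.
func ParseBytes(data []byte) ([]string, error) {
	return ParseCommand(bufio.NewReader(bytes.NewReader(data)))
}

func main() {
	wire := EncodeCommand("SET", "greeting", "hello")
	fmt.Printf("%q\n", wire)
	args, err := ParseBytes(wire)
	fmt.Println(args, err)
}
```

Because bulk strings are length-prefixed, the parser never has to scan payloads for delimiters, and existing redis client libraries in Ruby, Python, Node.js or C++ can talk to such a server without a custom client.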