How can I implement custom partitioning logic in Cassandra?
Question
I'm new to Cassandra and I'm building a chat application. I need to store the chat messages in a database, and I plan to use Cassandra because it supports fast writes. My data model for the "Messages" table is the following:
message_id
from_user_id
to_user_id
channel_id
message_text
Now, I want to shard this so that the chat history for a particular channel (1:1 or group chat) is accessible from a single shard. Ideally, I would shard the DB on channel_id. Here are the challenges I have:
- I cannot make "channel_id" the partition key since it can have duplicates (one channel can have multiple messages), and Cassandra doesn't allow duplicates if the partition key is made up of only one column.
- I can use "channel_id+message_id" as the partition key, which will be unique, but then Cassandra will hash the combination of channel_id+message_id and can place messages from the same channel on different shards, so I would have to do scatter-gather again.
Here are my questions:
- Is there a way to override the partitioning/sharding logic of Cassandra (or any non-relational DB)? Can I write my own partitioning logic so that my application determines the shard to write to? I know Redis allows client-side partitioning (see this). Does Cassandra allow something like this too?
- General question: usually, when designing systems, we try to shard our data so as to minimize scatter-gather for the most frequent queries. If Cassandra (and other non-relational DBs) doesn't allow duplicates in partition keys, how else can we achieve such a design? (Note that using multiple columns as the partition key doesn't help, because Cassandra will hash the combination of those column values and place them on different shards, as in the example above.)
Could someone help me understand how best to model this? Please correct me if my understanding is wrong. (As mentioned, I'm new to Cassandra and to non-relational DBs in general.)
Answer 1
Score: 2
- Cassandra partitioning
In Cassandra, a table's primary key is made of two parts: a partition key (either a simple key made of one column or a composite key made of multiple columns) and optional clustering keys (also known as sort keys).
The partition key is responsible for data distribution across your nodes, i.e. determines the "shard".
The clustering key is responsible for data sorting within the partition.
Example:
create table example (
    col_A text,
    col_B text,
    col_C text,
    col_D text,
    col_E text,
    col_F text,
    PRIMARY KEY((col_A, col_B), col_C, col_D)
);
In this example, values of col_A and col_B determine the partition, while col_C and col_D are still part of the primary key, but only define the order of data within that partition.
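In your case, you can make channel_id the partition key and message_id a clustering key that sorts records within that partition (shard). Many rows can share the same channel_id, because uniqueness is enforced on the full primary key (channel_id, message_id), not on the partition key alone.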
A table might look like this:
CREATE TABLE messages (
    channel_id text,
    message_id text,
    from_user_id text,
    message_text text,
    to_user_id text,
    PRIMARY KEY (channel_id, message_id)
) WITH CLUSTERING ORDER BY (message_id ASC);
Example data:
select * from messages;

 channel_id | message_id | from_user_id | message_text | to_user_id
------------+------------+--------------+--------------+------------
          1 |        m-1 |          u-1 |          foo |        u-2
          1 |        m-2 |          u-2 |          bar |        u-1
          2 |        m-1 |         u-11 |      foo bar |       u-10
          2 |       m-10 |         u-10 |          bar |       u-11
          2 |       m-11 |         u-11 |          foo |       u-10

select * from messages where channel_id = '1';

 channel_id | message_id | from_user_id | message_text | to_user_id
------------+------------+--------------+--------------+------------
          1 |        m-1 |          u-1 |          foo |        u-2
          1 |        m-2 |          u-2 |          bar |        u-1
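To make the partition key / clustering key split concrete, here is a minimal Python sketch of the same layout. It is a toy in-memory model, not how Cassandra stores data; the names `partitions`, `insert_message`, and `channel_history` are made up for illustration.

```python
from collections import defaultdict

# Toy model of PRIMARY KEY (channel_id, message_id).
# Outer key = partition key, inner key = clustering key.
# Hypothetical structure for illustration only.
partitions = defaultdict(dict)


def insert_message(channel_id, message_id, from_user_id, to_user_id, message_text):
    """Write a row; a row with the same full primary key is overwritten (upsert)."""
    partitions[channel_id][message_id] = {
        "message_id": message_id,
        "from_user_id": from_user_id,
        "to_user_id": to_user_id,
        "message_text": message_text,
    }


def channel_history(channel_id):
    """Read a single partition, returning rows ordered by the clustering key."""
    rows = partitions.get(channel_id, {})
    return [rows[message_id] for message_id in sorted(rows)]


insert_message("1", "m-2", "u-2", "u-1", "bar")
insert_message("1", "m-1", "u-1", "u-2", "foo")
insert_message("2", "m-1", "u-11", "u-10", "foo bar")

for row in channel_history("1"):
    print(row["message_id"], row["message_text"])
```

Note that a text clustering column compares lexically, so "m-10" sorts before "m-2". When chronological order matters, a time-based type such as timeuuid is a common choice for message_id.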
- Custom partitioning
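The component that determines data placement on the server side, based on partition key values, is called the partitioner. It is configured in the cassandra.yaml file on each node and applies to the whole cluster; once a cluster holds data, the partitioner cannot be changed without reloading that data.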
Basically, a partitioner is a function for deriving a token representing a row from its partition key, typically by hashing. Each row of data is then distributed across the cluster by the value of the token.
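The following Python sketch illustrates the idea under simplifying assumptions: MD5 is used only as a convenient hash, the node names are invented, and a modulo stands in for real token-range ownership on the ring. It is not Cassandra's actual token computation, but it shows why the choice of partition key matters.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def token(partition_key: str) -> int:
    """Hash the partition key to an integer token (MD5 chosen only for illustration)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def node_for(partition_key: str) -> str:
    """Pick a node from the token. Modulo stands in for real token-range ownership."""
    return NODES[token(partition_key) % len(NODES)]


message_ids = [f"m-{i}" for i in range(50)]

# Partition key = channel_id: the key is identical for every message in the channel.
same_channel = {node_for("channel-1") for _ in message_ids}

# Partition key = (channel_id, message_id): every message produces a different key.
composite = {node_for(f"channel-1:{m}") for m in message_ids}

print("channel_id only:", same_channel)
print("channel_id + message_id:", composite)
```

With channel_id alone as the partition key, every message in the channel maps to the same token and therefore the same node. With the combined key, each message is hashed independently and the channel is spread across nodes.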
Cassandra offers the following partitioners, which can be set in the cassandra.yaml file:
- Murmur3Partitioner (default): uniformly distributes data across the cluster based on MurmurHash hash values.
- RandomPartitioner: uniformly distributes data across the cluster based on MD5 hash values.
- ByteOrderedPartitioner: keeps an ordered distribution of data lexically by key bytes.
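As a rough illustration of the difference between hashing and byte-ordered placement, the Python sketch below sorts a few keys the way each style of partitioner would order them on the ring. The functions `hashed_token` and `byte_ordered_token` are hypothetical stand-ins, and the MD5 hex digest is not the real token format.

```python
import hashlib

keys = ["channel-1", "channel-2", "channel-3", "channel-4"]


def hashed_token(key: str) -> str:
    """Stand-in for a hashing partitioner: the token is unrelated to key order."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def byte_ordered_token(key: str) -> bytes:
    """Stand-in for a byte-ordered partitioner: the token is the raw key bytes."""
    return key.encode("utf-8")


by_hash = sorted(keys, key=hashed_token)
by_bytes = sorted(keys, key=byte_ordered_token)

print("ring order when hashing:", by_hash)
print("ring order by key bytes:", by_bytes)
```

Ordering by key bytes keeps adjacent keys next to each other, which allows range scans over partition keys but tends to concentrate load when keys are sequential. Hashing spreads keys across the cluster instead.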
You can read more about partitioners here: https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/architecture/archPartitionerAbout.html