What is the impact of replication factor using RepartitionByCassandraReplica?

Question

I have 16 nodes at my disposal, and I am using Spark, Cassandra, and the Spark-Cassandra Connector (SCC). I want to measure how long this cluster takes to run a specific statistical test on a specific data set. In one of my scenarios, I kept the number of Spark nodes at 16 and started adding nodes to the Cassandra ring. Every Cassandra node that is added already has a Spark installation, and I use RepartitionByCassandraReplica (RBCR) to make sure that data locality is achieved. Apart from the number of Cassandra nodes, the only thing I changed was the replication factor.

The times were as follows:

Spark nodes - Cassandra nodes | Replication factor | Time (min)
16 - 1                        | 1                  | 1.883
16 - 2                        | 1                  | 2.333
16 - 3                        | 3                  | 0.933
16 - 4                        | 3                  | 0.9
...

My question is why the second case, with 2 Cassandra nodes, takes longer than the first case, with 1 node. I expected that more Cassandra nodes would allow more simultaneous reads. Does the replication factor play a role? If so, how?

I am using RBCR, which means that when I fetch data from Cassandra, the SCC requests each piece of data from the node that actually stores it. Therefore, I cannot see how the replication factor affects that.
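To make the link between replication factor and "the node that actually stores it" concrete, here is a minimal toy model in Python. It is not the connector's code: it assumes a simplified ring in which each node owns one equal, contiguous token range and replicas are the owner plus the next nodes clockwise (roughly SimpleStrategy without vnodes). The names `token`, `replicas`, `local_candidates`, and the `cassandra-N` host names are hypothetical.

```python
import hashlib
from collections import Counter


def token(key: str, ring_size: int = 2**32) -> int:
    """Map a partition key to a position on a toy token ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % ring_size


def replicas(key: str, nodes: list, rf: int, ring_size: int = 2**32) -> list:
    """Return the nodes that hold a copy of `key`.

    Assumption: node i owns the i-th equal slice of the ring, and the
    replicas are the owner plus the next rf - 1 nodes clockwise.
    """
    if not 1 <= rf <= len(nodes):
        raise ValueError("rf must be between 1 and the number of nodes")
    owner = token(key, ring_size) * len(nodes) // ring_size
    return [nodes[(owner + i) % len(nodes)] for i in range(rf)]


def local_candidates(keys: list, nodes: list, rf: int) -> Counter:
    """Count, per node, how many keys that node can serve from local disk."""
    counts = Counter()
    for key in keys:
        for node in replicas(key, nodes, rf):
            counts[node] += 1
    return counts


if __name__ == "__main__":
    keys = [f"key-{i}" for i in range(10_000)]
    # (Cassandra nodes, replication factor) pairs from the table above.
    for n, rf in [(1, 1), (2, 1), (3, 3), (4, 3)]:
        nodes = [f"cassandra-{i}" for i in range(n)]
        print(n, rf, dict(local_candidates(keys, nodes, rf)))
```

In this model, a replication factor of 1 means every key has exactly one node that can serve it locally, while 3 nodes with a replication factor of 3 means every node holds a copy of every key. So the replication factor does change how many hosts are valid local targets for a given key, even when reads are routed to replicas.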

EDIT

I think that with a replication factor of 2 in the 16 - 2 case I would get a lower time, something like 1.5 min, but that is something I cannot test right now.

Answer 1

Score: -1

It appears to me that your testing is flawed. You need a one-to-one mapping between Spark workers/executors and Cassandra nodes.

As you already know, you can only achieve data locality when both the Spark JVM and the Cassandra JVM are co-located in the same operating system instance (OSI). In your environment, there is no guarantee that the scheduled worker/executor is on the same OSI as the Cassandra node that holds the data. Cheers!
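As a rough illustration of the co-location point, the following Python sketch counts how many Spark workers share a host with a Cassandra node in each scenario from the question. The host names and the `colocated_workers` helper are hypothetical, and the sketch assumes one worker per host; it says nothing about where the scheduler actually places tasks.

```python
def colocated_workers(spark_hosts, cassandra_hosts):
    """Return the hosts that run both a Spark worker and a Cassandra node."""
    return sorted(set(spark_hosts) & set(cassandra_hosts))


# Hypothetical cluster: 16 hosts, each running one Spark worker.
spark_hosts = [f"node-{i:02d}" for i in range(16)]

if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        # Assumption: Cassandra runs on the first n of the 16 hosts.
        cassandra_hosts = spark_hosts[:n]
        local = colocated_workers(spark_hosts, cassandra_hosts)
        print(f"{n} Cassandra node(s): {len(local)} of {len(spark_hosts)} "
              "workers can read from a local Cassandra node")
```

With 16 Spark workers and only a few Cassandra nodes, most workers have no local Cassandra node at all, which is why a one-to-one layout gives a cleaner test.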

Posted by huangapple on 2023-02-27 16:04:24
Source: https://go.coder-hub.com/75578019.html