What is the impact of replication factor using RepartitionByCassandraReplica?
Question
I have 16 nodes at my disposal and I am using Spark, Cassandra, and the Spark-Cassandra Connector (SCC). I want to evaluate the performance of this cluster from a time perspective when a specific statistical test is run on some specific data. So, in one of my scenarios I kept the number of Spark nodes at 16 and started adding nodes to the Cassandra ring. Every Cassandra node that is added already has a Spark installation, and with RepartitionByCassandraReplica (RBCR) I make sure that data locality is achieved. The only thing I changed was the replication factor.
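For context, the replication factor is set per keyspace; a change like the one described above would be applied with something along these lines (the keyspace name `my_keyspace` is a placeholder, the question does not give the real schema):

```sql
-- Set the replication factor for a keyspace (placeholder name).
ALTER KEYSPACE my_keyspace
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
```

After such a change, a `nodetool repair` is needed so that existing data is actually streamed to the new replicas; otherwise reads may still hit nodes that do not yet hold the data.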
The times were as follows:
| Spark - Cassandra nodes | Replication factor | Time |
|---|---|---|
| 16 - 1 | 1 | 1.883 min |
| 16 - 2 | 1 | 2.333 min |
| 16 - 3 | 3 | 0.933 min |
| 16 - 4 | 3 | 0.9 min |
...
My question is: why does the 2nd case, with 2 Cassandra nodes, take more time than the 1st case with 1 node? I thought that the more Cassandra nodes there are, the more simultaneous reads are possible. So does the replication factor play a role? If so, how?
I am using RBCR, which means that when I fetch data from Cassandra, the SCC will request the data from the node where it is actually stored. Therefore, I cannot see how the replication factor affects that.
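To make the locality claim above concrete, the RBCR pattern in question usually looks like the following sketch (keyspace, table, key class, and host are placeholders, since the real schema is not given):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder key class; its fields must mirror the table's partition key.
case class SensorKey(sensor_id: Int)

val conf = new SparkConf()
  .setAppName("rbcr-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host

val sc = new SparkContext(conf)

// Keys we want to look up, e.g. produced by an earlier stage.
val keys = sc.parallelize((1 to 1000).map(SensorKey))

// Shuffle the keys so that each Spark partition lands on an executor
// co-located with a Cassandra replica owning those keys, then join locally.
val rows = keys
  .repartitionByCassandraReplica("my_keyspace", "my_table", partitionsPerHost = 10)
  .joinWithCassandraTable("my_keyspace", "my_table")
```

Note that with RF = 1 each partition key has exactly one owning node, so RBCR has only one valid placement per key; with a higher RF the connector can choose among several replicas, which is one way the replication factor interacts with data locality.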
EDIT
I think that if I had replication factor 2 for the 16 - 2 case, I would get a lower time, something like 1.5 min, but that is something I cannot test right now.
Answer 1
Score: -1
It appears to me that your testing is flawed. You need to have a one-to-one mapping of Spark workers/executors and Cassandra nodes.
As you already know, you can only achieve data locality when BOTH the Spark JVM and Cassandra JVM are co-located in the same operating system instance (OSI). In your environment, there is no guarantee that the scheduled workers/executors are on the same OSI as the Cassandra node. Cheers!