What is the effect of the replication factor when using RepartitionByCassandraReplica?

Question

I have 16 nodes at my disposal and I am using Spark, Cassandra and the Spark-Cassandra Connector (SCC). I want to evaluate the performance of this cluster, in terms of run time, when a specific statistical test is run on some specific data. So, in one of my scenarios I kept the number of Spark nodes at 16 and started adding nodes to the Cassandra ring. Every Cassandra node that is added already has a Spark installation, and with RepartitionByCassandraReplica (RBCR) I make sure that data locality is achieved. The only other thing I changed was the replication factor.

The times were as follows:

Spark nodes - Cassandra nodes | Replication factor | Time (min)
16 - 1                        | 1                  | 1.883
16 - 2                        | 1                  | 2.333
16 - 3                        | 3                  | 0.933
16 - 4                        | 3                  | 0.9
...
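
To put the numbers side by side, here is a small Python sketch that expresses each run relative to the 16 - 1 baseline. It uses only the timings from the table above; the variable names are illustrative.

```python
# Timings copied from the table in the question (minutes).
# Keys are (Spark nodes, Cassandra nodes).
timings_min = {
    (16, 1): 1.883,
    (16, 2): 2.333,
    (16, 3): 0.933,
    (16, 4): 0.9,
}

baseline = timings_min[(16, 1)]

# A ratio above 1 means slower than the single-Cassandra-node run.
relative = {cfg: t / baseline for cfg, t in timings_min.items()}

for (spark_nodes, cassandra_nodes), ratio in relative.items():
    print(f"{spark_nodes} - {cassandra_nodes}: {ratio:.2f}x the 16 - 1 time")
```

On these figures the 16 - 2 run takes roughly 24% longer than the baseline, while the two runs with replication factor 3 take about half the baseline time.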

My question is why the 2nd case, with 2 Cassandra nodes, takes more time than the 1st case with 1 node. I expected that more Cassandra nodes would allow more simultaneous reads. So does the replication factor play a role? If so, how?

I am using RBCR, which means that when I fetch data from Cassandra, the SCC requests each piece of data from the node that actually stores it. Therefore, I cannot see how the replication factor affects that.
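
One way to reason about where the replication factor could enter is to model which nodes hold a replica of each key. The Python sketch below is a simplified model only, not the connector's or Cassandra's code: it assumes SimpleStrategy-like placement (an owner node plus the next RF - 1 nodes on the ring), uses an MD5-based position instead of Cassandra's real token assignment, and the node names `c1`, `c2`, `c3` and the function names are hypothetical.

```python
import hashlib


def replicas_for_key(key, nodes, rf):
    """Return the nodes holding `key` under a simplified ring model."""
    # Assumption: a stand-in hash picks the owner position; real Cassandra
    # uses partitioner tokens, so the exact split will differ.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    # A key cannot have more replicas than there are nodes.
    count = min(rf, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


def placement_summary(keys, nodes, rf):
    """Count, per node, how many keys it holds a replica of."""
    held = {node: 0 for node in nodes}
    for key in keys:
        for node in replicas_for_key(key, nodes, rf):
            held[node] += 1
    return held


keys = [f"key-{i}" for i in range(1000)]

# The 16 - 2 case from the table: 2 Cassandra nodes, replication factor 1.
two_nodes_rf1 = placement_summary(keys, ["c1", "c2"], rf=1)

# The 16 - 3 case: 3 Cassandra nodes, replication factor 3.
three_nodes_rf3 = placement_summary(keys, ["c1", "c2", "c3"], rf=3)

print(two_nodes_rf1)
print(three_nodes_rf3)
```

In this model, replication factor 1 gives every key exactly one owner, so each key has only one node that can serve it locally. With replication factor 3 on 3 nodes, every node holds a replica of every key, so any of the three can serve any key. How the connector chooses among several replicas is not modeled here.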

EDIT

I think that if I had a replication factor of 2 for the 16 - 2 case, I would get a lower time, something like 1.5 min, but that is something I cannot test right now.

Answer 1

Score: -1

It appears to me that your testing is flawed. You need a one-to-one mapping between Spark workers/executors and Cassandra nodes.

As you already know, you can only achieve data locality when BOTH the Spark JVM and the Cassandra JVM are co-located in the same operating system instance (OSI). In your environment, there is no guarantee that the scheduled worker/executor is on the same OSI as the Cassandra node. Cheers!
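
The co-location point can be checked with a simple set comparison. The following Python sketch uses the 16 - 2 configuration from the question; the host names (`node01` to `node16`) and the `colocation_report` helper are hypothetical, and in practice the two lists would come from the Spark and Cassandra cluster configuration.

```python
def colocation_report(spark_hosts, cassandra_hosts):
    """Split hosts by whether they run Spark, Cassandra, or both."""
    spark = set(spark_hosts)
    cassandra = set(cassandra_hosts)
    return {
        # Only these hosts can read Cassandra data without a network hop.
        "colocated": sorted(spark & cassandra),
        "spark_only": sorted(spark - cassandra),
        "cassandra_only": sorted(cassandra - spark),
    }


# Hypothetical host names for the 16 - 2 case: 16 Spark workers,
# 2 of which also run Cassandra.
spark_hosts = [f"node{i:02d}" for i in range(1, 17)]
cassandra_hosts = ["node01", "node02"]

report = colocation_report(spark_hosts, cassandra_hosts)
print(len(report["colocated"]), "co-located hosts")
print(len(report["spark_only"]), "Spark-only hosts")
```

Under these assumptions, 14 of the 16 Spark hosts have no local Cassandra node, so an executor scheduled on any of them would have to read over the network.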
