Which is more efficient: a query through the Cassandra Python driver or PySpark's Cassandra connector?
Question
I am currently building an ETL in PySpark. In the transformation stage I need to validate some of the data against records saved in a Cassandra table, but my approach makes the processing far too slow: it handles just 900 records in 30 minutes.
My approach was to create a function that uses the driver's Session.execute() method, like this:
from cassandra.cluster import Cluster

def select_test_table():
    # Connect to the local Cassandra node and run a full table scan
    cluster = Cluster(['localhost'], port=9042)
    session = cluster.connect('test_keyspace')
    r = session.execute('select * from test_keyspace.test_table')
    return r
While researching, I found that I can do the same with Spark's own Cassandra connector:
spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="test_table", keyspace="test_keyspace") \
    .load() \
    .collect()
Answer 1
Score: 1
Your first query is just a normal full table scan using the driver. That's not going to be performant and is not something I would recommend. Instead, use Spark. Spark breaks the query down into partition-range queries, uses multiple executors, and distributes the query across coordinators.
By the way, that full table scan may not even work: if the table is large enough, it will time out.
Comments