Which is more efficient: a query through the Cassandra Python driver or PySpark's Cassandra connector?
Question
I am currently building an ETL in PySpark. In the transformation stage I need to validate some of the data against records saved in a Cassandra table, but my approach makes the processing far too slow: it handles just 900 records in 30 minutes.
My approach was to create a function that uses the driver's Session.execute() method, like this:
from cassandra.cluster import Cluster

def select_test_table():
    # Connect to the local Cassandra node and run a full table scan
    cluster = Cluster(['localhost'], port=9042)
    session = cluster.connect('test_keyspace')
    r = session.execute('select * from test_keyspace.test_table')
    return r
While researching, I found that I can do the same with Spark's own Cassandra connector:
spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="test_table", keyspace="test_keyspace") \
    .load() \
    .collect()
Answer 1
Score: 1
Your first query is just a normal full table scan using the driver. That's not going to be performant and is not something I would recommend. Instead, use Spark. Spark breaks the query down into partition-range queries, uses multiple executors, and distributes the query across coordinators.
By the way, that full table scan may not even work: if the table is large enough, it will time out.
Comments