Which is more efficient: a query through the Cassandra driver library or a Cassandra query through PySpark?

Question

I am building an ETL pipeline in PySpark. In the transformation stage I need to validate records against data stored in a Cassandra table, but my approach is far too slow: it processes only 900 records in 30 minutes.

My approach is a function that runs the query with the Python driver's Session.execute() method, on a session obtained from Cluster.connect(), like this:

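from cassandra.cluster import Cluster

def select_test_table():
    # Connect to the local node and open a session on the keyspace.
    cluster = Cluster(['localhost'], port=9042)
    session = cluster.connect('test_keyspace')
    # Unbounded SELECT: a full table scan issued from this single client.
    r = session.execute('select * from test_keyspace.test_table')
    return r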

And while researching, I found that I can do that with Spark's own library:

spark.read.format("org.apache.spark.sql.cassandra").options(table="test_table", keyspace="test_keyspace").load().collect()

Answer 1

Score: 1

Your first query is a plain full table scan issued through the driver: a single client sends one unbounded SELECT and pulls every row back by itself. That is not going to be performant, and I would not recommend it. Use Spark instead. Spark breaks the query down into partition range queries, runs them on multiple executors, and distributes them across coordinators.

By the way, that full table scan may not even work. If the table is large enough, it will time out.
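
To make "partition range queries" more concrete, here is a minimal conceptual sketch in plain Python. It is not the Spark Cassandra Connector's actual code. It assumes the default Murmur3Partitioner (tokens from -2^63 to 2^63 - 1) and a hypothetical partition key column named `id`; the function names are made up for illustration. The idea is that the token ring is cut into contiguous ranges, each of which becomes a small bounded query that can be handed to a different executor.

```python
# Conceptual sketch only; the real connector also sizes splits from
# estimated data volume and groups ranges by replica.
MIN_TOKEN = -2**63       # lowest Murmur3Partitioner token
MAX_TOKEN = 2**63 - 1    # highest Murmur3Partitioner token


def split_token_ring(num_splits):
    """Divide the full token ring into contiguous (start, end] ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + total * i // num_splits
              for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def range_queries(keyspace, table, partition_key, num_splits):
    """Build one bounded CQL query per token range."""
    queries = []
    for i, (start, end) in enumerate(split_token_ring(num_splits)):
        # The first range is closed on the left so no token is skipped.
        lower_op = ">=" if i == 0 else ">"
        queries.append(
            f"SELECT * FROM {keyspace}.{table} "
            f"WHERE token({partition_key}) {lower_op} {start} "
            f"AND token({partition_key}) <= {end}"
        )
    return queries


if __name__ == "__main__":
    # `id` is an assumed partition key name, not taken from the question.
    for q in range_queries("test_keyspace", "test_table", "id", 4):
        print(q)
```

Each generated query touches only a slice of the ring, so no single request has to stream the whole table, which is also why the range-based approach is less prone to the timeout mentioned above.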

huangapple
  • Published on June 9, 2023 at 03:04:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76434970.html