Spark ETL Large data transfer - how to parallelize
Question
I am looking to move a large amount of data from one database to another, and I have seen that Spark is a good tool for doing this. I am trying to understand the process and the ideology behind Spark's big-data ETLs. I would also appreciate it if someone could explain how Spark goes about parallelizing (or splitting) the data into the various jobs that it spawns. My main aim is to move the data from BigQuery to Amazon Keyspaces, and the data is around 40 GB.
Here is the understanding I have gathered so far from online resources.
This is the code to read the data from BigQuery.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName('spark-bigquery-ks') \
    .getOrCreate()

# Service-account key file used by the spark-bigquery connector.
spark.conf.set("credentialsFile", "./cred.json")

# Load data from BigQuery.
# Note: 'table' and 'query' are alternative ways to read; a 'query' read also
# requires viewsEnabled=true and a materializationDataset, so only 'table' is used here.
df = spark.read.format('bigquery') \
    .option('parentProject', 'project-id') \
    .option('dataset', 'mydataset') \
    .option('table', 'mytable') \
    .load()

print(df.head())
Now I need to figure out the best way to transform the data (which is super easy and I can do that myself), but my most important question is about batching and handling such a large data set. Are there any considerations I need to keep in mind when moving data that won't fit in memory to Keyspaces?
from ssl import SSLContext, CERT_REQUIRED, PROTOCOL_TLSv1_2

import boto3
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra_sigv4.auth import SigV4AuthProvider

# Amazon Keyspaces requires TLS; verify against the Amazon root CA.
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations('./AmazonRootCA1.pem')
ssl_context.verify_mode = CERT_REQUIRED

# SigV4 authentication via a boto3 session.
boto_session = boto3.Session(aws_access_key_id="accesstoken",
                             aws_secret_access_key="accesskey",
                             aws_session_token="token",
                             region_name="region")
auth_provider = SigV4AuthProvider(boto_session)

# Keyspaces only accepts LOCAL_QUORUM consistency for writes.
profile = ExecutionProfile(consistency_level=ConsistencyLevel.LOCAL_QUORUM)

cluster = Cluster(['cassandra.region.amazonaws.com'],
                  ssl_context=ssl_context,
                  auth_provider=auth_provider,
                  port=9142,
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()
Finally, pushing the data to Keyspaces would look something like this:
# Write data to Amazon Keyspaces.
# pdf is a pandas DataFrame (e.g. produced with df.toPandas()).
insert_query = session.prepare(
    f"INSERT INTO {keyspace_name}.{table_name} "
    "(keyfamilyid, recommendedfamilyid, rank, chi, recommendationtype, "
    "title, location, typepriority, customerid) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)"
)

for index, row in pdf.iterrows():
    try:
        session.execute(insert_query, (
            row["keyfamilyid"],
            row["recommendedfamilyid"],
            row["rank"],
            row["chi"],
            row["recommendationtype"],
            row["title"],
            row["location"],
            row["typepriority"],
            row["customerid"],
        ))
    except Exception as e:
        print(f"Error writing data for row {index}: {e}")
Answer 1
Score: 2
You can use the Spark Cassandra Connector to copy data to Amazon Keyspaces. In the following example I write to Keyspaces from S3, but BigQuery would be similar. The most important things when writing to Keyspaces are the following:
- Prewarm the tables using provisioned capacity mode. Provision a high amount of capacity to ensure the table has enough resources at the initial write rate (see the sketch after this list).
- Shuffle the data coming from BigQuery, since it will most likely be exported in sorted order. When inserting into NoSQL you should write in a random access pattern.
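For the prewarming step, a table can be switched to provisioned capacity before the bulk load with an ALTER TABLE statement. The following is a minimal sketch, assuming a CQL session like the one created in the question; the keyspace, table, and capacity values are placeholders to tune for your expected write rate.

# Prewarm the target table: switch it to provisioned capacity with generous write units.
# keyspace_name, table_name and the capacity numbers are placeholders.
prewarm_cql = f"""
ALTER TABLE {keyspace_name}.{table_name}
WITH CUSTOM_PROPERTIES = {{
    'capacity_mode': {{
        'throughput_mode': 'PROVISIONED',
        'read_capacity_units': 100,
        'write_capacity_units': 3000
    }}
}}
"""
session.execute(prewarm_cql)

After the load finishes you can switch the table back to on-demand mode the same way. The Glue/Scala job below shows the full read-shuffle-write flow.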
// Imports assumed for an AWS Glue Scala job (adjust to your Glue and connector versions).
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.log.GlueLogger
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]) {
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME", "KEYSPACE_NAME", "TABLE_NAME", "DRIVER_CONF", "FORMAT", "S3_URI").toArray)
    val driverConfFileName = args("DRIVER_CONF")

    val conf = new SparkConf()
      .setAll(
        Seq(
          ("spark.task.maxFailures", "10"),
          ("spark.cassandra.connection.config.profile.path", driverConfFileName),
          ("spark.cassandra.query.retry.count", "1000"),
          ("spark.cassandra.output.consistency.level", "LOCAL_QUORUM"),
          ("spark.cassandra.sql.inClauseToJoinConversionThreshold", "0"),
          ("spark.cassandra.sql.inClauseToFullScanConversionThreshold", "0"),
          ("spark.cassandra.concurrent.reads", "512"),
          ("spark.cassandra.output.concurrent.writes", "15"),
          ("spark.cassandra.output.batch.grouping.key", "none"),
          ("spark.cassandra.output.batch.size.rows", "1")
        ))

    val spark: SparkContext = new SparkContext(conf)
    val glueContext: GlueContext = new GlueContext(spark)
    val sparkSession: SparkSession = glueContext.getSparkSession
    import sparkSession.implicits._

    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    val logger = new GlueLogger

    // Validation steps for peers and partitioner.
    val connector = CassandraConnector.apply(conf)
    val session = connector.openSession()
    val peersCount = session.execute("SELECT * FROM system.peers").all().size()
    val partitioner = session.execute("SELECT partitioner from system.local").one().getString("partitioner")
    logger.info("Total number of seeds:" + peersCount)
    logger.info("Configured partitioner:" + partitioner)

    if (peersCount == 0) {
      throw new Exception("No system peers found. Check required permissions to read from the system.peers table. If using VPCE check permissions for describing VPCE endpoints. https://docs.aws.amazon.com/keyspaces/latest/devguide/vpc-endpoints.html")
    }

    if (partitioner.equals("com.amazonaws.cassandra.DefaultPartitioner")) {
      throw new Exception("Spark requires the use of RandomPartitioner or Murmur3Partitioner. See Working with partitioners in the Amazon Keyspaces documentation. https://docs.aws.amazon.com/keyspaces/latest/devguide/working-with-partitioners.html")
    }

    val tableName = args("TABLE_NAME")
    val keyspaceName = args("KEYSPACE_NAME")
    val backupFormat = args("FORMAT")
    val s3bucketBackupsLocation = args("S3_URI")

    val orderedData = sparkSession.read.format(backupFormat).load(s3bucketBackupsLocation)

    // Randomize the data before loading to maximize table throughput and avoid WriteThrottleEvents.
    // Data exported from another database or Cassandra may be ordered by primary key.
    // With Amazon Keyspaces you want to load data in a random way to use all available resources.
    val shuffledData = orderedData.orderBy(rand())

    shuffledData.write
      .format("org.apache.spark.sql.cassandra")
      .mode("append")
      .option("keyspace", keyspaceName)
      .option("table", tableName)
      .save()

    Job.commit()
  }
}
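The same shuffle-and-write pattern can also be expressed in PySpark, which is closer to the code in the question. The following is a minimal sketch, assuming the Spark Cassandra Connector package and its driver/SigV4 configuration are already available to the Spark session, that df is the DataFrame read from BigQuery, and that the keyspace and table names are placeholders.

from pyspark.sql.functions import rand

keyspace_name = "mykeyspace"  # placeholder
table_name = "mytable"        # placeholder

# Randomize row order so writes spread across partitions and avoid throttling.
shuffled_df = df.orderBy(rand())

# Write the DataFrame through the Spark Cassandra Connector; each executor
# writes its own partitions in parallel instead of looping over rows on the driver.
shuffled_df.write \
    .format("org.apache.spark.sql.cassandra") \
    .mode("append") \
    .options(keyspace=keyspace_name, table=table_name) \
    .save()

With this approach the data never has to be collected to the driver: Spark reads the source in partitions and each task writes its rows concurrently, which is also how the job gets parallelized.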
Comments