How to run multiple table loads in parallel in Azure Databricks

Question

My driver program receives the list of tables in JSON format,
e.g. ["A","B","C"]

Current process:

  1. The tables in the list are independent of each other, and each table has one or two transformations (Txn).
  2. The tables are loaded sequentially, and all extraction, transformation, and loading steps are performed with Spark DataFrame operations.
  3. However, we want to load all tables in parallel. To achieve this parallelism, do we need to change any Spark configuration in Azure Databricks?
    Note:
    If we explicitly set the scheduler mode to "FAIR" with a new pool, we get an error.

Answer 1

Score: 0

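If you iterate over a Scala parallel collection, the Spark jobs are launched in parallel from separate driver threads. Just a dummy example:

Seq("A", "B", "C").par.foreach { t =>
  // spark.read.text already returns a DataFrame, so no .load() call is needed
  spark.read.text(s"$t.txt").write.saveAsTable(t)
}

Note the .par, which converts the Seq into its parallel version. On Scala 2.13 and later, .par lives in the separate scala-parallel-collections module and needs import scala.collection.parallel.CollectionConverters._.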
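
If you want explicit control over how many tables load at once, the same idea can be sketched with Futures on a fixed-size thread pool instead of .par. This is only a rough sketch, not a tested Databricks recipe: the names ParallelTableLoads, loadAll and maxConcurrent are made up for illustration, and the load function is passed in, so the Spark part is whatever you already do per table.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import scala.util.Try

// Hypothetical helper, not part of Spark or Databricks.
object ParallelTableLoads {

  /** Runs `load` once per table on a pool of `maxConcurrent` driver threads.
    * Each outcome is captured as a Try, so one failing table does not hide
    * the results of the others. */
  def loadAll(tables: Seq[String], maxConcurrent: Int)(
      load: String => Unit): Map[String, Try[Unit]] = {
    val pool = Executors.newFixedThreadPool(maxConcurrent)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // One Future per table; each Spark action inside `load` is submitted
      // as its own job from its own thread.
      val futures = tables.map(t => Future(t -> Try(load(t))))
      // Block the driver until every table has finished or failed.
      Await.result(Future.sequence(futures), Duration.Inf).toMap
    } finally {
      pool.shutdown()
    }
  }
}

// Usage, assuming a `spark` session is in scope (as in a notebook):
//
//   val results = ParallelTableLoads.loadAll(Seq("A", "B", "C"), maxConcurrent = 3) { t =>
//     // Optional and an assumption about your setup: assign this thread's jobs
//     // to a scheduler pool named after the table.
//     // spark.sparkContext.setLocalProperty("spark.scheduler.pool", t)
//     spark.read.text(s"$t.txt").write.saveAsTable(t)
//   }
//   results.foreach { case (t, r) => println(s"$t -> $r") }
```

Capping the pool size keeps a long table list from flooding the cluster with concurrent jobs, and the returned map lets you retry only the tables that failed.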

huangapple
  • Posted on May 29, 2023, 18:11:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76356437.html