How to run multiple table loads in parallel in Azure Databricks
Question
My driver program gets the list of tables in JSON format.
E.g.: ["A", "B", "C"]
Current process:
- Each table in the list is independent, and each table has one or two Txns.
- Each table is loaded sequentially; all extraction, Txn, and load steps are performed using Spark DataFrame operations.
- However, we want to load all the tables in parallel. To achieve parallelism, do we need to change any Spark configuration in Azure Databricks?
Note:
If we explicitly set the scheduler mode to "FAIR" with a new pool, we get an error.
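For context, FAIR scheduling with pools is normally a two-part setup: `spark.scheduler.mode` is set to `FAIR` in the cluster's Spark config, and each thread that submits jobs selects a pool via a local property. A minimal sketch, in which `tablePool` is a hypothetical pool name:

```scala
// Cluster-level Spark config (set in the Databricks cluster UI or
// spark-defaults, not in application code):
//   spark.scheduler.mode FAIR
// Then, on the thread that submits a table's jobs, pick the pool:
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "tablePool")
```

Pools affect how concurrent jobs share cluster resources; they do not by themselves make sequential driver code run in parallel.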
Answer 1
Score: 0
If you iterate over a Scala parallel collection, the Spark jobs are launched in parallel. Just a dummy example:
Seq("A", "B", "C").par.foreach { t =>
  spark.read.text(s"$t.txt").write.saveAsTable(t)
}
Note the .par, which converts the Seq into its parallel counterpart.
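By default a parallel collection uses the JVM-wide fork-join pool, so the degree of parallelism is not under your control. A sketch of capping concurrency with a dedicated task support (Scala 2.12, as shipped in most Databricks runtimes; in Scala 2.13 parallel collections live in the separate scala-parallel-collections module). The queue here is a stand-in for the per-table Spark load:

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, ForkJoinPool}
import scala.collection.parallel.ForkJoinTaskSupport

// Stand-in for "table loaded" bookkeeping; in the real job the body
// of foreach would hold the Spark read/Txn/saveAsTable for one table.
val done = new ConcurrentLinkedQueue[String]()

val tables = Seq("A", "B", "C").par
// Allow at most 2 table loads at a time (2 is an arbitrary example value).
tables.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(2))
tables.foreach { t => done.add(t) }
```

Each element is still processed exactly once; only the scheduling changes.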