Databricks parquet read taking too long

Question

I have two separate sets of files with different schemas, stored as Parquet files in Azure Blob Storage. Both are organized into month/day/hour subfolders.

I need to process the files on an hourly schedule, which means I can load just the most recent files from the first schema. However, I need to join each of those records to a record from the second schema, which may have been written at any point in the past. So, to join onto the correct record, I need to load the entire dataset for the second schema.

I'm calling spark.read.parquet(rootlocation) and then using the result in a join. This read understandably takes a long time (almost an hour). Does anyone know of any strategies to optimise this? I don't seem to be getting any parallelism, as I only have 1 job.

Answer 1

Score: 2

You can speed up the process by providing a schema when reading one or both tables. Otherwise, Spark needs to discover all partitions, infer the schema, and so on, and with a lot of files (especially small ones) that can take a long time. Also check that you don't have .option("mergeSchema", "true") set:

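// DDL-style schema string; list every column of the dataset
val schema = "col1 long, col2 string, ..."
// Read as Parquet explicitly instead of relying on the session's default source format
val df = spark.read.schema(schema).parquet("path")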
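
As a rough sketch, the same idea can be written with a typed StructType instead of a DDL string. The column names and types below are placeholders carried over from the snippet above, and lookupSchema and readLookup are hypothetical names, not anything from the question; the real schema of the second dataset has to be substituted.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical schema: replace with the real columns of the second dataset.
val lookupSchema: StructType = StructType(Seq(
  StructField("col1", LongType, nullable = true),
  StructField("col2", StringType, nullable = true)
))

// Reads Parquet with a caller-supplied schema, so Spark does not need to
// open file footers to infer one. Schema merging is switched off explicitly.
def readLookup(spark: SparkSession, rootLocation: String): DataFrame =
  spark.read
    .schema(lookupSchema)
    .option("mergeSchema", "false")
    .parquet(rootLocation)
```

Spark still has to list the files under the root location, so this is likely to remove the schema inference cost rather than the listing cost.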

Alternatively, you can switch from Parquet to Delta Lake tables. In that case the schema is stored in the Delta log and can be fetched much faster.
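
One possible way to make that switch is sketched below, assuming Delta Lake is available on the cluster (as it is on Databricks). The function names and both path parameters are hypothetical. The one-off copy still pays for the slow Parquet read once, and the job that produces the second dataset would then need to write to the Delta location going forward.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// One-off migration: rewrite the existing Parquet data as a Delta table.
def copyParquetToDelta(spark: SparkSession, parquetRoot: String, deltaRoot: String): Unit =
  spark.read.parquet(parquetRoot)
    .write
    .format("delta")
    .mode("overwrite")
    .save(deltaRoot)

// Hourly job: the schema and the file list come from the Delta transaction log,
// so no directory scan or schema inference over the data files is needed.
def readLookupFromDelta(spark: SparkSession, deltaRoot: String): DataFrame =
  spark.read.format("delta").load(deltaRoot)
```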

huangapple
  • Posted on July 11, 2023 at 00:21:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76655616.html