How to read several file types in Spark?

Question

I want to read files of several different types. Can I do it in a single Spark operation, i.e. without a loop like this one?

    from pyspark.shell import spark

    load_folder = '...'
    general_df = None
    for extension in ("*.txt", "*.inf"):
        df = spark.read.format("text") \
            .option("pathGlobFilter", extension) \
            .option("recursiveFileLookup", "true") \
            .load(load_folder)
        if general_df is None:
            general_df = df
        else:
            general_df = general_df.union(df)
    general_df.show()

Answer 1

Score: 1

This should work:

    from pyspark.shell import spark

    load_folder = '...'
    # One read: the brace pattern accepts files ending in .txt or .inf
    df = spark.read.format("text") \
        .option("pathGlobFilter", "{*.txt,*.inf}") \
        .option("recursiveFileLookup", "true") \
        .load(load_folder)
    df.show()

The pathGlobFilter option is backed by org.apache.hadoop.fs.GlobFilter, whose syntax seems to be based on bash globbing. In particular, curly brackets are used there to list several alternative patterns, any of which may match.

{*.txt,*.inf} is basically like *.(txt|inf): a file name passes the filter if it ends in either .txt or .inf.
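To get a feel for which names such a pattern selects without starting Spark, here is a rough pure-Python approximation. It is only a sketch: expand_braces and matches_glob are hypothetical helpers written for this illustration, they do not handle nested braces, and Python's fnmatch is not the Hadoop implementation, so edge cases may differ from what pathGlobFilter actually does.

```python
from fnmatch import fnmatchcase


def expand_braces(pattern):
    """Expand {a,b,...} groups into plain glob patterns (nested braces not supported)."""
    start = pattern.find("{")
    end = pattern.find("}", start)
    if start == -1 or end == -1:
        return [pattern]
    prefix, suffix = pattern[:start], pattern[end + 1:]
    expanded = []
    for alternative in pattern[start + 1:end].split(","):
        # Recurse so that any later brace group in the suffix is expanded too.
        expanded.extend(expand_braces(prefix + alternative + suffix))
    return expanded


def matches_glob(file_name, pattern):
    """Return True if file_name matches any brace-expanded alternative (case-sensitive)."""
    return any(fnmatchcase(file_name, p) for p in expand_braces(pattern))


# Made-up file names, used only for illustration.
names = ["notes.txt", "setup.inf", "data.csv", "readme.TXT"]
selected = [n for n in names if matches_glob(n, "{*.txt,*.inf}")]
print(selected)  # ['notes.txt', 'setup.inf']
```

In this approximation the pattern *.{txt,inf} expands to the same two alternatives, *.txt and *.inf, so either spelling selects the same names here.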

More info:
https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#path-glob-filter
https://linuxhint.com/bash_globbing_tutorial/

huangapple
  • Published on July 3, 2023 at 23:03:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76605956.html