How to read several file types in Spark?

Question

I want to read files of several different types. Can I do it in a single Spark operation, i.e. without a loop like this one?

from pyspark.shell import spark

load_folder = '...'

general_df = None
for extension in ("*.txt", "*.inf"):
    df = spark.read.format("text") \
        .option("pathGlobFilter", extension) \
        .option("recursiveFileLookup", "true") \
        .load(load_folder)
    if general_df is None:
        general_df = df
    else:
        general_df = general_df.union(df)

general_df.show()

Answer 1

Score: 1

This should work:

from pyspark.shell import spark

load_folder = '...'

df = spark.read.format("text") \
       .option("pathGlobFilter", "{*.txt,*.inf}") \
       .option("recursiveFileLookup", "true") \
       .load(load_folder)

df.show()

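The pathGlobFilter option follows the syntax of org.apache.hadoop.fs.GlobFilter, which appears to be modeled on bash globbing. In that syntax, curly brackets enclose a comma-separated list of alternatives, and a file name is accepted if it matches any one of them.

{*.txt,*.inf} therefore accepts any file name ending in .txt or .inf, roughly like the regular expression .*\.(txt|inf)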
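To make the brace semantics concrete without a Spark cluster, here is a minimal pure-Python sketch that translates a simple brace glob into a regular expression and filters a list of file names. The helper names `brace_glob_to_regex` and `matches` and the sample file names are made up for illustration. This only approximates the behaviour described above (it handles `*`, `?` and one level of braces) and is not how Hadoop's GlobFilter is actually implemented.

```python
import re


def brace_glob_to_regex(pattern: str) -> str:
    """Translate a simple glob into a regex (illustration only).

    Supports '*', '?' and one non-nested level of {a,b} alternation.
    """
    out = []
    in_braces = False
    for ch in pattern:
        if ch == "*":
            out.append(".*")          # any run of characters
        elif ch == "?":
            out.append(".")           # exactly one character
        elif ch == "{":
            in_braces = True
            out.append("(?:")         # start of the alternatives
        elif ch == "}" and in_braces:
            in_braces = False
            out.append(")")           # end of the alternatives
        elif ch == "," and in_braces:
            out.append("|")           # separator between alternatives
        else:
            out.append(re.escape(ch))  # literal character
    return "".join(out)


def matches(pattern: str, name: str) -> bool:
    """Return True if the whole file name matches the glob pattern."""
    return re.fullmatch(brace_glob_to_regex(pattern), name) is not None


# Hypothetical file names, as they might appear under load_folder.
names = ["notes.txt", "setup.inf", "data.csv", "readme.txt.bak"]
selected = [n for n in names if matches("{*.txt,*.inf}", n)]
print(selected)
```

Under this approximation only the names ending in .txt or .inf are kept. The same helper also accepts the shorter pattern `*.{txt,inf}`, which should select the same files; whether Spark treats that form identically is worth checking against your own data.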

More info:
https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#path-glob-filter
https://linuxhint.com/bash_globbing_tutorial/
