How to read several file types in Spark?
Question
I want to read files of several different types. Can I do it in one Spark operation, i.e., without a loop like this:
from pyspark.shell import spark

load_folder = '...'
general_df = None
for extension in ("*.txt", "*.inf"):
    # One read per extension, filtered by glob pattern
    df = spark.read.format("text") \
        .option("pathGlobFilter", extension) \
        .option("recursiveFileLookup", "true") \
        .load(load_folder)
    # Accumulate the per-extension DataFrames into one
    if general_df is None:
        general_df = df
    else:
        general_df = general_df.union(df)
general_df.show()
Answer 1
Score: 1
This should work:
from pyspark.shell import spark

load_folder = '...'
# A single brace glob matches both extensions in one read
df = spark.read.format("text") \
    .option("pathGlobFilter", "{*.txt,*.inf}") \
    .option("recursiveFileLookup", "true") \
    .load(load_folder)
df.show()
The pathGlobFilter option is backed by org.apache.hadoop.fs.GlobFilter, which appears to follow bash-style globbing. In particular, curly brackets express a set of alternative matches: {*.txt,*.inf} is essentially equivalent to *.(txt|inf).
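As a side note (my addition, not something the original answer shows): Spark resolves input paths through the same Hadoop glob expansion, so the brace alternation can also be written directly into the load path. This variant only matches files one directory level below the folder; for nested folders you still want recursiveFileLookup with pathGlobFilter as above. The folder name below is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

load_folder = '/data/input'  # hypothetical path; substitute your own

# Hadoop glob alternation embedded in the path itself:
# matches *.txt and *.inf one level below load_folder.
df = spark.read.text(load_folder + '/*.{txt,inf}')
df.show()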
More info:
https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#path-glob-filter
https://linuxhint.com/bash_globbing_tutorial/
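A quick way to sanity-check that both extensions were actually picked up (again my addition, not part of the original answer) is to tag each row with the file it came from:

from pyspark.sql.functions import input_file_name

# List the distinct source files behind the combined DataFrame
df.withColumn("source_file", input_file_name()) \
    .select("source_file").distinct() \
    .show(truncate=False)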