How to trim pyspark schema output

Question

My pyspark dataframe has the following schema...

DataFrame[ExternalData: struct<provider:string,data:string,modality:array<string>>]

If I write (where sdf is my pyspark dataframe)...

sdf.schema

I get...

StructType([StructField('ExternalData', StructType([StructField('provider', StringType(), True), StructField('data', StringType(), True), StructField('modality', ArrayType(StringType(), True), True)]), True)])

How can I get just the below?

StructType([StructField('provider', StringType(), True), StructField('data', StringType(), True), StructField('modality', ArrayType(StringType(), True), True)])

The subtle difference is that the outer ExternalData StructType and StructField have been removed. I need this because the system I'm integrating parquet with expects the parquet schema in this format, with the ExternalData field and struct passed elsewhere.

Does anyone have any advice?

Answer 1

Score: 1

Try this:

Your DataFrame schema:

root
 |-- ExternalData: struct (nullable = true)
 |    |-- provider: string (nullable = true)
 |    |-- data: string (nullable = true)
 |    |-- modality: array (nullable = true)
 |    |    |-- element: string (containsNull = true)

Select all the sub-columns of ExternalData to promote them to top-level columns, which gives the desired schema:

sdf = sdf.select("ExternalData.*")
sdf.printSchema()

Output:

root
 |-- provider: string (nullable = true)
 |-- data: string (nullable = true)
 |-- modality: array (nullable = true)
 |    |-- element: string (containsNull = true)

huangapple
  • Posted on May 17, 2023, 18:55:05
  • Please keep this link when reposting: https://go.coder-hub.com/76271318.html