Why is the .simpleString() method of a Spark schema truncating my output?

Question

I have a very long schema that I want to return as a string:

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

...

SparkSession spark = SparkSession.builder()
    .config(new SparkConf().setAppName("YourApp").setMaster("local"))
    .getOrCreate();

Dataset<Row> parquetData = spark.read().parquet("/Users/demo/test.parquet");

String schemaString = parquetData.schema().simpleString();

The problem is that the resulting schema looks like this (see "... 10 more fields"):

struct<test:struct<countryConfidence:struct<value:double>,... 10 more fields> etc etc>

Using:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.2.4</version>
</dependency>

Is there some configuration option I can use so that .simpleString() does not truncate? I've tried parquetData.schema().toDDL(), but it doesn't print the format I need.


Answer 1

Score: 1

If you take a deeper look inside the simpleString method, you can see that Spark uses a truncatedString helper, where SQLConf.get.maxToStringFields is passed as the third argument.
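
Conceptually, that helper behaves roughly like the following simplified Java sketch (an illustration of the truncation behaviour only, not Spark's actual implementation):

import java.util.List;
import java.util.stream.Collectors;

// Simplified illustration: once the number of entries exceeds the limit,
// the tail is dropped and replaced by a "... N more fields" placeholder.
final class TruncatedStringSketch {
    static String truncatedString(List<String> entries, String sep, int maxFields) {
        if (entries.size() <= maxFields) {
            return String.join(sep, entries);
        }
        int shown = Math.max(0, maxFields - 1);
        int dropped = entries.size() - shown;
        return entries.stream().limit(shown).collect(Collectors.joining(sep))
                + sep + "... " + dropped + " more fields";
    }
}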

The definition of this configuration is:

val MAX_TO_STRING_FIELDS = buildConf("spark.sql.debug.maxToStringFields")
  .doc("Maximum number of fields of sequence-like entries can be converted to strings " +
    "in debug output. Any elements beyond the limit will be dropped and replaced by a" +
    """ "... N more fields" placeholder.""")
  .version("3.0.0")
  .intConf
  .createWithDefault(25)

Solution

spark.sql.debug.maxToStringFields调整为高于25的数字,比如50(任意值,但应根据您的用例确定),例如:

SparkSession spark = SparkSession.builder()
  .appName("Spark app name")
  .master("local[*]")
  // Raise the field-truncation limit (the default is 25)
  .config("spark.sql.debug.maxToStringFields", 50)
  .getOrCreate();

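If you need the complete schema regardless of this debug limit, the JSON renderings of StructType serialize every field and are not affected by spark.sql.debug.maxToStringFields (a sketch; whether either format suits you depends on what you need the string for):

// Full schema as compact or pretty-printed JSON -- never truncated.
String jsonSchema = parquetData.schema().json();
String prettyJsonSchema = parquetData.schema().prettyJson();

// treeString() (the rendering printSchema() uses) should also list every
// field, though its indented format differs from simpleString().
String treeSchema = parquetData.schema().treeString();
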
Good luck!

