Why is the .simpleString() method of a Spark schema truncating my output?

Question


I have a very long schema that I want to return as a string:

    import org.apache.spark.SparkConf;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    ...
    SparkSession spark = SparkSession.builder().config(new SparkConf().setAppName("YourApp").setMaster("local")).getOrCreate();
    Dataset<Row> parquetData = spark.read().parquet("/Users/demo/test.parquet");
    String schemaString = parquetData.schema().simpleString();

The problem is that the resulting schema looks like this (note the "... 10 more fields" placeholder):

    struct<test:struct<countryConfidence:struct<value:double>,... 10 more fields> etc etc>

Using:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.2.4</version>
    </dependency>

Is there a configuration option I can use so that .simpleString() does not truncate? I've tried parquetData.schema().toDDL(), but it doesn't print the format I need.

Answer 1

Score: 1

If you take a deeper look inside the simpleString method, you can see that Spark uses truncatedString, where SQLConf.get.maxToStringFields is passed as the third argument.

This configuration is defined as follows:

    val MAX_TO_STRING_FIELDS = buildConf("spark.sql.debug.maxToStringFields")
      .doc("Maximum number of fields of sequence-like entries can be converted to strings " +
        "in debug output. Any elements beyond the limit will be dropped and replaced by a" +
        """ "... N more fields" placeholder.""")
      .version("3.0.0")
      .intConf
      .createWithDefault(25)
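
To see why the "... N more fields" placeholder shows up, here is a minimal, self-contained Java sketch that mimics the truncation behavior (a paraphrase written for illustration, not Spark's actual code; the exact cutoff logic may differ between versions):

    import java.util.List;
    import java.util.stream.Collectors;

    public class TruncationDemo {
        // Hypothetical mimic of Spark's internal truncatedString: keep roughly
        // the first (maxFields - 1) entries and summarize the rest.
        static String truncatedString(List<String> fields, int maxFields) {
            if (maxFields > 0 && fields.size() > maxFields) {
                int keep = Math.max(0, maxFields - 1);
                String head = fields.stream().limit(keep).collect(Collectors.joining(","));
                return "struct<" + head + ",... " + (fields.size() - keep) + " more fields>";
            }
            return "struct<" + String.join(",", fields) + ">";
        }

        public static void main(String[] args) {
            List<String> fields = List.of("a:int", "b:string", "c:double", "d:long");
            System.out.println(truncatedString(fields, 3));  // struct<a:int,b:string,... 2 more fields>
            System.out.println(truncatedString(fields, 25)); // struct<a:int,b:string,c:double,d:long>
        }
    }

With the default limit of 25, any struct holding more than 25 fields gets collapsed this way, which is what happens to the nested struct in the question.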

Solution

Tweak spark.sql.debug.maxToStringFields to a number higher than the default of 25, such as 50 (an arbitrary value; pick one based on your use case). For example:

    SparkSession spark = SparkSession.builder()
        .appName("Spark app name")
        .master("local[*]")
        .config("spark.sql.debug.maxToStringFields", 50)
        .getOrCreate();
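
If the session already exists, the same option can also be set at runtime through the session's runtime config (a hedged sketch: spark.sql.debug.maxToStringFields is a regular SQL conf rather than a static one, so it should take effect without recreating the session):

    // Assumes the `spark` and `parquetData` variables from the question;
    // runtime SQL confs can be set as strings.
    spark.conf().set("spark.sql.debug.maxToStringFields", "50");
    String schemaString = parquetData.schema().simpleString();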

Good luck!
