Why is the .simpleString() method of a Spark schema truncating my output?

Question
I have a very long schema that I want to return as a string.
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
...
SparkSession spark = SparkSession.builder()
    .config(new SparkConf().setAppName("YourApp").setMaster("local"))
    .getOrCreate();
Dataset<Row> parquetData = spark.read().parquet("/Users/demo/test.parquet");
String schemaString = parquetData.schema().simpleString();
The problem is that the resulting schema looks like this (note the "... 10 more fields" placeholder):
struct<test:struct<countryConfidence:struct<value:double>,... 10 more fields> etc etc>
Using:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.2.4</version>
</dependency>
Is there some configuration option I can use so that .simpleString() does not truncate? I've tried parquetData.schema().toDDL(), but it doesn't print the format I need.
Answer 1
Score: 1
If you take a deeper look inside the simpleString method, you can see that Spark uses a truncatedString helper, where SQLConf.get.maxToStringFields is passed as the third argument.
This configuration is defined as follows:
val MAX_TO_STRING_FIELDS = buildConf("spark.sql.debug.maxToStringFields")
.doc("Maximum number of fields of sequence-like entries can be converted to strings " +
"in debug output. Any elements beyond the limit will be dropped and replaced by a" +
""" "... N more fields" placeholder.""")
.version("3.0.0")
.intConf
.createWithDefault(25)
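The truncation behaviour described by that doc string can be sketched in plain Java. This is a simplified illustration of the idea, not Spark's actual truncatedString implementation (Spark's version differs in details, e.g. exactly how many entries it keeps):

```java
import java.util.List;
import java.util.stream.Collectors;

public class TruncatedStringDemo {
    // Simplified sketch of Spark's truncatedString helper: keep at most
    // maxFields entries and replace the rest with "... N more fields".
    static String truncatedString(List<String> fields, String sep, int maxFields) {
        if (fields.size() > maxFields) {
            int dropped = fields.size() - maxFields;
            return fields.stream().limit(maxFields).collect(Collectors.joining(sep))
                    + sep + "... " + dropped + " more fields";
        }
        return String.join(sep, fields);
    }

    public static void main(String[] args) {
        List<String> fields = List.of("a:int", "b:string", "c:double", "d:long");
        // With maxFields = 2, two entries are dropped.
        System.out.println(truncatedString(fields, ",", 2));
        // With maxFields at or above the field count, nothing is truncated.
        System.out.println(truncatedString(fields, ",", 25));
    }
}
```

This is why raising the limit makes the placeholder disappear: once maxToStringFields is at least the number of fields at each nesting level, the drop branch is never taken.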
Solution

Tweak spark.sql.debug.maxToStringFields to a number higher than the default of 25, such as 50 (an arbitrary value; choose one that fits your use case). For example:
SparkSession spark = SparkSession.builder()
.appName("Spark app name")
.master("local[*]")
.config("spark.sql.debug.maxToStringFields", 50)
.getOrCreate();
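Since spark.sql.debug.maxToStringFields is a regular (non-static) SQL configuration, it can also be changed on an already-running session through the runtime config. A small fragment, assuming the spark and parquetData variables from the question are in scope:

```java
// Runtime alternative: no need to rebuild the session (assumes the
// `spark` and `parquetData` variables from the question are in scope).
spark.conf().set("spark.sql.debug.maxToStringFields", "100");
String fullSchema = parquetData.schema().simpleString();
```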
Good luck!