How is the schema inferred in Spark?

Question


I have one CSV with the following data:

DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States, Romania, 1
United States, Ireland, 264
United States, India, 69

I am trying to understand how the second snippet (Snippet 2) below works. The Spark documentation says that data is not read or loaded until an action is called on the DataFrame.

In Snippet 1 I created a schema with the wrong types, and it prints exactly as declared because I have not called any action yet. In Snippet 2 I used inferSchema instead of a custom schema to load the CSV. My question is: how did I get the correct schema types without going through the data, given that I have not called any action yet? Note that count comes back as integer.

> Snippet 1

import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Deliberately wrong types: the two country columns actually hold strings.
val myManualSchema = new StructType(Array(
  StructField("DEST_COUNTRY_NAME", LongType, true),
  StructField("ORIGIN_COUNTRY_NAME", LongType, true),
  StructField("count", LongType, false)))

val csv2010 = getClass.getClassLoader.getResource("2010-summary.csv").getFile

// Only the declared schema is printed; no action is called here.
spark.read.format("csv")
  .option("header", "true")
  .option("mode", "FAILFAST")
  .schema(myManualSchema)
  .load(csv2010)
  .printSchema()

/* OUTPUT:
root
 |-- DEST_COUNTRY_NAME: long (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: long (nullable = true)
 |-- count: long (nullable = true)
*/

> Snippet 2

// csv2010 is the file path defined in Snippet 1.
spark.read.format("csv")
  .option("header", "true")
  .option("mode", "FAILFAST")
  .option("inferSchema", true)
  .load(csv2010)
  .printSchema()

/* OUTPUT:
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)
*/

Answer 1

Score: 1

I'm not sure exactly how the documentation words it, but while lazy evaluation is guaranteed for RDDs, it is up to the DataFrameReader whether loading reads any data.
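One way to check whether a particular load call touches the data is to count the Spark jobs it starts. The following is a minimal sketch, not an official diagnostic: the object name `LoadJobCounter`, the helper `report`, and the one-second sleep are illustrative assumptions, and the CSV path is taken from the first command-line argument. It assumes a local Spark installation on the classpath.

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper object, written for illustration only.
object LoadJobCounter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("load-job-counter")
      .getOrCreate()
    val path = args(0) // path to the CSV file (assumption: passed by the caller)

    // Count every job that Spark starts from now on.
    val jobsStarted = new AtomicInteger(0)
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
        jobsStarted.incrementAndGet()
      }
    })

    // Runs `body` and reports how many jobs were started while it ran.
    def report(label: String)(body: => Unit): Unit = {
      val before = jobsStarted.get()
      body
      // Listener events are delivered asynchronously, so wait briefly.
      // The sleep length is a rough assumption, not a guarantee.
      Thread.sleep(1000)
      println(s"$label: ${jobsStarted.get() - before} job(s) started")
    }

    val schema = StructType(Seq(
      StructField("DEST_COUNTRY_NAME", StringType, nullable = true),
      StructField("ORIGIN_COUNTRY_NAME", StringType, nullable = true),
      StructField("count", IntegerType, nullable = true)))

    // No action is called on either DataFrame; only load() runs.
    report("explicit schema") {
      spark.read.option("header", "true").schema(schema).csv(path)
    }
    report("inferSchema") {
      spark.read.option("header", "true").option("inferSchema", "true").csv(path)
    }

    spark.stop()
  }
}
```

If loading with inferSchema reads the file, the second line should report at least one job, while the explicit-schema load would be expected to report none. The exact counts can vary by Spark version and by how the files are listed, so treat the numbers as an indication rather than a contract.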

In practice, Spark's internal CSV reader does read the data when inferSchema is set to true: its schema-inference code calls the aggregate() action on the internal RDD to infer the types.
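To see why inference cannot avoid a pass over the data, here is a simplified, Spark-free sketch of the idea: assign each field the narrowest type that fits, then merge the per-row results column by column. This is not Spark's actual implementation. The names `InferSketch`, `ColType`, `typeOf`, and `merge` are made up for illustration, only three types are modelled, fields are trimmed before parsing (an assumption), and a local `foldLeft` stands in for the distributed `aggregate()` call.

```scala
import scala.util.Try

// Illustrative sketch only; Spark's real inference handles many more types.
object InferSketch {
  sealed trait ColType
  case object NullT extends ColType   // no value seen yet
  case object IntT extends ColType
  case object DoubleT extends ColType
  case object StringT extends ColType

  // Narrowest type that can represent a single raw field.
  def typeOf(raw: String): ColType = {
    val s = raw.trim
    if (s.isEmpty) NullT
    else if (Try(s.toInt).isSuccess) IntT
    else if (Try(s.toDouble).isSuccess) DoubleT
    else StringT
  }

  // Widen two candidate types to one that fits both.
  def merge(a: ColType, b: ColType): ColType = (a, b) match {
    case (x, y) if x == y => x
    case (NullT, x) => x
    case (x, NullT) => x
    case (IntT, DoubleT) | (DoubleT, IntT) => DoubleT
    case _ => StringT
  }

  // Every row must be visited, which is why inference needs a real pass.
  def inferSchema(rows: Seq[Array[String]], numCols: Int): Vector[ColType] = {
    val zero = Vector.fill[ColType](numCols)(NullT)
    rows.foldLeft(zero) { (acc, row) =>
      acc.indices.map { i =>
        val seen = if (i < row.length) typeOf(row(i)) else NullT
        merge(acc(i), seen)
      }.toVector
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      "United States, Romania, 1",
      "United States, Ireland, 264",
      "United States, India, 69").map(_.split(","))
    println(inferSchema(rows, 3)) // prints Vector(StringT, StringT, IntT)
  }
}
```

A single row such as `United States, Ireland, 2.5` would widen the last column to `DoubleT`, which is why the type of a column cannot be known until all rows (or a sample of them) have been examined.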

huangapple
  • Posted on January 3, 2020 at 18:27:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59576921.html