January 3, 2020 18:27:33

How is the schema inferred in Spark?

Question

I have a CSV file with the following data:

DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States, Romania, 1
United States, Ireland, 264
United States, India, 69

I am trying to understand how the second snippet (Snippet 2) below works. The Spark documentation says that data is not read or loaded until an action is called on the DataFrame.

In Snippet 1 I created a schema with deliberately wrong types, and printSchema() prints it as given because I have not called any action yet. In Snippet 2 I used inferSchema instead of a custom schema to load the CSV. My question is: how did I get the correct schema types without Spark going through the data, given that I have not called any action? Notice that the count column is reported as integer.

> Snippet 1

import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Deliberately wrong types: the country columns are really strings.
val myManualSchema = new StructType(Array(
  StructField("DEST_COUNTRY_NAME", LongType, true),
  StructField("ORIGIN_COUNTRY_NAME", LongType, true),
  StructField("count", LongType, false)))
val csv2010 = getClass.getClassLoader.getResource("2010-summary.csv").getFile
spark.read.format("csv")
  .option("header", "true")
  .option("mode", "FAILFAST")
  .schema(myManualSchema)
  .load(csv2010).printSchema()
/* OUTPUT:
root
 |-- DEST_COUNTRY_NAME: long (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: long (nullable = true)
 |-- count: long (nullable = true)
*/

> Snippet 2

spark.read.format("csv")
  .option("header", "true")
  .option("mode", "FAILFAST")
  .option("inferSchema", true)
  .load(csv2010).printSchema()
/* OUTPUT:
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)
*/

Answer 1

Score: 1

I'm not sure exactly how the documentation states it, but while laziness is guaranteed for RDDs, it is up to the DataFrameReader whether any reads happen at load time.

In practice, Spark's internal CSV reader does read the data when inferSchema is set to true: see the particular line of code that calls the aggregate() action on the internal RDD to infer the types. When you supply a schema yourself, as in Snippet 1, there is nothing to infer, so printSchema() can simply echo the declared types without touching the rows.
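
To make the point concrete, here is a minimal, plain-Scala sketch of what "aggregating over the rows to infer types" can look like. It is a simplified illustration under stated assumptions, not Spark's actual inference code: the names InferSketch, ColType, typeOf, merge and infer are all hypothetical, the type ladder is reduced to null, int, long, double and string, and cells are trimmed before parsing. What it shows is that every row has to be visited before a column's type is known, which is why inference cannot be lazy.

```scala
import scala.util.Try

// Simplified illustration only; NOT Spark's real CSV inference code.
// All names here are hypothetical and exist just for this sketch.
object InferSketch {
  sealed trait ColType
  case object NullT extends ColType
  case object IntT extends ColType
  case object LongT extends ColType
  case object DoubleT extends ColType
  case object StringT extends ColType

  // Widening order assumed for this sketch: a higher rank absorbs a lower one.
  private val rank: Map[ColType, Int] =
    Map(NullT -> 0, IntT -> 1, LongT -> 2, DoubleT -> 3, StringT -> 4)

  // Narrowest type that can represent a single cell (cell is trimmed first).
  def typeOf(raw: String): ColType = {
    val s = raw.trim
    if (s.isEmpty) NullT
    else if (Try(s.toInt).isSuccess) IntT
    else if (Try(s.toLong).isSuccess) LongT
    else if (Try(s.toDouble).isSuccess) DoubleT
    else StringT
  }

  // Combine two candidate types by keeping the wider one.
  def merge(a: ColType, b: ColType): ColType =
    if (rank(a) >= rank(b)) a else b

  // Fold over every row; each column starts at NullT and only ever widens.
  // Assumes every row has exactly numCols cells.
  def infer(rows: Seq[Seq[String]], numCols: Int): Vector[ColType] =
    rows.foldLeft(Vector.fill[ColType](numCols)(NullT)) { (acc, row) =>
      acc.zip(row).map { case (t, cell) => merge(t, typeOf(cell)) }
    }

  def main(args: Array[String]): Unit = {
    // Data rows from the question (header line omitted).
    val lines = Seq(
      "United States, Romania, 1",
      "United States, Ireland, 264",
      "United States, India, 69")
    val rows = lines.map(_.split(",").toSeq)
    println(infer(rows, 3)) // prints Vector(StringT, StringT, IntT)
  }
}
```

The count column comes out as an integer type only because all three values were parsed; a single later value such as 3000000000 would widen it to long in this sketch, which cannot be known without scanning the data.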
