Using map() and filter() in Spark instead of spark.sql

Question

I have two datasets that I want to INNER JOIN to give me a whole new table with the desired data. I used SQL and managed to get it. But now I want to try it with map() and filter(); is that possible?

This is my code using Spark SQL:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    
    object hello {
      def main(args: Array[String]): Unit = {
    
        val conf = new SparkConf()
          .setMaster("local")
          .setAppName("quest9")
    
        val sc = new SparkContext(conf)
        val spark = SparkSession.builder().appName("quest9").master("local").getOrCreate()
    
        val zip_codes = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/zip.csv")
        val census = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/census.csv")
    
        census.createOrReplaceTempView("census")
        zip_codes.createOrReplaceTempView("zip")
    
        //val query = spark.sql("SELECT * FROM census")
    
        val query = spark.sql("SELECT DISTINCT census.Total_Males AS male, census.Total_Females AS female FROM census INNER JOIN zip ON census.Zip_Code=zip.Zip_Code WHERE zip.City = 'Inglewood' AND zip.County = 'Los Angeles'")
    
        query.show()
    
        query.write.parquet("/home/hdfs/Documents/population/census/IDE/census.parquet")
    
        sc.stop()
      }
    }

Answer 1

Score: 1

The only sensible way, in general, to do this would be to use the join() method of `Dataset`. I would urge you to question the need to use only map/filter to do this, as it is not intuitive and will probably confuse any experienced Spark developer (or, simply put, make them roll their eyes). It may also lead to scalability issues should the dataset grow.

That said, in your use case it is pretty simple to avoid using join. Another possibility would be to issue two separate jobs to Spark:

  1. fetch the zip code(s) that interests you
  2. filter on the census data on that (those) zip code(s)

Step 1: collect the zip codes of interest (I'm not sure of the exact syntax, as I don't have a Spark shell at hand, but it should be trivial to find the right one).

    // needs `import spark.implicits._` in scope for the String encoder
    val codes: Seq[String] = zip_codes
      // filter on the city
      .filter(row => row.getAs[String]("City").equals("Inglewood"))
      // filter on the county
      .filter(row => row.getAs[String]("County").equals("Los Angeles"))
      // map each row to its zip code as a String
      .map(row => row.getAs[String]("Zip_Code"))
      // collect on the driver side
      .collect()

Then again, writing it this way instead of using select/where will look pretty strange to anyone used to Spark.

Yet, the reason this will work is that the set of zip codes matching a given town and county is sure to be really small, so it is safe to collect the result on the driver side.

Now on to step 2:

    census.filter(row => codes.contains(row.getAs[String]("Zip_Code")))
          .map( /* whatever to get your data out */ )
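
Putting the two steps together, here is a minimal end-to-end sketch under the question's setup (the column names Total_Males/Total_Females and the file paths come from the question; the final map is just one possible projection, since the answer deliberately leaves it open):

    import org.apache.spark.sql.SparkSession

    object MapFilterOnly {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("quest9").master("local").getOrCreate()
        import spark.implicits._ // encoders needed by map() on Datasets

        val zip_codes = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/zip.csv")
        val census = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/census.csv")

        // Step 1: collect the matching zip code(s) on the driver
        val codes: Seq[String] = zip_codes
          .filter(row => row.getAs[String]("City").equals("Inglewood"))
          .filter(row => row.getAs[String]("County").equals("Los Angeles"))
          .map(row => row.getAs[String]("Zip_Code"))
          .collect()

        // Step 2: filter the census rows on those codes and project two columns
        val result = census
          .filter(row => codes.contains(row.getAs[String]("Zip_Code")))
          .map(row => (row.getAs[String]("Total_Males"), row.getAs[String]("Total_Females")))
          .toDF("male", "female")
          .distinct()

        result.show()
        spark.stop()
      }
    }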

Answer 2

Score: 0

What you need is a join; your query roughly translates to:

    // needs `import org.apache.spark.sql.functions.broadcast`
    // and `import spark.implicits._` for the $"..." column syntax
    census.as("census")
      .join(
        broadcast(zip_codes
          .where($"City" === "Inglewood")
          .where($"County" === "Los Angeles")
          .as("zip")),
        Seq("Zip_Code"),
        "inner" // "leftsemi" would also be sufficient
      )
      .select(
        $"census.Total_Males".as("male"),
        $"census.Total_Females".as("female")
      ).distinct()
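
To check that the broadcast hint actually took effect (my addition, not part of the original answer), Spark can print the physical plan; assuming the expression above is bound to a val named `query` (a hypothetical name):

    query.explain()
    // Look for a BroadcastHashJoin operator in the printed plan: it confirms
    // the filtered zip_codes table was shipped to every executor instead of shuffled.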

huangapple
  • Posted on 2020-01-03 22:29:19
  • Please keep this link when reposting: https://go.coder-hub.com/59580313.html