Using a column after renaming it in Apache Spark

Question

I am trying to understand why Spark behaves differently in two seemingly identical situations. I renamed two columns and tried to use both of them in calculations, but one statement throws an error saying the renamed column cannot be found. Here is the code:

intermediateDF = intermediateDF.drop("GEO.id")
                               .withColumnRenamed("GEO.id2", "id")
                               .withColumnRenamed("GEO.display-label", "label")
                               .withColumn("stateid", functions.expr("int(id/1000)"))
                               .withColumn("countyId", functions.expr("id%1000"))
                               //.withColumn("countyState", functions.split(intermediateDF.col("label"), ","))
                               .withColumnRenamed("rescen42010", "real2010")
                               .drop("resbase42010")
                               .withColumnRenamed("respop72010", "est2010")
                               .withColumnRenamed("respop72011", "est2011")
                               .withColumnRenamed("respop72012", "est2012")
                               .withColumnRenamed("respop72013", "est2013")
                               .withColumnRenamed("respop72014", "est2014")
                               .withColumnRenamed("respop72015", "est2015")
                               .withColumnRenamed("respop72016", "est2016")
                               .withColumnRenamed("respop72017", "est2017");

The commented-out line is the one that throws the error below:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "label" among (GEO.id, GEO.id2, GEO.display-label, rescen42010, resbase42010, respop72010, respop72011, respop72012, respop72013, respop72014, respop72015, respop72016, respop72017);

Can someone help me understand why Spark can find one renamed column (GEO.id2 renamed to id) and run calculations on it, but fails on the other (GEO.display-label renamed to label)? I am using Apache Spark 3 with Java. Thanks.

Answer 1

Score: 0

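Try this syntax (col here is functions.col, for example via a static import):

 .withColumn("countyState", functions.split(col("label"), ","))

It should work. col("label") is an unresolved reference that Spark resolves against the DataFrame at that point in the chain, after the rename. intermediateDF.col("label"), by contrast, is resolved immediately against the original intermediateDF, which has no column named label yet.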
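
This also explains why the id column worked in the question: functions.expr("int(id/1000)") is just text that Spark parses and resolves later, when withColumn is applied to the already-renamed DataFrame. Here is a minimal sketch of the eager versus deferred lookup in plain Java. It is a toy model only: ToyFrame, renamed, and withColumnFrom are hypothetical names made up for illustration and are not part of Spark's API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a hypothetical stand-in for an immutable DataFrame, NOT Spark's API.
final class ToyFrame {
    final List<String> columns;

    ToyFrame(List<String> columns) {
        this.columns = List.copyOf(columns);
    }

    // Models withColumnRenamed: returns a NEW frame and leaves this one unchanged.
    ToyFrame renamed(String from, String to) {
        List<String> out = new ArrayList<>(columns);
        int i = out.indexOf(from);
        if (i >= 0) {
            out.set(i, to);
        }
        return new ToyFrame(out);
    }

    // Models df.col(name): the name is checked immediately against THIS frame.
    String col(String name) {
        if (!columns.contains(name)) {
            throw new IllegalArgumentException(
                "Cannot resolve column name \"" + name + "\" among " + columns);
        }
        return name;
    }

    // Models withColumn(newName, functions.col(name)): the source name is only
    // looked up here, against the frame this method is called on.
    ToyFrame withColumnFrom(String newName, String sourceName) {
        col(sourceName);
        List<String> out = new ArrayList<>(columns);
        out.add(newName);
        return new ToyFrame(out);
    }
}

class ColumnResolutionDemo {
    public static void main(String[] args) {
        ToyFrame df = new ToyFrame(List.of("GEO.id2", "GEO.display-label"));

        // Deferred lookup: "label" exists by the time it is needed.
        ToyFrame ok = df.renamed("GEO.display-label", "label")
                        .withColumnFrom("countyState", "label");
        System.out.println(ok.columns);

        // Eager lookup: the variable df still points at the frame from before
        // the rename, so df.col("label") fails while the argument is evaluated.
        try {
            df.renamed("GEO.display-label", "label")
              .withColumnFrom("countyState", df.col("label"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the model is that the variable on the left of the assignment is not updated until the whole chain has finished, so anything that reads from that variable inside the chain still sees the old column names.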

Answer 2

Score: 0

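Check the code below. It does the renames in a single select (assuming static imports of functions.col and functions.split). Column names that contain dots are quoted with backticks inside col(...) and expr(...). The stateid and countyId expressions refer to the source column GEO.id2 rather than to the alias id, because older Spark 3 releases do not let an expression use an alias defined in the same select.

  intermediateDF.select(
      col("`GEO.id2`").alias("id"),
      functions.expr("int(`GEO.id2`/1000)").alias("stateid"),
      functions.expr("`GEO.id2`%1000").alias("countyId"),
      split(col("`GEO.display-label`"), ",").alias("countyState"),
      col("rescen42010").alias("real2010"),
      col("respop72010").alias("est2010"),
      col("respop72011").alias("est2011"),
      col("respop72012").alias("est2012"),
      col("respop72013").alias("est2013"),
      col("respop72014").alias("est2014"),
      col("respop72015").alias("est2015"),
      col("respop72016").alias("est2016"),
      col("respop72017").alias("est2017"));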


huangapple
  • Published on July 31, 2020 at 12:56:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63186007.html