2023-06-30 02:23:50

Unable to alter column name for a Hudi table in AWS

Question

I'm unable to rename a column of a Hudi table. The following statement fails:
spark.sql("ALTER TABLE customer_db.customer RENAME COLUMN subid TO subidentifier")
It raises the following error:
RENAME COLUMN is only supported with v2 tables


To Reproduce

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.JavaConverters._
import scala.collection.mutable

object ReportingJob {

  var spark: SparkSession = _
  var glueContext: GlueContext = _

  def main(inputParams: Array[String]): Unit = {

    val args: Map[String, String] = GlueArgParser.getResolvedOptions(inputParams, Seq("JOB_NAME").toArray)
    val sysArgs: mutable.Map[String, String] = scala.collection.mutable.Map(args.toSeq: _*)
   
    implicit val glueContext: GlueContext = init(sysArgs)
    implicit val spark: SparkSession = glueContext.getSparkSession

    import spark.implicits._
     
    val partitionColumnName: String = "id"
    val hudiTableName: String = "Customer"
    val preCombineKey: String = "id"
    val recordKey = "id"
    val basePath = "s3://aws-amazon-uk/customer/production/"

    val df = Seq((123, "1", "seq1"), (124, "0", "seq2")).toDF("id", "subid", "subseq")
    
    val hudiCommonOptions: Map[String, String] = Map(
      "hoodie.table.name" -> hudiTableName,
      "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
      "hoodie.datasource.write.precombine.field" -> preCombineKey,
      "hoodie.datasource.write.recordkey.field" -> recordKey,
      "hoodie.datasource.write.operation" -> "bulk_insert",
      // "hoodie.datasource.write.operation" -> "upsert",
      "hoodie.datasource.write.row.writer.enable" -> "true",
      "hoodie.datasource.write.reconcile.schema" -> "true",
      "hoodie.datasource.write.partitionpath.field" -> partitionColumnName,
      "hoodie.datasource.write.hive_style_partitioning" -> "true",
      // "hoodie.bulkinsert.shuffle.parallelism" -> "2000",
      // "hoodie.upsert.shuffle.parallelism" -> "400",
      "hoodie.datasource.hive_sync.enable" -> "true",
      "hoodie.datasource.hive_sync.table" -> hudiTableName,
      "hoodie.datasource.hive_sync.database" -> "customer_db",
      "hoodie.datasource.hive_sync.partition_fields" -> partitionColumnName,
      "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
      "hoodie.datasource.hive_sync.use_jdbc" -> "false",
      "hoodie.combine.before.upsert" -> "true",
      "hoodie.avro.schema.external.transformation" -> "true",
      "hoodie.schema.on.read.enable" -> "true",
      "hoodie.datasource.write.schema.allow.auto.evolution.column.drop" -> "true",
      "hoodie.index.type" -> "BLOOM",
      "spark.hadoop.parquet.avro.write-old-list-structure" -> "false",
      DataSourceWriteOptions.TABLE_TYPE.key() -> "COPY_ON_WRITE"
    )


 
    df.write.format("org.apache.hudi")
      .options(hudiCommonOptions)
      .mode(SaveMode.Overwrite)
      .save(basePath + hudiTableName)

    spark.sql("ALTER TABLE customer_db.customer RENAME COLUMN subid TO subidentifier")
    commit()
  }

  def commit(): Unit = {
    Job.commit()
  }


  def init(sysArgs: mutable.Map[String, String]): GlueContext = {

    val conf = new SparkConf()

    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
    conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
    conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
    conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
    conf.set("spark.sql.avro.datetimeRebaseModeInRead", "CORRECTED")
    val sparkContext = new SparkContext(conf)
    glueContext = new GlueContext(sparkContext)
    Job.init(sysArgs("JOB_NAME"), glueContext, sysArgs.asJava)
    glueContext

  }
}

Steps to reproduce the behavior:

I'm running the above code as an AWS Glue job.
The Dependent JARs path contains:
hudi-spark3-bundle_2.12-0.12.1
calcite-core-1.16.0
libfb303-0.9.3
Run the above code.

Expected behavior

spark.sql("ALTER TABLE customer_db.customer RENAME COLUMN subid TO subidentifier") should rename the column. If this is not supported, could you suggest another way to rename a column of a Hudi table?

Environment Description

Hudi version: 0.12.1
Spark version: 3.3
Glue version: 4

JARs used:
hudi-spark3-bundle_2.12-0.12.1
calcite-core-1.16.0
libfb303-0.9.3

Storage (HDFS/S3/GCS..): S3
Running on Docker? (yes/no): no


Stacktrace

Exception in User Class: org.apache.spark.sql.AnalysisException : RENAME COLUMN is only supported with v2 tables.
at org.apache.spark.sql.errors.QueryCompilationErrors$.operationOnlySupportedWithV2TableError(QueryCompilationErrors.scala:506) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:94) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:49) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUpWithPruning$3(AnalysisHelper.scala:138) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:177) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUpWithPruning$1(AnalysisHelper.scala:138) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:323) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUpWithPruning(AnalysisHelper.scala:134) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUpWithPruning$(AnalysisHelper.scala:130) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUpWithPruning(LogicalPlan.scala:30) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:111) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:110) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:30) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog.apply(ResolveSessionCatalog.scala:49) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog.apply(ResolveSessionCatalog.scala:43) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]

Answer 1

Score: 1

I see you didn't set spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog in your Spark conf. This is needed to use the V2 relation and to benefit from the schema evolution feature.
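
One way to apply this to the job in the question is to add the catalog setting to the SparkConf before the SparkContext is created. The sketch below is a minimal illustration, not a verified fix for this job. The object name HudiSqlSettings is hypothetical, and the spark.sql.extensions entry is an assumption based on Hudi's general Spark setup guidance; it is not mentioned in this answer.

```scala
// Hypothetical helper: Spark settings to apply before the SparkContext is created.
object HudiSqlSettings {
  val settings: Map[String, String] = Map(
    // Already present in the question's init().
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    // From the answer above: route spark_catalog through Hudi's catalog
    // so that ALTER TABLE ... RENAME COLUMN resolves against a V2 table.
    "spark.sql.catalog.spark_catalog" -> "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
    // Assumption (not in the answer): Hudi's SQL session extension.
    "spark.sql.extensions" -> "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
  )
}
```

In the question's init(), these could be applied with HudiSqlSettings.settings.foreach { case (k, v) => conf.set(k, v) } placed before new SparkContext(conf). Whether the rename then succeeds also depends on hoodie.schema.on.read.enable, which the question already sets to true as a write option.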

Answer 2

Score: 0

A few things:

As you are using Glue 4.0, you don't need to add any external Hudi JARs. Glue 4.0 supports Hudi version 0.12.1.
Most importantly, to enable Hudi you need to add the Glue job parameter --datalake-formats with the value hudi.
You need to set spark.serializer=org.apache.spark.serializer.KryoSerializer and spark.sql.hive.convertMetastoreParquet=false. These settings help Spark handle Hudi tables correctly. You can set them on the SparkConf when initializing the SparkSession, or add them as a job parameter with the key --conf and the value spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false.

You can find all of these details in the Glue documentation.
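
The job parameters described above can be written out as key/value pairs. The sketch below only illustrates how the single --conf value is chained together from several Spark settings. The object name GlueHudiJobParameters and its fields are hypothetical; the keys and values are the ones named in this answer.

```scala
// Hypothetical helper that assembles the Glue job parameters named in the answer.
object GlueHudiJobParameters {
  // Spark settings the answer asks for.
  val sparkSettings: Seq[(String, String)] = Seq(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.hive.convertMetastoreParquet" -> "false"
  )

  // The answer passes one --conf job parameter, so additional settings
  // are chained inside its value, separated by " --conf ".
  val confValue: String =
    sparkSettings.map { case (k, v) => s"$k=$v" }.mkString(" --conf ")

  // Key/value pairs to enter as job parameters on the Glue job.
  val jobParameters: Map[String, String] = Map(
    "--datalake-formats" -> "hudi",
    "--conf" -> confValue
  )
}
```

With these two settings, confValue matches the value given in the answer: spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false.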
