How to find all the options when read/write with Spark for a specific format?

Question

Is there a way to find all the available options when reading or writing with Spark for a specific format? I assume they must be defined somewhere in the source code, but I can't find them.

Below is my code that uses Spark to read data from HBase. It works fine, but I want to know where the options hbase.columns.mapping and hbase.table come from. Are there any other options?

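  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.spark.HBaseContext
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local").getOrCreate()
  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set("hbase.zookeeper.quorum", "vftsandbox-namenode,vftsandbox-snamenode,vftsandbox-node03")

  // Instantiated for its side effect; the resulting object is not used directly below.
  new HBaseContext(spark.sparkContext, hbaseConf)

  val hbaseTable = "mytable"
  // Each entry maps a DataFrame column (name and type) to the row key or to a family:qualifier pair.
  val columnMapping =
    """id STRING :key,
      mycfColumn1 STRING mycf:column1,
      mycfColumn2 STRING mycf:column2,
      mycfCol1 STRING mycf:col1,
      mycfCol3 STRING mycf:col3
      """
  val hbaseSource = "org.apache.hadoop.hbase.spark"

  // The two options below are the ones this question is about.
  val hbaseDF = spark.read.format(hbaseSource)
    .option("hbase.columns.mapping", columnMapping)
    .option("hbase.table", hbaseTable)
    .load()
  hbaseDF.show()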

I mean, for format("csv") or format("json") there is documentation online that lists all the options, but for this specific format (org.apache.hadoop.hbase.spark) I have had no luck. Even for csv or json, the options documented online must ultimately come from the code, right? They can't have simply been made up.

So I think the real question is "how do I find all the Spark options in the source code in general". I tried the IntelliJ IDEA search tool across all places (including the library sources) but have had no luck so far. I can't find anything related to hbase.columns.mapping or hbase.table at all (I also tried hbase_columns_mapping), and there is nothing related in org.apache.hadoop.hbase.spark either; the only occurrences are in my own code.

I also found these lines in the console after running the code. However, the HBaseRelation class only shows up as a "decompiled" class full of ???.

17:53:51.205 [main] DEBUG org.apache.spark.util.ClosureCleaner -      HBaseRelation(Map(hbase.columns.mapping -> id STRING :key,
      mycfColumn1 STRING mycf:column1,
      mycfColumn2 STRING mycf:column2,
      mycfCol1 STRING mycf:col1,
      mycfCol3 STRING mycf:col3
      , hbase.table -> mytable),None)

It may be that these options only exist at runtime or compile time, but I'm not sure.

Answer 1

Score: 1

Because non-built-in formats are implemented in arbitrary code, there is unfortunately no guaranteed way to find their options other than going through the available documentation and source code.
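
To make the point about "arbitrary code" concrete: the values passed to .option(...) reach a data source as string key/value pairs, and the implementation looks up whichever keys it cares about. The sketch below models that pattern in plain Scala with no Spark dependency. DemoConnectorOptions, its key names, and parse are hypothetical names invented for illustration; they are not taken from the HBase connector.

```scala
// Hypothetical connector-side option handling; none of these names come from a real library.
object DemoConnectorOptions {
  // Connectors often keep their option keys as string constants like these,
  // which is what you end up hunting for in the source code.
  val TableKey = "demo.table"
  val BatchSizeKey = "demo.batch.size"

  final case class Parsed(table: String, batchSize: Int)

  // `options` stands in for the key/value pairs collected from .option(...) calls.
  def parse(options: Map[String, String]): Parsed = {
    // A required option: fail loudly when it is missing.
    val table = options.getOrElse(
      TableKey,
      throw new IllegalArgumentException(s"Missing required option '$TableKey'"))
    // An optional option with a default value.
    val batchSize = options.get(BatchSizeKey).map(_.toInt).getOrElse(1000)
    Parsed(table, batchSize)
  }
}

object DemoConnectorOptionsExample extends App {
  // Keys the code never looks up ("unknown.key" here) are simply ignored in this sketch.
  println(DemoConnectorOptions.parse(Map("demo.table" -> "mytable", "unknown.key" -> "ignored")))
}
```

Nothing outside such code knows which keys are valid, which is why documentation and source are the only reliable references.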

For example, the steps below lead to the HBase Connector options.

  1. Search for the HBase Connector documentation/source code online.
  2. Notice that the documentation mentions the HBaseTableCatalog object; have a look at its definition.
  3. Notice that the repository's readme file and various code snippets online mention other options, such as hbase.spark.pushdown.columnfilter; find out where they are defined in the repository. In this case, that option is defined in the HBaseSparkConf object.
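
When the IDE only shows a decompiled class, as in the question, one rough way to carry out the last step without the sources is to scan the connector jar for the option key itself. String literals are normally stored as plain bytes inside class files, so a byte search can reveal which classes mention a key. The sketch below uses only the JDK's java.util.zip; JarStringSearch and entriesContaining are hypothetical helper names, and it will miss keys that are assembled at runtime rather than written as a single literal.

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.ZipFile
import scala.collection.JavaConverters._

// Hypothetical helper, not part of Spark or the HBase connector.
object JarStringSearch {

  // Reads a stream fully into memory; fine for class files, which are small.
  private def readAll(in: InputStream): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val buffer = new Array[Byte](8192)
    var n = in.read(buffer)
    while (n != -1) {
      out.write(buffer, 0, n)
      n = in.read(buffer)
    }
    out.toByteArray
  }

  // Returns the names of the jar entries whose raw bytes contain `needle`.
  def entriesContaining(jarPath: String, needle: String): List[String] = {
    val target = needle.getBytes(StandardCharsets.UTF_8).toSeq
    val jar = new ZipFile(jarPath)
    try {
      jar.entries().asScala
        .filterNot(_.isDirectory)
        .filter { entry =>
          val in = jar.getInputStream(entry)
          try readAll(in).toSeq.containsSlice(target) finally in.close()
        }
        .map(_.getName)
        .toList // force evaluation before the jar is closed
    } finally jar.close()
  }
}
```

A call such as JarStringSearch.entriesContaining("/path/to/connector.jar", "hbase.table").foreach(println), where the jar path is a placeholder, lists the entries worth opening in the repository.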

Also, please note that writing and reading operations may have different sets of options.

huangapple
  • Posted on June 2, 2023, 10:42:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76386823.html