How to create Predicate for reading data using Spark SQL in Scala

Question

I can read the Oracle table using this simple Scala program:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.sql.sources.partitionColumnTypeInference.enabled", false)
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", 4)
  .config("spark.task.cpus", 1)
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcl")
  .option("dbtable", "big_table")
  .option("user", "test")
  .option("password", "123456")
  .load()

jdbcDF.show()

However, the table is huge and each node has to read only part of it, so I must use a hash function to distribute the data among the Spark nodes. Spark supports this through JDBC predicates. I already did this in Python: the table has a column named NUM, and a hash function maps each of its values to an integer between 0 and num_partitions. The predicate list is built like this:

hash_function = lambda x: 'ora_hash({}, {})'.format(x, num_partitions)
hash_df = connection.read_sql_full(
    'SELECT distinct {0} hash FROM {1}'.format(hash_function(var.hash_col), source_table_name))
hash_values = list(hash_df.loc[:, 'HASH'])

hash_values for num_partitions=19 is:

hash_values=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]

predicates = [
    "to_date({0},'YYYYMMDD','nls_calendar=persian')= to_date({1} ,'YYYYMMDD','nls_calendar=persian') " \
    "and hash_func({2},{3}) = {4}"
        .format(partition_key, current_date, hash_col, num_partitions, hash_val) for hash_val in
    hash_values]

Then I read the table based on the predicates like this:

dataframe = spark.read \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .jdbc(url=spark_url,
          table=table_name,
          predicates=predicates)

Would you please guide me on how to create the predicates list in Scala, the same way I did it in Python?

Any help is really appreciated.


Answer 1

Score: 1

Problem solved.

I changed the code as follows, and then it works:

import org.apache.spark.sql.SparkSession
import java.sql.Connection
import oracle.jdbc.pool.OracleDataSource

object main extends App {

  def read_spark(): Unit = {
    val numPartitions = 19
    val partitionColumn = "name"
    val field_date = "test"
    val current_date = "********"
    // Define the JDBC connection properties
    val url = "jdbc:oracle:thin:@//x.x.x.x:1521/orcl"
    val properties = new java.util.Properties()
    properties.put("url", url)
    properties.put("user", "user")
    properties.put("password", "pass")
    // Define the WHERE clauses that assign each row to a partition.
    // ora_hash(col, N) returns bucket values 0..N inclusive, so use an
    // inclusive range so that no bucket (and no row) is skipped.
    val predicateFct = (partition: Int) => s"""ora_hash("$partitionColumn",$numPartitions) = $partition"""
    val predicates = (0 to numPartitions).map{partition => predicateFct(partition)}.toArray

    val test_table = s"(SELECT * FROM table where $field_date=$current_date) dbtable"
    // Load the table into Spark, one JDBC query per predicate
    val df = spark.read
      .format("jdbc")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .option("dbtable", test_table)
      .jdbc(url, test_table, predicates, properties)
    df.show()
  }
  val spark = SparkSession
    .builder
    .master("local[4]")
    .config("spark.sql.sources.partitionColumnTypeInference.enabled", false)
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", 4)
    .config("spark.task.cpus", 1)
    .appName("Spark SQL基本示例")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()

  read_spark()

}
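
For comparison with the Python version in the question, a more literal Scala translation would keep the to_date filter inside each predicate instead of moving it into a subquery. The following is only a sketch: it assumes ora_hash as the Oracle-side hash function (standing in for the question's hash_func), and the column names, date literal, and connection credentials are placeholders rather than values from the original post.

import org.apache.spark.sql.SparkSession

object PredicatesSketch extends App {
  val spark = SparkSession.builder
    .master("local[4]")
    .appName("predicates sketch")
    .getOrCreate()

  val numPartitions = 19
  val partitionKey  = "PARTITION_KEY"   // placeholder: date column compared against the current date
  val hashCol       = "HASH_COL"        // placeholder: column fed to ora_hash
  val currentDate   = "'14011119'"      // placeholder date literal in YYYYMMDD form

  // ora_hash(col, N) yields buckets 0..N inclusive, hence the inclusive range:
  // one WHERE clause per bucket, so every row falls into exactly one partition.
  val predicates: Array[String] = (0 to numPartitions).map { hashVal =>
    s"to_date($partitionKey,'YYYYMMDD','nls_calendar=persian') = " +
      s"to_date($currentDate,'YYYYMMDD','nls_calendar=persian') " +
      s"and ora_hash($hashCol, $numPartitions) = $hashVal"
  }.toArray

  val props = new java.util.Properties()
  props.put("user", "user")             // placeholder credentials
  props.put("password", "pass")
  props.put("driver", "oracle.jdbc.driver.OracleDriver")

  // Each predicate becomes the WHERE clause of one JDBC query and one Spark partition.
  val df = spark.read.jdbc("jdbc:oracle:thin:@//x.x.x.x:1521/orcl", "big_table", predicates, props)
  println(df.rdd.getNumPartitions)      // equals predicates.length
  df.show()
}

Because every row matches exactly one predicate, the union of the partitions is the same data a plain read would return; a quick sanity check is to compare df.count() against an unpartitioned read of the same table.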


