将参数传递给使用 `spark.read.format(jdbc)` 格式的查询。

huangapple go评论84阅读模式
英文:

Pass parameters to query with spark.read.format(jdbc) format

问题

我正在尝试通过spark.read.format("jdbc")在Redshift中执行以下示例查询:

query = "select * from tableA join tableB on a.id = b.id where a.date > ? and b.date > ?"

我尝试执行此查询如下:

# 从表中读取查询参数并将其作为数据框获取
query_parameters = ("2022-01-01", "2023-01-01")

df = spark.read.format("jdbc") \
  .option("url", url) \
  .option("query", query) \
  .option("user", "user") \
  .option("password", pass) \
  .option("queryParameters", ",".join(query_parameters)) \
  .option("driver", driver) \
  .load()

我收到以下错误:

com.amazon.redshift.util.RedshiftException: ERROR: invalid input syntax for type date: "?"

在使用pyspark中的spark.read执行查询时,我们可以使用?作为占位符来传递参数。但是,你需要将占位符替换为具体的参数值。在你的示例中,你应该使用query_parameters中的日期值来替换query中的?。以下是修复后的代码示例:

query = "select * from tableA join tableB on a.id = b.id where a.date > ? and b.date > ?"

query_parameters = ("2022-01-01", "2023-01-01")

df = spark.read.format("jdbc") \
  .option("url", url) \
  .option("query", query) \
  .option("user", "user") \
  .option("password", pass) \
  .option("queryParameters", ",".join(query_parameters)) \
  .option("driver", driver) \
  .load()

这将会将占位符?替换为实际的日期值,以便正确执行查询。

英文:

I am trying to execute following sample query throug spark.read.format("jdbc") in redshift

query="select * from tableA join tableB on a.id = b.id where a.date > ? and b.date > ?"

I am trying to execute this query as follows:

> I read query_parameters from a table and I am getting a dataframe and then made it as a list by using df.collect() statement and then converting to tuple to pass it as parameters

query_parameters = ("2022-01-01","2023-01-01")

df = spark.read.format("jdbc") \
  .option("url", url) \
  .option("query",query) \
  .option("user","user") \
  .option("password", pass) \
  .option("queryParameters", ",".join(query_parameters)) \
  .option("driver", driver) \
  .load()

I am getting following error:

com.amazon.redshift.util.RedshiftException: ERROR: invalid input syntax for type date: \"?\"\r\n\tat

How do we pass parameters to the query when we are executing through spark.read in pyspark

答案1

得分: 1

根据官方 spark-redshift 实现,似乎没有名为 queryParameters 的选项可用。不确定你在哪里找到它,但我在官方的 github 代码 中找不到它。

将参数传递给查询的唯一方法是通过 Python 字符串拼接或插值,并设置驱动程序的 query 选项,如在相应的测试套件 RedshiftReadSuite.scala 中所示。

对于上面的示例,应该可以这样做:

query_parameters = ("2022-01-01", "2023-01-01")

query = f"select * from tableA join tableB on a.id = b.id where a.date > '{query_parameters[0]}' and b.date > '{query_parameters[1]}'"

df = spark.read.format("jdbc") \
  .option("url", url) \
  .option("query", query) \
  .option("user", "user") \
  .option("password", pass) \
  .option("driver", driver) \
  .load()
英文:

According to official spark-redshift implementation, it seems that there is no option named queryParameters available. Not sure where you found it, but I wasn't able to find it in the official github code.

The only way to pass parameters to your query is through Python string concatenation or interpolation and setting the query option of the driver, as shown in the corresponding test suite RedshiftReadSuite.scala.

For the above example this should work:

query_parameters = ("2022-01-01","2023-01-01")

query=f"select * from tableA join tableB on a.id = b.id where a.date > '{query_parameters[0]}'' and b.date > '{query_parameters[1]}'"

df = spark.read.format("jdbc") \
  .option("url", url) \
  .option("query",query) \
  .option("user","user") \
  .option("password", pass) \
  .option("driver", driver) \
  .load()

huangapple
  • 本文由 发表于 2023年7月23日 15:57:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/76747189.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定