How to drop table using pyspark jdbc connector to teradata?

Question

I can select from the Teradata database, but I cannot drop a table using PySpark.

I have also used jaydebeapi to drop the table in the same Spark session, and that works (a sketch of that approach follows the error below).
Was hoping someone may have encountered the same issue.

drop_sql = """ (DROP TABLE <DB_NAME>.<TABLENAME>) """


conn = spark.read \
.format("jdbc") \
.option("driver","com.teradata.jdbc.TeraDriver") \
.option("url","jdbc:teradata://<IP_ADDRESS>/DATABASE=. <DB_NAME>,TMODE=ANSI,CHARSET=UTF8,TYPE=FASTLOAD,LOGMECH=LDAP") \
.option("query", drop_sql) \
.option("user", user) \
.option("password",password)\
.option("fetchsize",10000).load()

ERROR:

Py4JJavaError: An error occurred while calling o265.load.
: java.sql.SQLException: [Teradata Database] [TeraJDBC 17.20.00.15] [Error 3707] [SQLState 42000] Syntax error, expected something like a name or a Unicode delimited identifier or an 'UDFCALLNAME' keyword or a 'SELECT' keyword or '(' between '(' and the 'DROP' keyword.
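
For reference, the jaydebeapi route that does work in the same session looks roughly like this (a sketch; the jar path is a placeholder for wherever terajdbc4.jar lives on your cluster):

import jaydebeapi

# Working alternative: issue the DROP through the Teradata JDBC driver directly.
conn = jaydebeapi.connect(
    "com.teradata.jdbc.TeraDriver",
    "jdbc:teradata://<IP_ADDRESS>/DATABASE=<DB_NAME>,TMODE=ANSI,CHARSET=UTF8,LOGMECH=LDAP",
    [user, password],
    jars="/path/to/terajdbc4.jar",  # placeholder path
)
try:
    curs = conn.cursor()
    curs.execute("DROP TABLE <DB_NAME>.<TABLENAME>")  # plain DDL, no wrapping SELECT
    curs.close()
finally:
    conn.close()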

Answer 1

Score: 1

spark.read provides a higher-level language. It's not a Teradata driver for Python.

  • Queries you pass to spark.read.format('jdbc').option('query', '...') can only contain SELECT statements.
  • Whatever you provide there is in turn wrapped in an outer SELECT by Spark before it's sent to the underlying driver for execution. E.g.:

spark.read.format("jdbc") \
    .option("url", jdbcUrl) \
    .option("query", "SELECT c1, c2 FROM t1") \
    .option("partitionColumn", "partition_id") \
    .option("lowerBound", "1") \
    .option("upperBound", "300") \
    .option("numPartitions", "3") \
    .load()

This would translate into three queries like these, executed in parallel on the underlying DB. Note that the real queries would differ slightly; these are simplified for illustration:

SELECT t.* FROM (SELECT c1, c2 FROM t1 WHERE partition_id BETWEEN 1 AND 100) t
SELECT t.* FROM (SELECT c1, c2 FROM t1 WHERE partition_id BETWEEN 101 AND 200) t
SELECT t.* FROM (SELECT c1, c2 FROM t1 WHERE partition_id BETWEEN 201 AND 300) t

So in your case Teradata is unhappy because Spark is executing something along the lines of:

SELECT t.* FROM (DROP TABLE <DB_NAME>.<TABLENAME>) t

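If you must stay inside the Spark session, one workaround (not part of the original answer) is to call the Teradata JDBC driver directly through Spark's Py4J gateway, since the jar is already on the cluster. A minimal sketch, reusing the placeholder URL and credentials from the question:

# Workaround sketch: run DDL through the Teradata JDBC driver via Py4J.
# Assumes the TeraJDBC jar is already attached to the cluster.
jdbc_url = "jdbc:teradata://<IP_ADDRESS>/DATABASE=<DB_NAME>,TMODE=ANSI,CHARSET=UTF8,LOGMECH=LDAP"

driver_manager = spark.sparkContext._jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url, user, password)
try:
    stmt = conn.createStatement()
    stmt.executeUpdate("DROP TABLE <DB_NAME>.<TABLENAME>")  # DDL; returns no rows
    stmt.close()
finally:
    conn.close()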

What you have is not a "pyspark jdbc connector to teradata"; it's the "Teradata JDBC driver".

To run Teradata-specific SQL on Teradata, you need to write Python code that uses a Teradata-specific driver. Here is an example:

import teradatasql

# Connect with the native Teradata Python driver and issue the DROP directly.
with teradatasql.connect(host="whomooz", user="guest", password="please") as con:
    with con.cursor() as cur:
        try:
            sRequest = "DROP TABLE <DB_NAME>.<TABLENAME>"
            print(sRequest)
            cur.execute(sRequest)
        except Exception as ex:
            # Keep only the first line of the (long) Teradata error message.
            print("Ignoring", str(ex).split("\n")[0])

If you want to run this code on a Databricks/Spark cluster, you'll have to add the relevant driver library to the cluster in question (the teradatasql package for the example above; the Teradata JDBC jar for the spark.read path), e.g. as a cluster library, and then run the code above on that cluster.

I assume you've already done this given the error you're getting.
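
For the teradatasql example specifically, the driver is a plain PyPI package, so in a Databricks notebook a notebook-scoped install is usually enough, e.g.:

# Notebook-scoped install of the native Teradata Python driver
%pip install teradatasql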
