在运行Glue脚本时Pyspark代码存在问题。

huangapple go评论60阅读模式
英文:

Issue in Pyspark code when running Glue Script

问题

在填充 dnb_df 时出现的错误是因为查询字符串 var 包含了单引号('")作为字符串定界符,而这可能会导致 Oracle 数据库解析错误。在这种情况下,你可以尝试使用双引号(") 来定界查询字符串,因为 Oracle 通常使用双引号来定界对象标识符(表名、列名等)。下面是已修复的查询字符串:

var = "select DNB_RESULTS_DTL_ID, REQUEST_ID, source_id, dnb_level from (select max(request_id) over(partition by org_code) max_request_id, a.* from XXGMDMADM.MDM_DNB_RESULTS_DTL a where dnb_level = 'LVL1' and request_id in (131)) where MAX_REQUEST_ID = request_id"

使用双引号定界查询字符串应该能够解决这个问题。如果你使用单引号,Oracle 可能会将其中的内容解释为字符串,而不是标识符,从而导致语法错误。

英文:



import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import regexp_replace, col


args = getResolvedOptions(sys.argv, ['JOB_NAME'])





sc = SparkContext()
#sc.setLogLevel('DEBUG')
glueContext = GlueContext(sc)
spark = glueContext.spark_session

#logger = glueContext.get_logger()
#logger.DEBUG('Hello Glue')
job = Job(glueContext)
job.init(args["JOB_NAME"], args)



# ####connect to database
client = boto3.client('glue', region_name='XXXXXX')
response = client.get_connection(Name='XXXXXX')
connection_properties = response['Connection']['ConnectionProperties']
URL = connection_properties['JDBC_CONNECTION_URL']
url_list = URL.split("/")
host = "{}".format(url_list[-2][:-5])
new_host=host.split('@',1)[1]
port = url_list[-2][-4:]
database = "{}".format(url_list[-1])
Oracle_Username = "{}".format(connection_properties['USERNAME'])
Oracle_Password = "{}".format(connection_properties['PASSWORD'])

#print("Oracle_Username:",Oracle_Username)
#print("Oracle_Password:",Oracle_Password)
print("Host:",host)
print("New Host:",new_host)
print("Port:",port)
print("Database:",database)
Oracle_jdbc_url="jdbc:oracle:thin:@//"+new_host+":"+port+"/"+database
print("Oracle_jdbc_url:",Oracle_jdbc_url)
source_df = spark.read.format("jdbc").option("url", Oracle_jdbc_url).option("dbtable", "xxgmdmadm.vendor_data").option("user", Oracle_Username).option("password", Oracle_Password).load()
#store the value in array 
qrys = source_df.select("SRC_QUERY").collect()
var='select DNB_RESULTS_DTL_ID,REQUEST_ID,source_id,dnb_level   from (select  max(request_id) over(partition by org_code ) max_request_id,a.* from XXGMDMADM.MDM_DNB_RESULTS_DTL a   where  dnb_level =''LVL1'' and request_id in (131) ) where MAX_REQUEST_ID=request_id'
dnb_df = spark.read.format("jdbc").option("url", Oracle_jdbc_url).option("query", var).option("user", Oracle_Username).option("password", Oracle_Password).load()

error I am getting while populating dnb_df

> An error occurred while calling o111.load. ORA-00911: invalid
> character

Not getting what is wrong with this part

> var='select DNB_RESULTS_DTL_ID,REQUEST_ID,source_id,dnb_level from
> (select max(request_id) over(partition by org_code )
> max_request_id,a.* from XXGMDMADM.MDM_DNB_RESULTS_DTL a where
> dnb_level =''LVL1'' and request_id in (131) ) where
> MAX_REQUEST_ID=request_id'

Even when I am running simple query like getting same error

> var="select DNB_RESULTS_DTL_ID,REQUEST_ID,source_id,dnb_level from
> XXGMDMADM.MDM_DNB_RESULTS_DTL"

Does query option doesn't work for Oracle ?

答案1

得分: 1

你可以在提出查询时使用别名,效果很好:

.option("dbtable", "(SELECT * FROM schema.tablename) tbl")
英文:

You can use alias while giving a query, works well:

.option("dbtable", "(SELECT * FROM schema.tablename) tbl")

huangapple
  • 本文由 发表于 2023年2月18日 04:33:10
  • 转载请务必保留本文链接:https://go.coder-hub.com/75489027.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定