pyspark splitting with delimiter incorrectly (within higher order)?

Question
My task is to convert this Spark Python UDF to PySpark native functions...
def parse_access_history_json_table(json_obj):
    '''
    extracts
    list of DBs, Schemas, Tables
    from json_obj field in access history
    '''
    db_list = set([])
    schema_list = set([])
    table_list = set([])
    try:
        for full_table_name in json_obj:
            full_table_name = full_table_name.lower()
            full_table_name_array = full_table_name.split(".")
            if len(full_table_name_array) == 3:
                db_list.add(full_table_name_array[0])
                schema_list.add(full_table_name_array[1])
                table_list.add(full_table_name_array[2])
            elif len(full_table_name_array) == 2:
                schema_list.add(full_table_name_array[0])
                table_list.add(full_table_name_array[1])
            else:
                table_list.add(full_table_name_array[0])
    except Exception as e:
        print(str(e))
    return (list(db_list), list(schema_list), list(table_list))
json_obj is an array-type column with an example value like [db1.s1.t1, s2.t2, t3].
Eventually I want 3 columns for db/schema/table, each as an array: dbs ['db1'], schemas ['s1','s2'] & tables ['t1','t2','t3'].
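For reference, calling the function directly as plain Python on that example shows the intended mapping (the order of elements within each list may vary, since the function collects into sets):

parse_access_history_json_table(['db1.s1.t1', 's2.t2', 't3'])
# -> (['db1'], ['s1', 's2'], ['t1', 't2', 't3'])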
I'm trying to convert this to a list of lists of x, y, z at first, but the splitting is going wrong.
(Not sure if there's a faster/better way that avoids udf/lambda, i.e. Python transformations.)
from pyspark.sql.types import *
from pyspark.sql.functions import expr

cSchema = StructType([StructField("WordList", ArrayType(StringType()))])
test_list = [[['database.schema.table']], [['t3', 'd1.s1.t1', 's2.t2']]]
df = spark.createDataFrame(test_list, schema=cSchema)
df.withColumn("Filtered_Col", expr("transform(WordList, x -> split(x, '\.'))")).show()
is giving
+-----------------+--------------------+
| WordList| Filtered_Col|
+-----------------+--------------------+
|[share_db.sc.tbl]|[[, , , , , , , ,...|
|[x a, d.s.t, d.s]|[[, , , ], [, , ,...|
+-----------------+--------------------+
Not sure why. I have Spark 2.4.
Answer 1
Score: 1
The problem is the regex, the second parameter of the split function.
The backslash has to be escaped too, so \\\. will lead to the correct result. The pattern goes through two rounds of unescaping: Python turns \\\. into \\., and Spark's SQL parser turns that into the regex \., a literal dot. With a single \. the regex that actually reaches split is a bare ., which matches every character and produces the empty strings seen above:
df.withColumn("Filtered_Col", expr("transform(WordList, x -> split(x, '\\\.'))")).show(truncate=False)
+-----------------------+------------------------------+
|WordList |Filtered_Col |
+-----------------------+------------------------------+
|[database.schema.table]|[[database, schema, table]] |
|[t3, d1.s1.t1, s2.t2] |[[t3], [d1, s1, t1], [s2, t2]]|
+-----------------------+------------------------------+
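To get from the split arrays to the three db/schema/table array columns the question ultimately asks for, the same Spark 2.4 higher-order functions (transform, filter) plus array_distinct can replicate the UDF's set logic without any Python UDF. A minimal sketch, not part of the original answer; the output column names dbs/schemas/tables are made up here:

from pyspark.sql.functions import expr

# SQL fragment: lower-case each name and split on a literal dot
parts = "transform(WordList, x -> split(lower(x), '\\\.'))"

result = (
    df
    # db only exists in 3-part names: take the first element
    .withColumn("dbs", expr("array_distinct(transform(filter(" + parts + ", a -> size(a) = 3), a -> a[0]))"))
    # schema is the second-to-last element of 2- and 3-part names
    .withColumn("schemas", expr("array_distinct(transform(filter(" + parts + ", a -> size(a) >= 2), a -> a[size(a) - 2]))"))
    # table is always the last element
    .withColumn("tables", expr("array_distinct(transform(" + parts + ", a -> a[size(a) - 1]))"))
)
result.show(truncate=False)

array_distinct plays the role of the UDF's sets. Writing the pattern as a Python raw string, e.g. r"split(lower(x), '\\.')", avoids the triple backslash.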
Comments