pyspark splitting with delimiter incorrectly (within higher order)?


Question


My task is to convert this Spark Python UDF to PySpark native functions...

def parse_access_history_json_table(json_obj):
    '''
    extracts
    list of DBs, Schemas, Tables
    from json_obj field in access history
    '''
    db_list = set([])
    schema_list = set([])
    table_list = set([])
    try:
        for full_table_name in json_obj:
            full_table_name = full_table_name.lower()
            full_table_name_array = full_table_name.split(".")
            if len(full_table_name_array) == 3:
                db_list.add(full_table_name_array[0])
                schema_list.add(full_table_name_array[1])
                table_list.add(full_table_name_array[2])
            elif len(full_table_name_array) == 2:
                schema_list.add(full_table_name_array[0])
                table_list.add(full_table_name_array[1])
            else:
                table_list.add(full_table_name_array[0])
    except Exception as e:
        print(str(e))
    return (list(db_list), list(schema_list), list(table_list))
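
For context, here is a minimal sketch of how a UDF like this would typically be registered and applied (the json_obj column name comes from the question; the result field names and the wiring below are assumptions for illustration):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# The function returns a tuple of three lists, which maps onto a struct of three string arrays.
result_schema = StructType([
    StructField("dbs", ArrayType(StringType())),
    StructField("schemas", ArrayType(StringType())),
    StructField("tables", ArrayType(StringType())),
])

parse_udf = udf(parse_access_history_json_table, result_schema)

# Hypothetical usage on a DataFrame with an array<string> column called json_obj.
df_parsed = (df.withColumn("parsed", parse_udf(col("json_obj")))
               .select("*", "parsed.dbs", "parsed.schemas", "parsed.tables"))

A tuple return maps cleanly onto a StructType, which is why the three lists come back as a single struct column here; this is the Python-UDF baseline the question wants to replace with built-in functions.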

json_obj is an array-type column with an example value like [db1.s1.t1, s2.t2, t3].
Eventually I want 3 columns for db/schema/table, each holding an array, e.g. dbs ['db1'], schemas ['s1','s2'] and tables ['t1','t2','t3'].

I'm trying to convert this to a list of lists of x, y, z at first, but the splitting is going wrong.
(Not sure if there's a faster/better way without using a UDF/lambda, i.e. without Python transformations.)

from pyspark.sql.types import *
from pyspark.sql.functions import expr

cSchema = StructType([StructField("WordList", ArrayType(StringType()))])

test_list = [['database.schema.table']], [['t3', 'd1.s1.t1', 's2.t2']]

df = spark.createDataFrame(test_list, schema=cSchema)

# the '\.' pattern inside the lambda is where the splitting goes wrong
df.withColumn("Filtered_Col", expr("transform(WordList, x -> split(x, '\.'))")).show()

is giving

+-----------------+--------------------+
|         WordList|        Filtered_Col|
+-----------------+--------------------+
|[share_db.sc.tbl]|[[, , , , , , , ,...|
|[x a, d.s.t, d.s]|[[, , , ], [, , ,...|
+-----------------+--------------------+

Not sure why. I have Spark 2.4.

Answer 1

Score: 1


The problem is the regex, the second parameter of the split function. The pattern passes through two layers of string literals before it reaches split (the Python string and the Spark SQL string inside expr), so a single backslash is consumed along the way and split ends up with just ., which matches every character and produces the empty strings shown above.

The backslash has to be escaped too, so '\\\.' will lead to the correct result:

df.withColumn("Filtered_Col", expr("transform(WordList,x -> split(x,'\\\.')  )")).show(truncate=False)
+-----------------------+------------------------------+
|WordList               |Filtered_Col                  |
+-----------------------+------------------------------+
|[database.schema.table]|[[database, schema, table]]   |
|[t3, d1.s1.t1, s2.t2]  |[[t3], [d1, s1, t1], [s2, t2]]|
+-----------------------+------------------------------+
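
Going one step further toward the three db/schema/table array columns that the question ultimately wants, a possible continuation using only built-in functions could look like the sketch below (not part of the original answer; it reuses the WordList test column, the escaped pattern from above, and Spark 2.4 built-ins such as transform, filter and array_distinct):

from pyspark.sql.functions import expr

# Split each qualified name into its segments, lower-cased like the original UDF does.
df_parts = df.withColumn("parts", expr("transform(WordList, x -> split(lower(x), '\\\.'))"))

df_result = (df_parts
    # a db name is only present when there are 3 segments
    .withColumn("dbs", expr("array_distinct(transform(filter(parts, p -> size(p) = 3), p -> p[0]))"))
    # the schema is the second-to-last segment when there are at least 2 segments
    .withColumn("schemas", expr("array_distinct(transform(filter(parts, p -> size(p) >= 2), p -> p[size(p) - 2]))"))
    # the table is always the last segment
    .withColumn("tables", expr("array_distinct(transform(parts, p -> p[size(p) - 1]))")))

df_result.select("WordList", "dbs", "schemas", "tables").show(truncate=False)

The lower-casing mirrors the .lower() call in the original UDF, the filter conditions reproduce its 3-part / 2-part / 1-part branching, and array_distinct stands in for the Python sets.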
