pyspark splitting with delimiter incorrectly (within higher order)?


Question


My task is to convert this Spark Python UDF to PySpark native functions...

def parse_access_history_json_table(json_obj):
    '''
    extracts
    list of DBs, Schemas, Tables
    from json_obj field in access history
    '''
    db_list = set([])
    schema_list = set([])
    table_list = set([])
    try:
        for full_table_name in json_obj:
            full_table_name = full_table_name.lower()
            full_table_name_array = full_table_name.split(".")
            if len(full_table_name_array) == 3:
                db_list.add(full_table_name_array[0])
                schema_list.add(full_table_name_array[1])
                table_list.add(full_table_name_array[2])
            elif len(full_table_name_array) == 2:
                schema_list.add(full_table_name_array[0])
                table_list.add(full_table_name_array[1])
            else:
                table_list.add(full_table_name_array[0])
    except Exception as e:
        print(str(e))
    return (list(db_list), list(schema_list), list(table_list))
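
For context, here is a minimal sketch of how a UDF like this would typically be registered and applied (the json_obj column name comes from the question; the result field names and the wiring below are assumptions for illustration):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# The function returns a tuple of three lists, which maps onto a struct of three string arrays.
result_schema = StructType([
    StructField("dbs", ArrayType(StringType())),
    StructField("schemas", ArrayType(StringType())),
    StructField("tables", ArrayType(StringType())),
])

parse_udf = udf(parse_access_history_json_table, result_schema)

# Hypothetical usage on a DataFrame with an array<string> column called json_obj.
df_parsed = (df.withColumn("parsed", parse_udf(col("json_obj")))
               .select("*", "parsed.dbs", "parsed.schemas", "parsed.tables"))

A tuple return maps cleanly onto a StructType, which is why the three lists come back as a single struct column here; this is the Python-UDF baseline the question wants to replace with built-in functions.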

json_obj is an array-type column with an example value like [db1.s1.t1, s2.t2, t3].
Eventually I want 3 columns for db/schema/table, each holding an array, e.g. dbs ['db1'], schemas ['s1','s2'] and tables ['t1','t2','t3'].

I'm trying to convert this to a list of lists of x, y, z at first, but the splitting is going wrong.
(Not sure if there's a faster/better way without using a UDF/lambda, i.e. without Python transformations.)

from pyspark.sql.types import *
from pyspark.sql.functions import expr

cSchema = StructType([StructField("WordList", ArrayType(StringType()))])

test_list = [['database.schema.table']], [['t3', 'd1.s1.t1', 's2.t2']]

df = spark.createDataFrame(test_list, schema=cSchema)

# the '\.' pattern inside the lambda is where the splitting goes wrong
df.withColumn("Filtered_Col", expr("transform(WordList, x -> split(x, '\.'))")).show()

is giving

+-----------------+--------------------+
|         WordList|        Filtered_Col|
+-----------------+--------------------+
|[share_db.sc.tbl]|[[, , , , , , , ,...|
|[x a, d.s.t, d.s]|[[, , , ], [, , ,...|
+-----------------+--------------------+

Not sure why. I have Spark 2.4.

Answer 1

Score: 1


The problem is the regex, the second parameter of the split function. The pattern passes through two layers of string literals before it reaches split (the Python string and the Spark SQL string inside expr), so a single backslash is consumed along the way and split ends up with just ., which matches every character and produces the empty strings shown above.

The backslash has to be escaped too, so '\\\.' will lead to the correct result:

df.withColumn("Filtered_Col", expr("transform(WordList,x -> split(x,'\\\.')  )")).show(truncate=False)
+-----------------------+------------------------------+
|WordList               |Filtered_Col                  |
+-----------------------+------------------------------+
|[database.schema.table]|[[database, schema, table]]   |
|[t3, d1.s1.t1, s2.t2]  |[[t3], [d1, s1, t1], [s2, t2]]|
+-----------------------+------------------------------+
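
Going one step further toward the three db/schema/table array columns that the question ultimately wants, a possible continuation using only built-in functions could look like the sketch below (not part of the original answer; it reuses the WordList test column, the escaped pattern from above, and Spark 2.4 built-ins such as transform, filter and array_distinct):

from pyspark.sql.functions import expr

# Split each qualified name into its segments, lower-cased like the original UDF does.
df_parts = df.withColumn("parts", expr("transform(WordList, x -> split(lower(x), '\\\.'))"))

df_result = (df_parts
    # a db name is only present when there are 3 segments
    .withColumn("dbs", expr("array_distinct(transform(filter(parts, p -> size(p) = 3), p -> p[0]))"))
    # the schema is the second-to-last segment when there are at least 2 segments
    .withColumn("schemas", expr("array_distinct(transform(filter(parts, p -> size(p) >= 2), p -> p[size(p) - 2]))"))
    # the table is always the last segment
    .withColumn("tables", expr("array_distinct(transform(parts, p -> p[size(p) - 1]))")))

df_result.select("WordList", "dbs", "schemas", "tables").show(truncate=False)

The lower-casing mirrors the .lower() call in the original UDF, the filter conditions reproduce its 3-part / 2-part / 1-part branching, and array_distinct stands in for the Python sets.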
