Efficient way to replace values of multiple columns based on a dictionary map using pyspark

Question

I need to replace values of multiple columns (100s-1000s of columns) of a large parquet file. I am using pyspark.

I have a working implementation using replace that works with a smaller number of columns, but when the number of columns is on the order of hundreds it takes a long time even to generate the Spark plan, from what I can see (> 3-4 s per column). So I am looking for a faster implementation.

```python
value_label_map = {"col1": {"val1": "new_val1"}, "col2": {"val2": "new_val2"}}
for k, v in value_label_map.items():
    print(f"replacing {k}")
    columns_to_replace.append(k)
    df = df.replace(to_replace=v, subset=k)
```
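
As a point of reference, the per-row semantics of this loop can be sketched in plain Python (no Spark required); `replace_row` and the sample row below are illustrative names, not part of the original code:

```python
value_label_map = {"col1": {"val1": "new_val1"}, "col2": {"val2": "new_val2"}}

def replace_row(row, value_label_map):
    """Apply each column's mapping to that column's value; values (and
    columns) without a mapping are left unchanged."""
    return {col: value_label_map.get(col, {}).get(val, val)
            for col, val in row.items()}

print(replace_row({"col1": "val1", "col2": "other"}, value_label_map))
# {'col1': 'new_val1', 'col2': 'other'}
```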

I tried an alternate approach, but I couldn't find a way to access the value of a pyspark Column object in order to look up the dict.

Alternate implementation:

```python
from pyspark.sql.functions import when

def replace_values(col, value_map):
    if value_map:
        return when(col.isin(list(value_map.keys())), value_label_map[col]).otherwise(col)
    else:
        return col

df = spark.read.parquet("some-path")
updated_cols = [replace_values(df[col_name], value_labels.get(col_name)).alias(col_name)
                for col_name in df_values_renamed.columns]
```

The problem with this is that I can't look up value_labels using a column object.
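
One common workaround for exactly this problem is to build the lookup as a Spark map literal with `F.create_map`, so the per-value lookup happens inside the expression instead of in Python. The sketch below is hedged: only the key/value flattening is executed here, and the pyspark usage is shown in comments, assuming the same per-column `value_map` shape as above.

```python
from itertools import chain

def map_literal_args(value_map):
    """Flatten {'val1': 'new_val1'} into ['val1', 'new_val1', ...],
    the (key, value, key, value, ...) order F.create_map expects."""
    return list(chain.from_iterable(value_map.items()))

print(map_literal_args({"val1": "new_val1"}))
# ['val1', 'new_val1']

# Sketch of the pyspark side (not executed here):
# from pyspark.sql import functions as F
# mapping_expr = F.create_map(*[F.lit(x) for x in map_literal_args(value_map)])
# F.coalesce(mapping_expr[F.col(col_name)], F.col(col_name)).alias(col_name)
```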


Answer 1 (score: 1)

You could try packing everything into one select. Since replace is based on when statements, let's use them directly:

```python
from pyspark.sql import functions as F

def replace_from_dict(col_name, dict):
    """For each (k, v) item in dict, replace value k in col_name with value v."""
    res = None
    for k, v in dict.items():
        if res is None:
            res = F.when(F.col(col_name) == k, F.lit(v))
        else:
            res = res.when(F.col(col_name) == k, F.lit(v))
    return res.otherwise(F.col(col_name)).alias(col_name)

def replace_or_not(col_name):
    """Generate a column replacement if needed, keeping the column otherwise."""
    if col_name in value_label_map:
        return replace_from_dict(col_name, value_label_map[col_name])
    else:
        return col_name

result = df.select(*[replace_or_not(c) for c in df.columns])
```
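
To see what the chained when built by replace_from_dict evaluates to per value, here is a pure-Python analogue (`replace_one` is an illustrative name, not part of the answer): the first matching key wins, and the otherwise branch keeps the original value.

```python
def replace_one(value, mapping):
    """Pure-Python analogue of the chained F.when(...): return the mapped
    value for the first matching key, else the original value."""
    for k, v in mapping.items():
        if value == k:
            return v
    return value

print(replace_one("val1", {"val1": "new_val1"}))   # new_val1
print(replace_one("other", {"val1": "new_val1"}))  # other
```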

huangapple
  • Posted on 2023-03-31 23:10:43
  • Please keep this link when republishing: https://go.coder-hub.com/75900088.html