April 13, 2023, 20:10:14

How can values in a Spark array column be efficiently replaced with values from a Pandas data frame?

Question

I have a Spark data frame that contains a column of arrays with product ids from sold baskets.

import pandas as pd
import pyspark.sql.types as T
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a notebook or the pyspark shell `spark` already exists; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

df_baskets = spark.createDataFrame(
    [(1, ["546", "689", "946"]), (2, ["546", "799"])],
    ("case_id", "basket")
)

df_baskets.show()

#+-------+---------------+
#|case_id|         basket|
#+-------+---------------+
#|      1|[546, 689, 946]|
#|      2|     [546, 799]|
#+-------+---------------+

I would like to replace the product ids in each array with new ids given in a pandas data frame.


product_data = pd.DataFrame({
  "product_id": ["546", "689", "946", "799"],
  "new_product_id": ["S12", "S74", "S34", "S56"]
  })

product_data

I was able to replace the values by applying a simple Python function to the column; the function performs a lookup on the pandas data frame.


def get_new_id(product_id: str) -> str:
  try:
    row = product_data[product_data["product_id"] == product_id]
    return row["new_product_id"].item()
  except ValueError:
    return product_id

apply_get = F.udf(lambda basket: [get_new_id(product) for product in basket], T.ArrayType(T.StringType()))

df_baskets = (
  df_baskets
    .withColumn('basket_renamed', apply_get(F.col('basket')))
)

df_baskets.show()

#+-------+---------------+---------------+
#|case_id|         basket| basket_renamed|
#+-------+---------------+---------------+
#|      1|[546, 689, 946]|[S12, S74, S34]|
#|      2|     [546, 799]|     [S12, S56]|
#+-------+---------------+---------------+

However, this approach has proven to be quite slow on data frames containing several tens of millions of cases. Is there a more efficient way to do this replacement (e.g. by using a different data structure than a pandas data frame, or a different method)?

Answer 1

Score: 2

You could explode your original data and join it with product_data (after converting product_data to a Spark data frame):

(
    df_baskets
    .withColumn("basket", F.explode(F.col("basket")))
    .join(
        spark.createDataFrame(product_data)
        .withColumnRenamed("product_id", "basket")
        .withColumnRenamed("new_product_id", "basket_renamed"),
        on="basket"
    )
    .groupby("case_id")
    .agg(
        F.collect_list(F.col("basket")).alias("basket"),
        F.collect_list(F.col("basket_renamed")).alias("basket_renamed")
    )
).show()

Output:

+-------+---------------+---------------+
|case_id|         basket| basket_renamed|
+-------+---------------+---------------+
|      1|[546, 689, 946]|[S12, S74, S34]|
|      2|     [546, 799]|     [S12, S56]|
+-------+---------------+---------------+

Answer 2

Score: 2

You could use the RDD API with map.

Convert the rows of the pandas data frame to a dict of {old: new} values, then use map on the RDD to look up the new_product_id for each element.

Here's an example:

# Convert the pandas data frame from the question to an {old: new} dict
# (can be done in other ways as well; .loc[i, ...] assumes the default 0..n-1 index)
old_new_id_dict = {}

for i in range(len(product_data)):
    old_new_id_dict[product_data.loc[i, 'product_id']] = product_data.loc[i, 'new_product_id']

# {'546': 'S12', '689': 'S74', '946': 'S34', '799': 'S56'}

# Broadcast the dict so that each executor receives a single read-only copy
old_new_id_dict_bc = spark.sparkContext.broadcast(old_new_id_dict)

# `map` the values
df_baskets.rdd. \
    map(lambda r: (r.case_id, r.basket, [old_new_id_dict_bc.value[k] for k in r.basket])). \
    toDF(['case_id', 'basket', 'new_basket']). \
    show()

# +-------+---------------+---------------+
# |case_id|         basket|     new_basket|
# +-------+---------------+---------------+
# |      1|[546, 689, 946]|[S12, S74, S34]|
# |      2|     [546, 799]|     [S12, S56]|
# +-------+---------------+---------------+
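
The direct lookup `old_new_id_dict_bc.value[k]` raises a KeyError for any product id that is missing from the dict, whereas the UDF in the question keeps the old id in that case. A minimal sketch of a lookup with that fallback follows, together with a one-pass way to build the dict; the helper name `rename_basket` and the unknown id "000" are made up for illustration.

```python
import pandas as pd

product_data = pd.DataFrame({
    "product_id": ["546", "689", "946", "799"],
    "new_product_id": ["S12", "S74", "S34", "S56"],
})

# Build the {old: new} lookup in one pass instead of a row-by-row loop.
old_new_id_dict = dict(zip(product_data["product_id"], product_data["new_product_id"]))


def rename_basket(basket, mapping):
    """Replace each id by its mapped value; ids without a mapping are kept."""
    return [mapping.get(product_id, product_id) for product_id in basket]


print(rename_basket(["546", "799", "000"], old_new_id_dict))
# ['S12', 'S56', '000']
```

Inside the RDD map, the list comprehension could then be replaced by `rename_basket(r.basket, old_new_id_dict_bc.value)`.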
