PySpark: Create new rows (explode) based on a number in a column and multiple conditions

Question

I have a dataframe with a few columns: a unique ID, a month, and a split. I need to explode the dataframe and create new rows for each unique combination of id, month, and split. The number of rows to create has already been calculated and is stored in the column bad_call_dist. For example, if ID is 12345, month is Jan, split is 'A', and bad_call_dist is 6, I need a total of 6 rows. This process must repeat for each unique combination.

I have code that works for small datasets; however, I need to apply it to a much, much larger dataframe, and it times out each time. The code below extracts a single-row dataframe from the original data with a temporary range column representing how many rows must exist for a unique column combination. I then use explode() to generate the new rows and union that into a master dataframe. I'm looking for assistance to optimize the code and speed up processing times while producing the same outcome:

# unique ID-month-split combinations for the data
idMonthSplits = call_data.select('id', 'month', 'split').distinct().collect()

# set the schema to all cols except the bad call flag, which is set to 1 in the loop
master_explode = spark.createDataFrame([], schema=call_data.select([col for col in call_data.columns if col != 'bad_call_flag']).schema)

# loop
for ims in idMonthSplits:
    id = ims['id']
    month = ims['month']
    split = ims['split']

    # explode the df one row per n, where n is the value in bad_call_dist.
    explode_df = exploded.filter((exploded['id'] == id) & (exploded['month'] == month) & (exploded['split'] == split))\
        .withColumn('bad_call_flag', F.lit(1))

    try:
        # extract the value that represents the number of rows to explode
        expVal = explode_df.select(F.first(F.col("bad_call_dist")).cast("int")).first()[0]

        # range that is used by explode() to convert single row to multiple rows
        explode_df = explode_df.withColumn(
            'range',
            F.array(
                [F.lit(i) for i in range(expVal + 1)]
            )
        )

        # explode the df, then drop cols no longer needed for union
        explode_df = explode_df.withColumn('explode', F.explode(F.col('range')))\
            .drop(*['explode', 'range', 'bad_call_dist'])

        # union to master df
        master_explode = master_explode.unionAll(explode_df)

    # if the explode value is 0, no need to expand rows. This triggers to avoid an error.
    except:
        continue

Answer 1

Score: 1

Loops are almost always disastrous in Spark. It is better to use Spark functions as much as you can, since they can be optimized internally, and your situation can be solved using array_repeat() within expr().

Here's an example:

# given the following data
# +---+-----+-----+-------------+
# | id|month|split|bad_call_dist|
# +---+-----+-----+-------------+
# |  1|  Jan|    A|            6|
# |  1|  Feb|    A|            8|
# +---+-----+-----+-------------+

# create a dummy array using `array_repeat`, then explode it
data_sdf. \
    withColumn('dummy_arr', func.expr('array_repeat(1, cast(bad_call_dist as int))')). \
    selectExpr(*data_sdf.columns, 'explode(dummy_arr) as exp_dummy'). \
    show()

# +---+-----+-----+-------------+---------+
# |id |month|split|bad_call_dist|exp_dummy|
# +---+-----+-----+-------------+---------+
# |1  |Jan  |A    |6            |1        |
# |1  |Jan  |A    |6            |1        |
# |1  |Jan  |A    |6            |1        |
# |1  |Jan  |A    |6            |1        |
# |1  |Jan  |A    |6            |1        |
# |1  |Jan  |A    |6            |1        |
# |1  |Feb  |A    |8            |1        |
# |1  |Feb  |A    |8            |1        |
# |1  |Feb  |A    |8            |1        |
# |1  |Feb  |A    |8            |1        |
# |1  |Feb  |A    |8            |1        |
# |1  |Feb  |A    |8            |1        |
# |1  |Feb  |A    |8            |1        |
# |1  |Feb  |A    |8            |1        |
# +---+-----+-----+-------------+---------+

Notice that I used array_repeat within expr. That's because you want the repeat count to come from a column, and the Spark native function does not accept a column as the second parameter, but the SQL function does.

data_sdf. \
    withColumn('dummy_arr', func.expr('array_repeat(1, cast(bad_call_dist as int))')). \
    show(truncate=False)

# +---+-----+-----+-------------+------------------------+
# |id |month|split|bad_call_dist|dummy_arr               |
# +---+-----+-----+-------------+------------------------+
# |1  |Jan  |A    |6            |[1, 1, 1, 1, 1, 1]      |
# |1  |Feb  |A    |8            |[1, 1, 1, 1, 1, 1, 1, 1]|
# +---+-----+-----+-------------+------------------------+
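
For reference, below is a minimal, self-contained sketch of how this array_repeat/explode approach could replace the original loop end to end, including setting bad_call_flag to 1 and dropping the helper columns as the loop did. The column names (id, month, split, bad_call_dist, bad_call_flag) come from the question; the sample rows and the helper column names dummy_arr/exp_dummy are only illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data mirroring the columns described in the question
call_data = spark.createDataFrame(
    [(12345, 'Jan', 'A', 6), (12345, 'Feb', 'A', 2)],
    ['id', 'month', 'split', 'bad_call_dist']
)

master_explode = call_data \
    .withColumn('dummy_arr', F.expr('array_repeat(1, cast(bad_call_dist as int))')) \
    .withColumn('exp_dummy', F.explode('dummy_arr')) \
    .withColumn('bad_call_flag', F.lit(1)) \
    .drop('dummy_arr', 'exp_dummy', 'bad_call_dist')

master_explode.show()
# id 12345/Jan/A appears 6 times and 12345/Feb/A appears 2 times,
# each generated row carrying bad_call_flag = 1

In this sketch, rows whose bad_call_dist is null or zero simply produce no output rows, because explode() drops empty and null arrays.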

huangapple
  • Published on 2023-06-26 22:22:57
  • Source: https://go.coder-hub.com/76557596.html