Grouping sequential values across two columns in PySpark


Question

I have the following df:

index      initial_range     final_range
1            1000000              5999999
2            6000000              6299999
3            6300000              6399999
4            6400000              6499999
5            6600000              6699999
6            6700000              6749999
7            6750000              6799999
8            7000000              7399999
9            7600000              7699999
10           7700000              7749999
11           7750000              7799999
12           6500000              6549999

As you can see, the 'initial_range' and 'final_range' fields define coverage intervals. Comparing index 1 with index 2, we can observe that where the 'final_range' value of index 1 ends, the 'initial_range' value of index 2 picks up at exactly the next number (value + 1): index 1 ends at 5999999 and index 2 starts at 6000000. I need to group these cases and return the following df:

index      initial_range     final_range       grouping
1            1000000              5999999      1000000-6549999
2            6000000              6299999      1000000-6549999
3            6300000              6399999      1000000-6549999
4            6400000              6499999      1000000-6549999
5            6600000              6699999      6600000-6799999
6            6700000              6749999      6600000-6799999
7            6750000              6799999      6600000-6799999
8            7000000              7399999      7000000-7399999
9            7600000              7699999      7600000-7799999
10           7700000              7749999      7600000-7799999
11           7750000              7799999      7600000-7799999
12           6500000              6549999      1000000-6549999

Note that the 'grouping' field contains new intervals: each one runs from the minimum (initial) value to the maximum (final) value of a run, up to the point where the sequence is broken.

Some details:

  • Between index 4 and index 5, the value+1 sequence is broken, so a new 'grouping' starts. In other words, every time the sequence is broken, a new group has to be written.
  • At index 12, the grouping 1000000-6549999 appears again, because 6500000 is the number immediately after 6499999 at index 4.
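
For reference, here is a minimal sketch of how this sample DataFrame could be built (this setup is my assumption, not part of the original question; it assumes an active SparkSession named `spark`, and names the id column `ID` to match the answer code further down):

```python
# Hypothetical setup: recreate the sample data from the question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [
    (1, 1000000, 5999999), (2, 6000000, 6299999), (3, 6300000, 6399999),
    (4, 6400000, 6499999), (5, 6600000, 6699999), (6, 6700000, 6749999),
    (7, 6750000, 6799999), (8, 7000000, 7399999), (9, 7600000, 7699999),
    (10, 7700000, 7749999), (11, 7750000, 7799999), (12, 6500000, 6549999),
]
df = spark.createDataFrame(rows, ["ID", "initial_range", "final_range"])
```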

I tried this code:

```python
comparison = df == df.shift() + 1
df['grouping'] = comparison['initial_range'] & comparison['final_range']
```

But this logic didn't work.

Can anyone help me?


Answer 1

Score: 1


Well, this was a tough one; here is my answer.

First of all, I am using a UDF, so expect the performance to be a little worse.

```python
import copy
import pyspark.sql.functions as F
from pyspark.sql.types import *

rn = 0

def check_vals(x, y):
    global rn

    if (y is not None) and (int(x) + 1) == int(y):
        # The next row continues the sequence: keep the current group id.
        return rn + 1
    else:
        # Snapshot the current counter value (for an int, a plain
        # assignment would also do).
        res = copy.copy(rn)
        # Increment so that the next run starts with a new group id.
        rn += 1
        # Return the old id, since this row still closes the current group.
        return res + 1

rn_udf = F.udf(lambda x, y: check_vals(x, y), IntegerType())
```
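
To see how the counter behaves, here is a quick plain-Python trace (my own illustration, with values taken from the sample data; run it in a fresh session, outside Spark):

```python
# Each call receives (final_range, next row's initial_range) in sorted order.
print(check_vals(5999999, 6000000))  # 1 -> 6000000 continues the run
print(check_vals(6549999, 6600000))  # 1 -> the run ends here; rn is bumped to 1
print(check_vals(6699999, 6700000))  # 2 -> a new run begins with group id 2

rn = 0  # reset the global counter before actually running the Spark job
```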

Next,

```python
from pyspark.sql.window import Window

# We want to check the final_range values against the next initial_range.
w = Window().orderBy(F.col('initial_range'))

# First, put the next row's initial_range in a column called nextRange so that
# we can compare. Check whether final_range + 1 == nextRange: if yes, use the
# current rn value; if not, use rn and increment it for the next iteration.
# Then find the min and max values in the partition created by the check_1
# column, concatenate them, order by ID to restore the initial ordering (the
# cast to integer may not be needed in your case), and drop all intermediate
# columns.
df.withColumn('nextRange', F.lead('initial_range').over(w)) \
    .withColumn('check_1', rn_udf("final_range", "nextRange")) \
    .withColumn('min_val', F.min("initial_range").over(Window.partitionBy("check_1"))) \
    .withColumn('max_val', F.max("final_range").over(Window.partitionBy("check_1"))) \
    .withColumn('range', F.concat("min_val", F.lit("-"), "max_val")) \
    .orderBy(F.col("ID").cast(IntegerType())) \
    .drop("nextRange", "check_1", "min_val", "max_val") \
    .show(truncate=False)
```

Output:

```
+---+-------------+-----------+---------------+
|ID |initial_range|final_range|range          |
+---+-------------+-----------+---------------+
|1  |1000000      |5999999    |1000000-6549999|
|2  |6000000      |6299999    |1000000-6549999|
|3  |6300000      |6399999    |1000000-6549999|
|4  |6400000      |6499999    |1000000-6549999|
|5  |6600000      |6699999    |6600000-6799999|
|6  |6700000      |6749999    |6600000-6799999|
|7  |6750000      |6799999    |6600000-6799999|
|8  |7000000      |7399999    |7000000-7399999|
|9  |7600000      |7699999    |7600000-7799999|
|10 |7700000      |7749999    |7600000-7799999|
|11 |7750000      |7799999    |7600000-7799999|
|12 |6500000      |6549999    |1000000-6549999|
+---+-------------+-----------+---------------+
```
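
As a side note: a UDF with a global counter assumes the ordered window is evaluated in a single pass, which is hard to guarantee across executors. The same grouping can be expressed with window functions alone; the sketch below is my own UDF-free variant using the classic gaps-and-islands trick, not part of the original answer:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Flag each row that does NOT continue the previous row's range, then take a
# running sum of the flags: every break starts a new group id.
w = Window.orderBy("initial_range")

result = (
    df.withColumn("prev_final", F.lag("final_range").over(w))
      .withColumn(
          "is_break",
          F.when(F.col("initial_range") == F.col("prev_final") + 1, 0).otherwise(1),
      )
      .withColumn("group_id", F.sum("is_break").over(w))
      .withColumn("min_val", F.min("initial_range").over(Window.partitionBy("group_id")))
      .withColumn("max_val", F.max("final_range").over(Window.partitionBy("group_id")))
      .withColumn("grouping", F.concat("min_val", F.lit("-"), "max_val"))
      .drop("prev_final", "is_break", "group_id", "min_val", "max_val")
)
result.show(truncate=False)
```

Note that an unpartitioned ordered window still moves all rows to one partition, so this variant has the same scalability limit as the UDF, but it avoids mutable global state.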



