Posted: 2023-03-08 15:13:55

Perform multiple column lookups based on ID

Question

I have a PySpark DataFrame with several columns:

+-------------+----------+------+
|           id|       num|   cat|
+-------------+----------+------+
|        00111|     50012|     a|
|        00111|     10131|     a|
|        00111|     11001|     b| 
|        10131|     71010|     a|
|        10131|     60010|     c|
|        11001|     53420|     z|
|        11001|     20011|     a|
|        11001|     00000|     q|
|        13403|     33001|     a|
|        13403|     10023|     a|
|        50012|     00111|     a|
+-------------+----------+------+

I would like to do the following:

For each unique entry in id, select all values in num.
Look up the rows whose id matches any of the selected num values.
From those rows, take the cat values and count how many times each distinct value occurs.

For example:

id == 00111 and num = [50012, 10131, 11001]

Select the rows where id is one of these num values:

+-------------+----------+------+
|           id|       num|   cat|
+-------------+----------+------+
|        10131|     71010|     a|
|        10131|     60010|     c|
|        11001|     53420|     z|
|        11001|     20011|     a|
|        11001|     00000|     q|
|        50012|     00111|     a|
+-------------+----------+------+

Count the number of times cat = "a", which is 3 in this case.
Repeat for each unique value in id.
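
As a cross-check of the steps above, here is a minimal plain-Python sketch (not PySpark) run on the sample rows. The helper name `count_cat_per_id` and the choice to store every value as a string (so that leading zeros survive) are assumptions made for illustration only.

```python
from collections import defaultdict

# Sample rows from the question as (id, num, cat). Values are kept as
# strings so that leading zeros such as "00111" are preserved.
rows = [
    ("00111", "50012", "a"), ("00111", "10131", "a"), ("00111", "11001", "b"),
    ("10131", "71010", "a"), ("10131", "60010", "c"),
    ("11001", "53420", "z"), ("11001", "20011", "a"), ("11001", "00000", "q"),
    ("13403", "33001", "a"), ("13403", "10023", "a"),
    ("50012", "00111", "a"),
]


def count_cat_per_id(rows, target="a"):
    """For each id, count the rows whose id equals one of that id's num
    values and whose cat equals `target`."""
    cats_by_id = defaultdict(list)
    nums_by_id = defaultdict(list)
    for id_, num, cat in rows:
        cats_by_id[id_].append(cat)
        nums_by_id[id_].append(num)
    return {
        id_: sum(cats_by_id.get(num, []).count(target) for num in nums)
        for id_, nums in nums_by_id.items()
    }


counts = count_cat_per_id(rows)
for id_, num, cat in rows:
    print(id_, num, cat, counts[id_])
```

On this sample the sketch gives 3 for 00111, 2 for 50012, and 0 for every other id. In particular 10131 gets 0, because neither 71010 nor 60010 occurs in id, so the 1 shown for 10131 in the expected output below does not appear to follow from the steps as written.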

The output would look something like this:

+-------------+----------+------+------+
|           id|       num|   cat|   cnt|
+-------------+----------+------+------+
|        00111|     50012|     a|     3|
|        00111|     10131|     a|     3|
|        00111|     11001|     b|     3|
|        10131|     71010|     a|     1|
|        10131|     60010|     c|     1|
|        11001|     53420|     z|     0|
|        11001|     20011|     a|     0|
|        11001|     00000|     q|     0|
|        13403|     33001|     a|     0|
|        13403|     10023|     a|     0|
|        50012|     00111|     a|     2|
+-------------+----------+------+------+

Answer 1

Score: 1

Try this.

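First, left-join the DataFrame to itself so that each num is matched against id. Then, within each id, count the joined rows whose cat is "a".

from pyspark.sql import Window, functions as F

w = Window.partitionBy('id')
# Self-join: the right-hand copy gets a "_right" suffix on every column.
df = (df.join(df.select(*[F.col(x).alias(f'{x}_right') for x in df.columns]),
              on=F.col('num') == F.col('id_right'), how='left')
      # Per id, count the matched rows whose cat is "a".
      .select(*df.columns,
              F.count(F.when(F.col('cat_right') == 'a', 1)).over(w).alias('cnt'))
      # The join repeats a row once per match; collapse the repeats.
      .dropDuplicates())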

Result

+-----+-----+---+---+
|   id|  num|cat|cnt|
+-----+-----+---+---+
|00111|50012|  a|  3|
|00111|10131|  a|  3|
|00111|11001|  b|  3|
|10131|71010|  a|  0|
|10131|60010|  c|  0|
|11001|53420|  z|  0|
|11001|20011|  a|  0|
|11001|00000|  q|  0|
|13403|33001|  a|  0|
|13403|10023|  a|  0|
|50012|00111|  a|  2|
+-----+-----+---+---+
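
To see why the dropDuplicates() step is needed, the same pipeline can be imitated in plain Python on the question's sample rows. This is only a sketch of the join, window count, and de-duplication logic, not Spark code; the function name `left_join_window_count` is made up for illustration.

```python
# Sample rows from the question as (id, num, cat), kept as strings.
rows = [
    ("00111", "50012", "a"), ("00111", "10131", "a"), ("00111", "11001", "b"),
    ("10131", "71010", "a"), ("10131", "60010", "c"),
    ("11001", "53420", "z"), ("11001", "20011", "a"), ("11001", "00000", "q"),
    ("13403", "33001", "a"), ("13403", "10023", "a"),
    ("50012", "00111", "a"),
]


def left_join_window_count(rows, target="a"):
    # Left join: pair every row with each row whose id equals its num.
    # A row without any match is kept once, with None as the right-hand cat.
    joined = []
    for id_, num, cat in rows:
        matches = [r_cat for r_id, _, r_cat in rows if r_id == num]
        for r_cat in matches or [None]:
            joined.append((id_, num, cat, r_cat))

    # Window count per id: joined rows whose right-hand cat is the target.
    cnt = {}
    for id_, _, _, r_cat in joined:
        cnt[id_] = cnt.get(id_, 0) + (r_cat == target)

    # Attach the count, then drop the repeats introduced by the join
    # (dict.fromkeys keeps the first occurrence and the original order).
    result = list(dict.fromkeys((i, n, c, cnt[i]) for i, n, c, _ in joined))
    return joined, result


joined, result = left_join_window_count(rows)
print(len(joined), len(result))  # 16 11
for row in result:
    print(*row)
```

The join turns the 11 input rows into 16, because a row such as (00111, 11001, b) matches three rows with id 11001. Since the count is the same for every row of an id, de-duplicating brings the table back to 11 rows. One caveat: if the input itself contained fully identical rows, de-duplicating would collapse those as well.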

Answer 2

Score: 0

Self-join the DataFrame, matching id against num, and then count the cat values.

from pyspark.sql import Window, functions as f

w = Window.partitionBy('id')

# Left side keeps (id, num); right side exposes num under the name id, plus cat.
df1 = df.drop('cat')
df2 = df.drop('id').withColumnRenamed('num', 'id')
df3 = df1.join(df2, ['id'], 'inner')
df3.show()

# Per id, count the joined rows whose cat is "a".
df3.withColumn('cnt', f.count(f.when(f.col('cat') == f.lit('a'), True)).over(w)) \
   .show()

+-----+-----+---+
|   id|  num|cat|
+-----+-----+---+
|00111|50012|  p|
|00111|10131|  p|
|00111|11001|  p|
|10131|71010|  a|
|10131|60010|  a|
|11001|53420|  b|
|11001|20011|  b|
|11001|00000|  b|
|50012|00111|  a|
+-----+-----+---+

+-----+-----+---+---+
|   id|  num|cat|cnt|
+-----+-----+---+---+
|00111|50012|  p|  0|
|00111|10131|  p|  0|
|00111|11001|  p|  0|
|10131|71010|  a|  2|
|10131|60010|  a|  2|
|11001|53420|  b|  0|
|11001|20011|  b|  0|
|11001|00000|  b|  0|
|50012|00111|  a|  1|
+-----+-----+---+---+
