英文:
Perform multiple column lookups based on ID
问题
我有一个Pyspark数据框,其中包含多个列:
+-------------+----------+------+
|        id  |      num |cat   |
+-------------+----------+------+
|        00111|     50012|     a|
|        00111|     10131|     a|
|        00111|     11001|     b| 
|        10131|     71010|     a|
|        10131|     60010|     c|
|        11001|     53420|     z|
|        11001|     20011|     a|
|        11001|     00000|     q|
|        13403|     33001|     a|
|        13403|     10023|     a|
|        50012|     00111|     a|
+-------------+----------+------+
我想要执行以下操作:
- 对于
id中的每个唯一条目,选择num中的所有值。 - 查找匹配所选
num值的id中的行。 - 选择与所选
num值对应的cat行,并计算不同值的出现次数。 
例如:
id == 00111和num = [50012; 10131; 11001]- 选择
id == num的行 
+-------------+----------+------+
|        id  |      num |cat   |
+-------------+----------+------+
|        10131|     71010|     a|
|        10131|     60010|     c|
|        11001|     53420|     z|
|        11001|     20011|     a|
|        11001|     00000|     q|
|        50012|     00111|     a|
+-------------+----------+------+
- 
选择cat = "a"的次数,这在这种情况下是3。
 - 
对于
id中的每个唯一值重复执行。 
输出将如下所示:
+-------------+----------+------+------+ 
|        id  |      num |cat   |cat   |
+-------------+----------+------+------+ 
|        00111|     50012|     a|     3|
|        00111|     10131|     a|     3|
|        00111|     11001|     b|     3|
|        10131|     71010|     a|     1|
|        10131|     60010|     c|     1|
|        11001|     53420|     z|     0|
|        11001|     20011|     a|     0|
|        11001|     00000|     q|     0|
|        13403|     33001|     a|     0|
|        13403|     10023|     a|     0|
|        50012|     00111|     a|     2|
+-------------+----------+------+------+
这是你所描述的操作的结果。
英文:
I have a pyspark dataframe with several columns:
+-------------+----------+------+
|.        id  |      num |cat   |
+-------------+----------+------+
|        00111|     50012|     a|
|        00111|     10131|     a|
|        00111|     11001|     b| 
|        10131|     71010|     a|
|        10131|     60010|     c|
|        11001|     53420|     z|
|        11001|     20011|     a|
|        11001|     00000|     q|
|        13403|     33001|     a|
|        13403|     10023|     a|
|        50012|     00111|     a|
+-------------+----------+------+
I would like to do the following:
- for each unique entry in 
idselect ALL values innum - look up the rows in 
idthat match the selected values innum - select the rows in 
catthat correspond to thenumvalues selected and count the number of occurrences of different values. 
For example:
- 
id==00111andnum = [50012; 10131; 11001] - 
select rows where
id == num+-------------+----------+------+ |. id | num |cat | +-------------+----------+------+ | 10131| 71010| a| | 10131| 60010| c| | 11001| 53420| z| | 11001| 20011| a| | 11001| 00000| q| | 50012| 00111| a| +-------------+----------+------+ - 
select the number of times cat = "a" which would be 3 in this case.
 - 
repeat for each unique value in
id 
The output would look something like this:
+-------------+----------+------+------+
|.        id  |      num |cat   |cat   |
+-------------+----------+------+------+
|        00111|     50012|     a|     3|
|        00111|     10131|     a|     3|
|        00111|     11001|     b|     3|
|        10131|     71010|     a|     1|
|        10131|     60010|     c|     1|
|        11001|     53420|     z|     0|
|        11001|     20011|     a|     0|
|        11001|     00000|     q|     0|
|        13403|     33001|     a|     0|
|        13403|     10023|     a|     0|
|        50012|     00111|     a|     2|
+-------------+----------+------+------+
答案1
得分: 1
请尝试以下操作。首先,使用左连接将num映射到id,然后仅计算cat为"a"且具有相同id的情况。
w = Window.partitionBy('id')
df = (df.join(df.select(*[F.col(x).alias(f'{x}_right') for x in df.columns]), 
              on=F.col('num') == F.col('id_right'), how='left')
      .select(*df.columns,
              F.count(F.when(F.col('cat_right') == 'a', 1)).over(w).alias('cnt'))
      .dropDuplicates())
结果:
+-----+-----+---+---+
|   id|  num|cat|cnt|
+-----+-----+---+---+
|00111|50012|  a|  3|
|00111|10131|  a|  3|
|00111|11001|  b|  3|
|10131|71010|  a|  0|
|10131|60010|  c|  0|
|11001|53420|  z|  0|
|11001|20011|  a|  0|
|11001|00000|  q|  0|
|13403|33001|  a|  0|
|13403|10023|  a|  0|
|50012|00111|  a|  2|
+-----+-----+---+---+
英文:
Try this.
First, map the num to id with left join, then count when only cat is "a" which has the same id.
w = Window.partitionBy('id')
df = (df.join(df.select(*[F.col(x).alias(f'{x}_right') for x in df.columns]), 
              on=F.col('num') == F.col('id_right'), how='left')
      .select(*df.columns,
              F.count(F.when(F.col('cat_right') == 'a', 1)).over(w).alias('cnt'))
      .dropDuplicates())
Result
+-----+-----+---+---+
|   id|  num|cat|cnt|
+-----+-----+---+---+
|00111|50012|  a|  3|
|00111|10131|  a|  3|
|00111|11001|  b|  3|
|10131|71010|  a|  0|
|10131|60010|  c|  0|
|11001|53420|  z|  0|
|11001|20011|  a|  0|
|11001|00000|  q|  0|
|13403|33001|  a|  0|
|13403|10023|  a|  0|
|50012|00111|  a|  2|
+-----+-----+---+---+
答案2
得分: 0
使用自连接将 id 和 num 连接在一起,然后只计算 cat。
w = Window.partitionBy('id')
df1 = df.drop('cat')
df2 = df.drop('id').withColumnRenamed('num', 'id')
df3 = df1.join(df2, ['id'], 'inner')
df3.show()
df3.withColumn('cnt', f.count(f.when(f.col('cat') == f.lit('a'), True)).over(w)) \
  .show()
+-----+-----+---+
|   id|  num|cat|
+-----+-----+---+
|00111|50012|  p|
|00111|10131|  p|
|00111|11001|  p|
|10131|71010|  a|
|10131|60010|  a|
|11001|53420|  b|
|11001|20011|  b|
|11001|00000|  b|
|50012|00111|  a|
+-----+-----+---+
+-----+-----+---+---+
|   id|  num|cat|cnt|
+-----+-----+---+---+
|00111|50012|  p|  0|
|00111|10131|  p|  0|
|00111|11001|  p|  0|
|10131|71010|  a|  2|
|10131|60010|  a|  2|
|11001|53420|  b|  0|
|11001|20011|  b|  0|
|11001|00000|  b|  0|
|50012|00111|  a|  1|
+-----+-----+---+---+
英文:
Join id and num by self, and just count the cat.
w = Window.partitionBy('id')
df1 = df.drop('cat')
df2 = df.drop('id').withColumnRenamed('num', 'id')
df3 = df1.join(df2, ['id'], 'inner')
df3.show()
df3.withColumn('cnt', f.count(f.when(f.col('cat') == f.lit('a'), True)).over(w)) \
.show()
+-----+-----+---+
|   id|  num|cat|
+-----+-----+---+
|00111|50012|  p|
|00111|10131|  p|
|00111|11001|  p|
|10131|71010|  a|
|10131|60010|  a|
|11001|53420|  b|
|11001|20011|  b|
|11001|00000|  b|
|50012|00111|  a|
+-----+-----+---+
+-----+-----+---+---+
|   id|  num|cat|cnt|
+-----+-----+---+---+
|00111|50012|  p|  0|
|00111|10131|  p|  0|
|00111|11001|  p|  0|
|10131|71010|  a|  2|
|10131|60010|  a|  2|
|11001|53420|  b|  0|
|11001|20011|  b|  0|
|11001|00000|  b|  0|
|50012|00111|  a|  1|
+-----+-----+---+---+
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。


评论