Perform multiple column lookups based on ID
Question
I have a PySpark DataFrame with several columns:
+-----+-----+---+
|   id|  num|cat|
+-----+-----+---+
|00111|50012|  a|
|00111|10131|  a|
|00111|11001|  b|
|10131|71010|  a|
|10131|60010|  c|
|11001|53420|  z|
|11001|20011|  a|
|11001|00000|  q|
|13403|33001|  a|
|13403|10023|  a|
|50012|00111|  a|
+-----+-----+---+
I would like to do the following:
- for each unique entry in id, select ALL values in num
- look up the rows in id that match the selected values in num
- select the rows in cat that correspond to the num values selected and count the number of occurrences of different values.
For example:
- id == 00111 and num = [50012; 10131; 11001]
- select the rows whose id equals one of those num values:

+-----+-----+---+
|   id|  num|cat|
+-----+-----+---+
|10131|71010|  a|
|10131|60010|  c|
|11001|53420|  z|
|11001|20011|  a|
|11001|00000|  q|
|50012|00111|  a|
+-----+-----+---+

- count the number of times cat = "a", which would be 3 in this case.
- repeat for each unique value in id
The output would look something like this:
+-----+-----+---+---+
|   id|  num|cat|cnt|
+-----+-----+---+---+
|00111|50012|  a|  3|
|00111|10131|  a|  3|
|00111|11001|  b|  3|
|10131|71010|  a|  1|
|10131|60010|  c|  1|
|11001|53420|  z|  0|
|11001|20011|  a|  0|
|11001|00000|  q|  0|
|13403|33001|  a|  0|
|13403|10023|  a|  0|
|50012|00111|  a|  2|
+-----+-----+---+---+
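To make the intended rule concrete, here is a minimal plain-Python sketch of the three steps, run on the sample rows above. The function name `count_cat_per_id` and its `target` parameter are made up for illustration. It is only meant as a way to check expected counts on a small sample, not as a replacement for a Spark solution. Read literally, the rule gives 0 for id 10131, because neither 71010 nor 60010 occurs in the id column, whereas the table above shows 1.

```python
from collections import defaultdict

# Sample rows from the question as (id, num, cat); strings keep the leading zeros.
rows = [
    ("00111", "50012", "a"),
    ("00111", "10131", "a"),
    ("00111", "11001", "b"),
    ("10131", "71010", "a"),
    ("10131", "60010", "c"),
    ("11001", "53420", "z"),
    ("11001", "20011", "a"),
    ("11001", "00000", "q"),
    ("13403", "33001", "a"),
    ("13403", "10023", "a"),
    ("50012", "00111", "a"),
]


def count_cat_per_id(rows, target="a"):
    """For each id, count rows with cat == target among rows whose id is one of its num values."""
    # How many rows of each id have cat == target.
    hits_by_id = defaultdict(int)
    for id_, _num, cat in rows:
        if cat == target:
            hits_by_id[id_] += 1

    # For each id, add up the hits of every num value it points to.
    counts = defaultdict(int)
    for id_, num, _cat in rows:
        counts[id_] += hits_by_id.get(num, 0)
    return dict(counts)


counts = count_cat_per_id(rows)
print(counts)
# {'00111': 3, '10131': 0, '11001': 0, '13403': 0, '50012': 2}
```

If the same num appeared twice under one id, this sketch would count its matches twice; whether that is wanted depends on the data.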
Answer 1
Score: 1
Try this.
First, left-join the DataFrame to itself so that each num is matched against id, then, within each id, count only the matched rows whose cat is "a".
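from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('id')
# Self-join: match each row's num against the id of a renamed copy of df
df = (df.join(df.select(*[F.col(x).alias(f'{x}_right') for x in df.columns]),
              on=F.col('num') == F.col('id_right'), how='left')
        # Count the matched rows with cat 'a' per id, then collapse the join fan-out
        .select(*df.columns,
                F.count(F.when(F.col('cat_right') == 'a', 1)).over(w).alias('cnt'))
        .dropDuplicates())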
Result
+-----+-----+---+---+
| id| num|cat|cnt|
+-----+-----+---+---+
|00111|50012| a| 3|
|00111|10131| a| 3|
|00111|11001| b| 3|
|10131|71010| a| 0|
|10131|60010| c| 0|
|11001|53420| z| 0|
|11001|20011| a| 0|
|11001|00000| q| 0|
|13403|33001| a| 0|
|13403|10023| a| 0|
|50012|00111| a| 2|
+-----+-----+---+---+
Answer 2
Score: 0
Self-join the DataFrame, matching id against num, and then just count the cat values.
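from pyspark.sql import functions as f
from pyspark.sql import Window

w = Window.partitionBy('id')
df1 = df.drop('cat')                                # left side: id, num
df2 = df.drop('id').withColumnRenamed('num', 'id')  # right side: num (renamed to id), cat
df3 = df1.join(df2, ['id'], 'inner')                # keep only ids that also occur as a num
df3.show()
# Count the joined rows with cat 'a' within each id
df3.withColumn('cnt', f.count(f.when(f.col('cat') == f.lit('a'), True)).over(w)) \
   .show()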
+-----+-----+---+
| id| num|cat|
+-----+-----+---+
|00111|50012| p|
|00111|10131| p|
|00111|11001| p|
|10131|71010| a|
|10131|60010| a|
|11001|53420| b|
|11001|20011| b|
|11001|00000| b|
|50012|00111| a|
+-----+-----+---+
+-----+-----+---+---+
| id| num|cat|cnt|
+-----+-----+---+---+
|00111|50012| p| 0|
|00111|10131| p| 0|
|00111|11001| p| 0|
|10131|71010| a| 2|
|10131|60010| a| 2|
|11001|53420| b| 0|
|11001|20011| b| 0|
|11001|00000| b| 0|
|50012|00111| a| 1|
+-----+-----+---+---+