根据ID执行多列查找。

huangapple go评论63阅读模式
英文:

Perform multiple column lookups based on ID

问题

我有一个Pyspark数据框,其中包含多个列:

+-------------+----------+------+
|        id  |      num |cat   |
+-------------+----------+------+
|        00111|     50012|     a|
|        00111|     10131|     a|
|        00111|     11001|     b| 
|        10131|     71010|     a|
|        10131|     60010|     c|
|        11001|     53420|     z|
|        11001|     20011|     a|
|        11001|     00000|     q|
|        13403|     33001|     a|
|        13403|     10023|     a|
|        50012|     00111|     a|
+-------------+----------+------+

我想要执行以下操作:

  1. 对于id中的每个唯一条目,选择num中的所有值。
  2. 查找匹配所选num值的id中的行。
  3. 选择与所选num值对应的cat行,并计算不同值的出现次数。

例如:

  1. id == 00111num = [50012; 10131; 11001]
  2. 选择id == num的行
+-------------+----------+------+
|        id  |      num |cat   |
+-------------+----------+------+
|        10131|     71010|     a|
|        10131|     60010|     c|
|        11001|     53420|     z|
|        11001|     20011|     a|
|        11001|     00000|     q|
|        50012|     00111|     a|
+-------------+----------+------+
  1. 选择cat = "a"的次数,这在这种情况下是3。

  2. 对于id中的每个唯一值重复执行。

输出将如下所示:

+-------------+----------+------+------+ 
|        id  |      num |cat   |cat   |
+-------------+----------+------+------+ 
|        00111|     50012|     a|     3|
|        00111|     10131|     a|     3|
|        00111|     11001|     b|     3|
|        10131|     71010|     a|     1|
|        10131|     60010|     c|     1|
|        11001|     53420|     z|     0|
|        11001|     20011|     a|     0|
|        11001|     00000|     q|     0|
|        13403|     33001|     a|     0|
|        13403|     10023|     a|     0|
|        50012|     00111|     a|     2|
+-------------+----------+------+------+

这是你所描述的操作的结果。

英文:

I have a pyspark dataframe with several columns:

+-------------+----------+------+
|.        id  |      num |cat   |
+-------------+----------+------+
|        00111|     50012|     a|
|        00111|     10131|     a|
|        00111|     11001|     b| 
|        10131|     71010|     a|
|        10131|     60010|     c|
|        11001|     53420|     z|
|        11001|     20011|     a|
|        11001|     00000|     q|
|        13403|     33001|     a|
|        13403|     10023|     a|
|        50012|     00111|     a|
+-------------+----------+------+

I would like to do the following:

  1. for each unique entry in id select ALL values in num
  2. look up the rows in id that match the selected values in num
  3. select the rows in cat that correspond to the num values selected and count the number of occurrences of different values.

For example:

  1. id==00111 and num = [50012; 10131; 11001]

  2. select rows where id == num

    +-------------+----------+------+
    |.        id  |      num |cat   |
    +-------------+----------+------+
    |        10131|     71010|     a|
    |        10131|     60010|     c|
    |        11001|     53420|     z|
    |        11001|     20011|     a|
    |        11001|     00000|     q|
    |        50012|     00111|     a|
    +-------------+----------+------+
    
  3. select the number of times cat = "a" which would be 3 in this case.

  4. repeat for each unique value in id

The output would look something like this:

+-------------+----------+------+------+
|.        id  |      num |cat   |cat   |
+-------------+----------+------+------+
|        00111|     50012|     a|     3|
|        00111|     10131|     a|     3|
|        00111|     11001|     b|     3|
|        10131|     71010|     a|     1|
|        10131|     60010|     c|     1|
|        11001|     53420|     z|     0|
|        11001|     20011|     a|     0|
|        11001|     00000|     q|     0|
|        13403|     33001|     a|     0|
|        13403|     10023|     a|     0|
|        50012|     00111|     a|     2|
+-------------+----------+------+------+

答案1

得分: 1

请尝试以下操作。首先,使用左连接将num映射到id,然后仅计算cat为"a"且具有相同id的情况。

w = Window.partitionBy('id')
df = (df.join(df.select(*[F.col(x).alias(f'{x}_right') for x in df.columns]), 
              on=F.col('num') == F.col('id_right'), how='left')
      .select(*df.columns,
              F.count(F.when(F.col('cat_right') == 'a', 1)).over(w).alias('cnt'))
      .dropDuplicates())

结果:

+-----+-----+---+---+
|   id|  num|cat|cnt|
+-----+-----+---+---+
|00111|50012|  a|  3|
|00111|10131|  a|  3|
|00111|11001|  b|  3|
|10131|71010|  a|  0|
|10131|60010|  c|  0|
|11001|53420|  z|  0|
|11001|20011|  a|  0|
|11001|00000|  q|  0|
|13403|33001|  a|  0|
|13403|10023|  a|  0|
|50012|00111|  a|  2|
+-----+-----+---+---+
英文:

Try this.

First, map the num to id with left join, then count when only cat is "a" which has the same id.

w = Window.partitionBy('id')
df = (df.join(df.select(*[F.col(x).alias(f'{x}_right') for x in df.columns]), 
              on=F.col('num') == F.col('id_right'), how='left')
      .select(*df.columns,
              F.count(F.when(F.col('cat_right') == 'a', 1)).over(w).alias('cnt'))
      .dropDuplicates())

Result

+-----+-----+---+---+
|   id|  num|cat|cnt|
+-----+-----+---+---+
|00111|50012|  a|  3|
|00111|10131|  a|  3|
|00111|11001|  b|  3|
|10131|71010|  a|  0|
|10131|60010|  c|  0|
|11001|53420|  z|  0|
|11001|20011|  a|  0|
|11001|00000|  q|  0|
|13403|33001|  a|  0|
|13403|10023|  a|  0|
|50012|00111|  a|  2|
+-----+-----+---+---+

答案2

得分: 0

使用自连接将 idnum 连接在一起,然后只计算 cat

w = Window.partitionBy('id')

df1 = df.drop('cat')
df2 = df.drop('id').withColumnRenamed('num', 'id')
df3 = df1.join(df2, ['id'], 'inner')
df3.show()

df3.withColumn('cnt', f.count(f.when(f.col('cat') == f.lit('a'), True)).over(w)) \
  .show()

+-----+-----+---+
|   id|  num|cat|
+-----+-----+---+
|00111|50012|  p|
|00111|10131|  p|
|00111|11001|  p|
|10131|71010|  a|
|10131|60010|  a|
|11001|53420|  b|
|11001|20011|  b|
|11001|00000|  b|
|50012|00111|  a|
+-----+-----+---+

+-----+-----+---+---+
|   id|  num|cat|cnt|
+-----+-----+---+---+
|00111|50012|  p|  0|
|00111|10131|  p|  0|
|00111|11001|  p|  0|
|10131|71010|  a|  2|
|10131|60010|  a|  2|
|11001|53420|  b|  0|
|11001|20011|  b|  0|
|11001|00000|  b|  0|
|50012|00111|  a|  1|
+-----+-----+---+---+
英文:

Join id and num by self, and just count the cat.

w = Window.partitionBy('id')
df1 = df.drop('cat')
df2 = df.drop('id').withColumnRenamed('num', 'id')
df3 = df1.join(df2, ['id'], 'inner')
df3.show()
df3.withColumn('cnt', f.count(f.when(f.col('cat') == f.lit('a'), True)).over(w)) \
.show()
+-----+-----+---+
|   id|  num|cat|
+-----+-----+---+
|00111|50012|  p|
|00111|10131|  p|
|00111|11001|  p|
|10131|71010|  a|
|10131|60010|  a|
|11001|53420|  b|
|11001|20011|  b|
|11001|00000|  b|
|50012|00111|  a|
+-----+-----+---+
+-----+-----+---+---+
|   id|  num|cat|cnt|
+-----+-----+---+---+
|00111|50012|  p|  0|
|00111|10131|  p|  0|
|00111|11001|  p|  0|
|10131|71010|  a|  2|
|10131|60010|  a|  2|
|11001|53420|  b|  0|
|11001|20011|  b|  0|
|11001|00000|  b|  0|
|50012|00111|  a|  1|
+-----+-----+---+---+

huangapple
  • 本文由 发表于 2023年3月8日 15:13:55
  • 转载请务必保留本文链接:https://go.coder-hub.com/75670230.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定