How can I check whether a column's value exists in another column?

Question

I have a DataFrame with one column of page names and another column containing JSON with a page list. I would like to check whether the page name appears in the page list and filter out the row if it doesn't.

How can I do this?

Example DataFrame:

+---------+--------------------------------+
|page name|page_list                       |
+---------+--------------------------------+
|home     |{page_list:['home','something']}|
|about    |{page_list:['something']}       |
+---------+--------------------------------+

Answer 1

Score: 1

Assuming that your DataFrame schema is like the following (here the page_list column is a string):

df.printSchema()
#root
# |-- page_name: string (nullable = true)
# |-- page_list: string (nullable = true)

You can use from_json to get the page_list as an array of strings. Then use array_contains to check if the page_name is in this list.

The trick is that you will have to use expr to pass a column value as a parameter to array_contains.

from pyspark.sql.types import StructType, StructField, ArrayType, StringType
from pyspark.sql.functions import expr, from_json

df.withColumn(
    "flag",
    from_json(
        "page_list", 
        schema=StructType([StructField("page_list", ArrayType(StringType()))])
    )["page_list"]
).withColumn(
    "flag",
    expr("array_contains(flag, page_name)")
).show(truncate=False)
#+---------+----------------------------------+-----+
#|page_name|page_list                         |flag |
#+---------+----------------------------------+-----+
#|home     |{"page_list":["home","something"]}|true |
#|about    |{"page_list":["something"]}       |false|
#+---------+----------------------------------+-----+
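
To make the per-row logic concrete, here is a minimal plain-Python sketch (not PySpark) of what parsing the JSON and then checking array membership computes for each row. It assumes page_list holds valid JSON with double-quoted keys and values, as in the output above. The helper name page_in_list and the sample rows are hypothetical, and returning False for unparseable input is a simplification of this sketch.

```python
import json

def page_in_list(page_name, page_list_json):
    """Return True if page_name is an element of the JSON "page_list" array."""
    try:
        parsed = json.loads(page_list_json)
    except (TypeError, ValueError):
        # Unparseable input is treated as "not found" (an assumption of this sketch)
        return False
    if not isinstance(parsed, dict):
        return False
    pages = parsed.get("page_list")
    return isinstance(pages, list) and page_name in pages

# Hypothetical sample rows mirroring the example DataFrame
rows = [
    ("home", '{"page_list":["home","something"]}'),
    ("about", '{"page_list":["something"]}'),
]

flags = [page_in_list(name, pl) for name, pl in rows]
kept = [name for name, pl in rows if page_in_list(name, pl)]
print(flags)  # [True, False]
print(kept)   # ['home']
```

In Spark itself, the filtering the question asks for can reuse the same boolean expression, since DataFrame.filter accepts either a Column or a SQL expression string.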

Answer 2

Score: 0

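Here's one way to do it (this assumes page_list is already a map column rather than a JSON string, as the output below suggests):

df2 = (df
      .rdd
      .map(lambda x: (x.page_name, x.page_list, x.page_name in x.page_list['page_list']))
      .toDF(["page_name", "page_list", "flag"]))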

df2.show()

+---------+--------------------+-----+
|page_name|           page_list| flag|
+---------+--------------------+-----+
|     home|[page_list -> [ho...| true|
|    about|[page_list -> [so...|false|
+---------+--------------------+-----+

Answer 3

Score: 0

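If the page_list column is a string, you can simply use the contains function, wrapping the page name in single quotes so that only a whole quoted entry matches:

from pyspark.sql.functions import col, concat, lit

quoted_page_name = concat(lit("'"), col("page_name"), lit("'"))
df.withColumn("flag", col("page_list").contains(quoted_page_name)).show(truncate=False)

This gives: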

+---------+----------------------------------+-----+
|page_name|page_list                         |flag |
+---------+----------------------------------+-----+
|home     |{page_list: ['home', 'something']}|true |
|about    |{page_list: ['something']}        |false|
+---------+----------------------------------+-----+
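
One caveat worth illustrating: a substring check like this depends on how the values are quoted inside the string. The plain-Python sketch below uses Python's `in` operator as a stand-in for Spark's contains, and the helper quoted_contains is hypothetical. It shows the same check succeeding on single-quoted data and failing on double-quoted JSON unless the quote character is changed.

```python
def quoted_contains(page_name, page_list_str, quote="'"):
    """Substring check for page_name wrapped in the given quote character."""
    return f"{quote}{page_name}{quote}" in page_list_str

single_quoted = "{page_list: ['home', 'something']}"
double_quoted = '{"page_list":["home","something"]}'

print(quoted_contains("home", single_quoted))             # True
print(quoted_contains("home", double_quoted))             # False: values use double quotes
print(quoted_contains("home", double_quoted, quote='"'))  # True
print(quoted_contains("some", single_quoted))             # False: quotes prevent a partial match
```

If the quoting style of page_list is not known in advance, parsing the string into an array first is likely the more robust choice.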

huangapple
  • Posted on January 6, 2020 at 22:13:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59613621.html