How can I query rows where one column's value exists in another column?
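Question
You can do this with the pandas library in Python. Here is one approach:
import pandas as pd

# Build a sample DataFrame (each page_list value is a dict here, not a JSON string)
data = {'page name': ['home', 'about'],
        'page_list': [{'page_list': ['home', 'something']},
                      {'page_list': ['something']}]}
df = pd.DataFrame(data)

# Check whether the page name appears in the page list
def check_page(page_name, page_list):
    return page_name in page_list['page_list']

# Apply the function to every row and keep only the rows where it returns True
filtered_df = df[df.apply(lambda row: check_page(row['page name'], row['page_list']), axis=1)]

# Print the filtered DataFrame
print(filtered_df)
This code keeps the rows whose page name appears in the page list and prints them.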
I have a dataframe with one column that contains page names and another column that contains JSON with a page list. I would like to check whether the page name appears in the page list and filter out the row if it doesn't.
How can I do this?
Example df:
+---------+--------------------------------+
|page name|page_list                       |
+---------+--------------------------------+
|home     |{page_list:['home','something']}|
|about    |{page_list:['something']}       |
+---------+--------------------------------+
Answer 1
Score: 1
Assuming that your DataFrame schema is like the following (here the page_list column is a string):
df.printSchema()
#root
# |-- page_name: string (nullable = true)
# |-- page_list: string (nullable = true)
You can use from_json to get the page_list as an array of strings. Then use array_contains to check if the page_name is in this list.
The trick is that you will have to use expr to pass a column value as a parameter to array_contains.
from pyspark.sql.types import StructType, StructField, ArrayType, StringType
from pyspark.sql.functions import expr, from_json

# Parse the JSON string and pull out the "page_list" array into a temporary "flag" column,
# then overwrite "flag" with the result of the membership test
df.withColumn(
    "flag",
    from_json(
        "page_list",
        schema=StructType([StructField("page_list", ArrayType(StringType()))])
    )["page_list"]
).withColumn(
    "flag",
    expr("array_contains(flag, page_name)")
).show(truncate=False)
#+---------+----------------------------------+-----+
#|page_name|page_list                         |flag |
#+---------+----------------------------------+-----+
#|home     |{"page_list":["home","something"]}|true |
#|about    |{"page_list":["something"]}       |false|
#+---------+----------------------------------+-----+
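To make the row-level logic of this answer concrete outside Spark, here is a minimal sketch in plain Python. It only mirrors the idea (parse the JSON string, then test membership in the array) and is not how Spark evaluates from_json or array_contains internally. The sample rows and the helper name name_in_page_list are assumptions made for illustration, and the sketch treats unparsable input as "not found", which is a choice of this example.

```python
import json

# Hypothetical sample rows: (page_name, page_list as a JSON string).
rows = [
    ("home", '{"page_list": ["home", "something"]}'),
    ("about", '{"page_list": ["something"]}'),
]

def name_in_page_list(page_name, page_list_json):
    """Return True if page_name is an element of the parsed "page_list" array."""
    try:
        parsed = json.loads(page_list_json)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON: this sketch treats it as "not found".
        return False
    if not isinstance(parsed, dict):
        return False
    pages = parsed.get("page_list")
    return isinstance(pages, list) and page_name in pages

# Compute one flag per row, then keep only the rows whose flag is True.
flags = [name_in_page_list(name, raw) for name, raw in rows]
kept = [row for row, flag in zip(rows, flags) if flag]

print(flags)
print(kept)
```

Note that json.loads is strict: a value written as {page_list:['home']} (unquoted key, single quotes) is rejected, so this helper returns False for it.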
Answer 2
Score: 0
Here's one way to do it:
# Assumes page_list is a map column (not a JSON string), so it can be indexed by key
df2 = (df
       .rdd
       .map(lambda x: (x.page_name, x.page_list, x.page_name in x.page_list['page_list']))
       .toDF(["page_name", "page_list", "flag"]))
df2.show()
+---------+--------------------+-----+
|page_name|           page_list| flag|
+---------+--------------------+-----+
|     home|[page_list -> [ho...| true|
|    about|[page_list -> [so...|false|
+---------+--------------------+-----+
Answer 3
Score: 0
If the column page_list is of type string, you could simply use the contains function like this:
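from pyspark.sql.functions import col, concat, lit

# Wrap the page name in single quotes so it only matches a whole quoted entry in the string
quoted_page_name = concat(lit("'"), col("page_name"), lit("'"))
df.withColumn("flag", col("page_list").contains(quoted_page_name)).show(truncate=False)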
Gives:
+---------+----------------------------------+-----+
|page_name|page_list                         |flag |
+---------+----------------------------------+-----+
|home     |{page_list: ['home', 'something']}|true |
|about    |{page_list: ['something']}        |false|
+---------+----------------------------------+-----+
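The reason for wrapping the page name in quotes before calling contains can be shown with a small plain-Python sketch. It is an analogy for the substring test, not PySpark code; the helper name contains_quoted and the sample strings are assumptions made for illustration.

```python
def contains_quoted(page_list_text, page_name, quote="'"):
    """Substring test on the quoted name, mirroring the contains() approach above."""
    return f"{quote}{page_name}{quote}" in page_list_text

# Hypothetical page_list value in which "home" appears only as part of "homepage".
text = "{page_list: ['homepage', 'something']}"

bare_match = "home" in text                       # bare substring also matches inside 'homepage'
quoted_match = contains_quoted(text, "home")      # quoted form requires the whole entry
exact_match = contains_quoted(text, "homepage")

print(bare_match, quoted_match, exact_match)

# If the column holds strict JSON, entries are wrapped in double quotes instead,
# so the quote character has to match the data.
json_text = '{"page_list": ["home", "something"]}'
print(contains_quoted(json_text, "home"), contains_quoted(json_text, "home", quote='"'))
```

The quoted test avoids partial matches but still depends on the exact quoting style used in the string, which is why parsing the JSON first is the more robust option when the format can vary.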