Python: how to find duplicates

Question

When I try to use the duplicated function on my DataFrame, it does not work.

Here is my query example:

query = """
SELECT
  variable1,
  variable2,
  variable3,
  variable4

FROM source.table
"""
df = spark.sql(query)

When I try to find duplicates with the following code, it fails:

# Selecting duplicate rows based
# on list of column names
duplicate = df[df.duplicated(['variable1', 'variable2'])]

print("Duplicate Rows based on variable1 and variable2:")

# Print the resultant Dataframe
duplicate

I got this error:

AttributeError: 'DataFrame' object has no attribute 'duplicated'

Do you know why? And how can I create a pandas DataFrame from my current DataFrame?

Answer 1

Score: 0

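Spark DataFrames and pandas DataFrames (and pandas-on-Spark DataFrames) are different types. The object returned by spark.sql is a Spark DataFrame, which has no duplicated method, hence the AttributeError. If you have pandas installed and available, you could convert to a pandas-on-Spark DataFrame and then use its duplicated method. I believe the code would be something like this: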

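import pyspark.pandas as ps

# On Spark 3.3 and later, df.pandas_api() is the preferred name for this conversion.
ps_df = df.to_pandas_on_spark()
# Keep only the rows flagged as duplicates on the two key columns.
ps_df[ps_df.duplicated(['variable1', 'variable2'])]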
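
As for the second part of the question (getting a plain pandas DataFrame), a Spark DataFrame can be converted with df.toPandas(). That call collects every row onto the driver, so it is only reasonable when the result fits in memory. The sketch below is not tied to the original table: it uses a small hand-made pandas DataFrame as a stand-in for the result of df.toPandas(), with made-up sample values, and shows how duplicated behaves with its default keep="first" versus keep=False.

```python
import pandas as pd

# Stand-in for `pdf = df.toPandas()`; the sample values are invented for illustration.
pdf = pd.DataFrame(
    {
        "variable1": ["a", "a", "b", "c", "c"],
        "variable2": [1, 1, 2, 3, 4],
        "variable3": [10, 20, 30, 40, 50],
        "variable4": ["x", "y", "z", "w", "v"],
    }
)

key_cols = ["variable1", "variable2"]

# Default keep="first": flags only the second and later occurrences of each key.
later_occurrences = pdf[pdf.duplicated(key_cols)]

# keep=False: flags every row whose key appears more than once.
all_duplicates = pdf[pdf.duplicated(key_cols, keep=False)]

print("Duplicate rows based on variable1 and variable2 (later occurrences only):")
print(later_occurrences)
print("All rows sharing a duplicated key:")
print(all_duplicates)
```

With the default setting, the first row of each duplicated group is not reported, which is easy to miss; keep=False is usually what you want if the goal is to inspect every row involved in a duplicate.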

huangapple
  • Posted on February 24, 2023, 05:14:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75550387.html