Python: how to find duplicate rows in a DataFrame
Question
When I try to use the duplicated function, it does not work.
Here is my query example:
query = """
SELECT
variable1,
variable2,
variable3,
variable4
FROM source.table
"""
df = spark.sql(query)
When I try to find dups using this function, it does not work:
# Select duplicate rows based on a list of column names
duplicate = df[df.duplicated(['variable1', 'variable2'])]
print("Duplicate Rows based on variable1 and variable2:")
# Print the resultant Dataframe
duplicate
I got this error:
AttributeError: 'DataFrame' object has no attribute 'duplicated'
Do you know why? And how can I create a pandas DataFrame from my current DataFrame?
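Answer 1
Score: 0
Spark DataFrames, pandas DataFrames, and pandas-on-Spark DataFrames are different types. A Spark DataFrame has no duplicated method, which is why you get the AttributeError. If you have pandas installed and available, you could convert to a pandas-on-Spark DataFrame and use its duplicated method. I believe the code would be something like this:
import pyspark.pandas as ps  # pandas API on Spark (Spark 3.2+)
# pandas_api() replaces to_pandas_on_spark(), which is deprecated in newer Spark releases
ps_df = df.pandas_api()
# Keep only the rows flagged as repeats of an earlier (variable1, variable2) pair
ps_df[ps_df.duplicated(['variable1', 'variable2'])]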
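On the second part of the question (getting a plain pandas DataFrame): a Spark DataFrame can be converted with df.toPandas(), which collects every row onto the driver, so it only suits data that fits in memory. The sketch below shows what duplicated does once you have a pandas DataFrame. The sample data and the helper name duplicate_rows are made up for illustration and are not from the question. In practice pdf would come from df.toPandas().

```python
from typing import Sequence

import pandas as pd


def duplicate_rows(pdf: pd.DataFrame, cols: Sequence[str]) -> pd.DataFrame:
    """Return every row whose values in `cols` occur more than once."""
    # keep=False flags all members of a duplicate group,
    # not only the second and later occurrences (the default, keep='first').
    mask = pdf.duplicated(subset=list(cols), keep=False)
    return pdf[mask]


# Hypothetical stand-in for the result of `df.toPandas()`.
pdf = pd.DataFrame(
    {
        "variable1": [1, 1, 2, 3],
        "variable2": ["a", "a", "b", "c"],
        "variable3": [10, 20, 30, 40],
        "variable4": [0.1, 0.2, 0.3, 0.4],
    }
)

dups = duplicate_rows(pdf, ["variable1", "variable2"])
print("Duplicate rows based on variable1 and variable2:")
print(dups)
```

With the Spark DataFrame from the question, the call would be duplicate_rows(df.toPandas(), ['variable1', 'variable2']). Drop keep=False if you only want the repeated rows and not the first occurrence of each group.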