February 24, 2023, 05:14:01

Python: how to find duplicate rows

Question

When I try to use the duplicated function, it does not work.

Here is my query example:

query = """
SELECT
  variable1,
  variable2,
  variable3,
  variable4
FROM source.table
"""
df = spark.sql(query)

When I try to find duplicates with the following code, it fails:

# Selecting duplicate rows based
# on list of column names
duplicate = df[df.duplicated(['variable1', 'variable2'])]
print("Duplicate Rows based on variable1 and variable2:")
# Print the resultant Dataframe
duplicate

I got this error:

AttributeError: 'DataFrame' object has no attribute 'duplicated'

Do you know why? And how can I create a pandas DataFrame from my current DataFrame?

Answer 1

Score: 0

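Spark DataFrames and pandas DataFrames (and pandas-on-Spark DataFrames) are different types, which is why the Spark DataFrame returned by spark.sql has no duplicated method. If you have pandas installed and available, you may be interested in converting to a pandas-on-Spark DataFrame and then using its duplicated method. I believe the code would be something like this: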

import pyspark.pandas as ps
# Newer Spark releases deprecate to_pandas_on_spark() in favor of df.pandas_api()
ps_df = df.to_pandas_on_spark()
ps_df[ps_df.duplicated(['variable1', 'variable2'])]
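
As a small illustration of what duplicated returns once the data is in a pandas-style frame, here is a sketch in plain pandas. The sample rows are hypothetical; only the column names come from the question. If the query result is small enough to fit in driver memory, df.toPandas() would produce a frame of this kind from the Spark DataFrame.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
# With a real Spark DataFrame, pdf = df.toPandas() would build this frame
# (it collects every row to the driver, so it only suits small results).
pdf = pd.DataFrame(
    {
        "variable1": ["a", "a", "b", "a"],
        "variable2": [1, 1, 2, 1],
        "variable3": [10, 20, 30, 40],
    }
)

subset = ["variable1", "variable2"]

# Default keep="first": the first row of each duplicate group is not flagged,
# so only the later copies are selected.
later_copies = pdf[pdf.duplicated(subset)]

# keep=False flags every row that belongs to a duplicate group.
all_copies = pdf[pdf.duplicated(subset, keep=False)]

print(later_copies)
print(all_copies)
```

With the default keep="first", the first row of each duplicate group is left out of the result, so keep=False is the option to use when every copy is needed.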
