June 8, 2023, 16:38:57

Data Profiling using PySpark

Question

I'm trying to create a PySpark function that takes a DataFrame as input and returns a data-profile report. I have already used the describe and summary functions, which return results such as min, max, and count, but I need a more detailed report that includes unique values, along with some visuals.

If anyone knows anything that can help, feel free to comment below.

A dynamic function that produces the output described above would be helpful.

Answer 1

Score: 1

Option 1:

If the Spark DataFrame is not too big, you can convert it to pandas and use a pandas profiling library such as sweetviz, e.g.:

import sweetviz as sv

# `data` is the Spark DataFrame; toPandas() collects it to the driver
my_report = sv.analyze(source=(data.toPandas(), "EDA Report"))
my_report.show_notebook()  # show the report in a notebook cell
my_report.show_html(filepath="report.html")  # write the report to an HTML file

It looks like: [image: sweetviz report]

The sweetviz documentation covers more features, such as how to compare populations.
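Option 1 only works when the data fits in the driver's memory, because toPandas() collects every row. A minimal sketch of one way to guard against that is to sample the Spark DataFrame down before converting it. The helper name `to_pandas_capped` and the `max_rows` default are hypothetical, not part of sweetviz or PySpark; the sketch relies only on the DataFrame methods `count()`, `sample()`, and `toPandas()`.

```python
def to_pandas_capped(sdf, max_rows=100_000, seed=42):
    """Convert a Spark DataFrame to pandas, sampling first if it is large.

    `max_rows` is an illustrative threshold (an assumption); tune it to the
    memory available on the driver.
    """
    total = sdf.count()  # triggers one Spark job
    if total > max_rows:
        # sample() is approximate: the result has roughly max_rows rows,
        # not exactly max_rows.
        fraction = max_rows / total
        sdf = sdf.sample(withReplacement=False, fraction=fraction, seed=seed)
    return sdf.toPandas()
```

With this helper, the call above would become `sv.analyze(source=(to_pandas_capped(data), "EDA Report"))`. A report built from a sample describes the sample, so rare values may be missing from it.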

Option 2:

Use a profiler that accepts a pyspark.sql.DataFrame directly, e.g. ydata-profiling.

Answer 2

Score: 0

ydata-profiling currently supports Spark DataFrames, so it is probably the most suitable choice:

from pyspark.sql import SparkSession
from ydata_profiling import ProfileReport

# Start (or reuse) a Spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark profiling example") \
    .getOrCreate()

# Replace the placeholder with the path to your CSV file
df = spark.read.csv("{insert-csv-file-path}")
df.printSchema()

# Build the profile from the Spark DataFrame and write it out as HTML
report = ProfileReport(df, title="Profiling pyspark DataFrame")
report.to_file("profile.html")

An example report looks like this: https://ydata-profiling.ydata.ai/examples/master/census/census_report.html
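If installing a profiling library is not an option, the unique-value counts the question asks for can also be computed with plain DataFrame methods. The following is a rough sketch, not part of ydata-profiling: the function name `profile_columns` and the keys of the returned dictionaries are hypothetical. It uses only `columns`, `count()`, `select()`, `dropna()`, and `distinct()`, and it assumes column names without dots or other characters that would need backtick quoting.

```python
def profile_columns(sdf):
    """Return one dict per column with row, missing, and unique-value counts.

    Hypothetical helper for illustration. It runs several Spark jobs per
    column, so it favors clarity over speed.
    """
    total = sdf.count()
    report = []
    for name in sdf.columns:
        # dropna() removes rows where this column is null
        # (Spark also treats NaN as missing for numeric columns).
        present = sdf.select(name).dropna()
        present_count = present.count()
        report.append({
            "column": name,
            "count": total,
            "missing": total - present_count,
            "unique_values": present.distinct().count(),
        })
    return report
```

Calling `profile_columns(df)` on the `df` loaded above returns a list of dictionaries that can be printed or passed to `pandas.DataFrame(...)` for display. For wide tables, a single aggregation over all columns would likely be faster than this per-column loop.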
