Data Profiling using PySpark
Question
I'm trying to create a PySpark function that takes a DataFrame as input and returns a data-profile report. I have already used the describe and summary functions, which return results like min, max, count, etc., but I need a more detailed report that includes things like unique values, along with some visuals.
If anyone knows of anything that can help, feel free to comment below.
A dynamic function that produces the output described above would be helpful.
Answer 1
Score: 1
Option 1:
If the Spark DataFrame is not too big, you can convert it to pandas and try a pandas profiling library like sweetviz, e.g.:
import sweetviz as sv
my_report = sv.analyze(source=(data.toPandas(), "EDA Report"))
my_report.show_notebook()  # show the report in a notebook cell
my_report.show_html(filepath="report.html")  # write the report to an HTML file
The sweetviz documentation describes more features, such as how to compare populations.
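If the DataFrame turns out to be too big for toPandas(), one workaround is to profile a random sample instead. Below is a minimal sketch of that idea. The helper names sampling_fraction and to_pandas_sample, and the defaults for max_rows and seed, are my own assumptions and not part of sweetviz or PySpark. It only relies on the DataFrame methods count(), sample() and toPandas().

```python
def sampling_fraction(total_rows: int, max_rows: int) -> float:
    """Fraction of rows to keep so that roughly max_rows remain (capped at 1.0)."""
    if total_rows <= 0:
        return 1.0
    return min(1.0, max_rows / total_rows)


def to_pandas_sample(data, max_rows: int = 100_000, seed: int = 42):
    """Convert a Spark DataFrame to pandas, sampling first if it is large.

    `data` is assumed to be a pyspark.sql.DataFrame; max_rows and seed are
    illustrative defaults, not recommended values.
    """
    fraction = sampling_fraction(data.count(), max_rows)
    if fraction < 1.0:
        # sample() is approximate: the result has about max_rows rows, not exactly.
        data = data.sample(withReplacement=False, fraction=fraction, seed=seed)
    return data.toPandas()
```

You would then pass to_pandas_sample(data) to sv.analyze instead of data.toPandas(). Keep in mind that statistics such as unique-value counts computed on a sample will underestimate the values for the full dataset.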
Option 2:
Use a profiler that accepts a pyspark.sql.DataFrame directly, e.g. ydata-profiling.
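If you only need the numbers (for example unique values and nulls per column) and not a full visual report, they can also be computed directly in Spark. The following is a rough sketch of such a dynamic function, not something taken from either library. The names build_profile and profile_dataframe and the layout of the returned dictionary are hypothetical. It assumes that exact distinct counts are affordable on your data and that column names contain no backticks.

```python
def build_profile(total_rows, distinct_counts, null_counts):
    """Combine raw per-column aggregates into a readable profile dictionary."""
    profile = {}
    for column, distinct in distinct_counts.items():
        nulls = null_counts.get(column, 0)
        profile[column] = {
            "rows": total_rows,
            "unique_values": distinct,
            "nulls": nulls,
            "null_fraction": nulls / total_rows if total_rows else 0.0,
        }
    return profile


def profile_dataframe(df):
    """Profile every column of a Spark DataFrame with a single aggregation."""
    # Imported lazily so build_profile can be used without Spark installed.
    from pyspark.sql import functions as F

    aggregates = [F.count(F.lit(1)).alias("total_rows")]
    for i, name in enumerate(df.columns):
        col = F.col(f"`{name}`")  # backticks guard names that contain dots
        # countDistinct ignores nulls, so nulls are counted separately.
        aggregates.append(F.countDistinct(col).alias(f"distinct_{i}"))
        aggregates.append(F.sum(col.isNull().cast("int")).alias(f"nulls_{i}"))

    row = df.agg(*aggregates).first()
    distinct_counts = {n: row[f"distinct_{i}"] for i, n in enumerate(df.columns)}
    # sum() returns None on an empty DataFrame, hence the "or 0".
    null_counts = {n: row[f"nulls_{i}"] or 0 for i, n in enumerate(df.columns)}
    return build_profile(row["total_rows"], distinct_counts, null_counts)
```

Calling profile_dataframe(df) returns one entry per column, which you can print or turn into a small pandas DataFrame for plotting. On very large data, swapping F.countDistinct for F.approx_count_distinct trades exactness for speed.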
Answer 2
Score: 0
ydata-profiling currently supports Spark DataFrames, so it should be the most suitable choice:
from pyspark.sql import SparkSession
from ydata_profiling import ProfileReport
spark = SparkSession \
.builder \
.appName("Python Spark profiling example") \
.getOrCreate()
df = spark.read.csv("{insert-csv-file-path}")
df.printSchema()
report = ProfileReport(df, title="Profiling pyspark DataFrame")
report.to_file('profile.html')
An example report looks like this: https://ydata-profiling.ydata.ai/examples/master/census/census_report.html