2023-05-29 17:04:17

missing columns for ydata-profiling correlation report

Question

I'm using ydata-profiling (the successor to pandas-profiling) to compute correlations among the columns of large datasets (e.g. 400,411 rows and 27 columns).
This is the relevant configuration in config.yaml:

correlations:
    pearson:
      calculate: false
      warn_high_correlations: false
      threshold: 0.9
    spearman:
      calculate: true
      warn_high_correlations: false
      threshold: 0.9
    kendall:
      calculate: false
      warn_high_correlations: false
      threshold: 0.9
    phi_k:
      calculate: false
      warn_high_correlations: false
      threshold: 0.9
    cramers:
      calculate: true
      warn_high_correlations: false
      threshold: 0.9
    auto:
      calculate: false
      warn_high_correlations: false
      threshold: 0.9

I only need Spearman for the numerical columns and Cramér's V for the categorical ones.
When I run

tmp_profiler = ydata_profiling.ProfileReport(df, config_file='config.yaml')

it computes Spearman correctly, but skips many of the categorical columns in the Cramér's V matrix (with other datasets of the same size it skips all of them).
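To cross-check a pair of columns that the report skips, the statistic can be computed by hand. The following is a minimal sketch of the plain (uncorrected) Cramér's V using only the standard library. The function name `cramers_v` and the choice to treat `None` as missing are assumptions made for illustration, and the library's own implementation may apply a bias correction, so the values need not match the report exactly.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Plain Cramér's V between two equally long sequences of labels.

    Pairs where either value is None are dropped (an assumption of this
    sketch). Returns NaN when either variable has fewer than two levels.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return float("nan")

    row_totals = Counter(x for x, _ in pairs)
    col_totals = Counter(y for _, y in pairs)
    observed = Counter(pairs)

    k = min(len(row_totals), len(col_totals))
    if k < 2:
        return float("nan")

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for x, rx in row_totals.items():
        for y, cy in col_totals.items():
            expected = rx * cy / n
            chi2 += (observed.get((x, y), 0) - expected) ** 2 / expected

    return sqrt(chi2 / (n * (k - 1)))


perfect = cramers_v(["a", "a", "b", "b"], ["u", "u", "v", "v"])
independent = cramers_v(["a", "a", "b", "b"], ["u", "v", "u", "v"])
```

With a pandas DataFrame, the two columns could be passed as lists after replacing NaN with `None`. If a pair yields a sensible value here but is absent from the report, the columns were probably excluded before the correlation step rather than failing inside it.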

I thought it was due to the large amount of missing data, so I tried filling the NaN values in those columns with empty strings. It didn't work.
I don't think it is due to the configuration, since I tried raising all the relevant limits:

vars:
    cat:
        length: false
        characters: false
        words: false
        cardinality_threshold: 5000000
        n_obs: 5
        # Set to zero to disable
        chi_squared_threshold: 0.0
        coerce_str_to_date: false
        redact: false
        histogram_largest: 10
        stop_words: []
...
# For categorical
categorical_maximum_correlation_distinct: 10000000

report:
  precision: 1000

UPDATE: even if I use

tmp_profiler = ydata_profiling.ProfileReport(df.sample(100000), config_file='config.yaml')

the issue is the same.

Does anyone have an explanation and a solution for this behaviour?

Answer 1

Score: 0

Are the columns with missing values correctly identified as categorical? One cause of the problem could be type inference. If your categorical columns have high cardinality, they may be inferred as text or another type.

One way to override the inferred types:

from ydata_profiling import ProfileReport

# Force the listed columns to be treated as categorical instead of relying on type inference
prof = ProfileReport(
    df,
    config_file="config.yaml",
    type_schema={
        "column_1": "categorical",
        "column_2": "categorical",
    }
)
prof.to_file("profile.html")

Once the features are treated as categorical, they should appear in the correlations. It is also possible that your columns have so much missing data that your sample does not return any valid data...
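Both suspects named above, high cardinality and heavy missingness, can be checked directly in pandas before touching the profiler. This is a minimal sketch; the helper name `column_summary` and the tiny example frame are made up for illustration and are not part of ydata-profiling.

```python
import pandas as pd


def column_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, number of distinct non-missing values, and missing fraction."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_distinct": df.nunique(dropna=True),
            "missing_fraction": df.isna().mean(),
        }
    )


# Hypothetical data standing in for the real dataset.
example = pd.DataFrame(
    {
        "city": ["Rome", "Milan", None, "Rome"],
        "amount": [1.0, 2.0, 3.0, 4.0],
    }
)
summary = column_summary(example)
print(summary)
```

Columns whose distinct count is close to the number of rows, or whose missing fraction is close to 1, are the likeliest candidates for being inferred as something other than categorical, and so are the ones worth listing in `type_schema`.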
