Generate data from the results of .describe() in pandas

Question

I don't have access to the actual dataset, but I do have access to the output of the DataFrame's .describe(). I need to construct some data that has similar statistics. Is there a way to generate random data that matches those statistics? That is, I would ideally like a function data_gen(pd.DataFrame) -> pd.DataFrame such that

data_gen(df.describe()).describe() == df.describe()

I understand that it won't capture the dependencies between columns, but I will work with what I can get.
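Because the generated data is random, the exact equality in the check above is unlikely to hold for rows such as mean, std, and the quartiles, so a tolerance-based comparison may be more practical. A minimal sketch, assuming a hypothetical helper named describe_close and a relative tolerance picked purely for illustration:

```python
import numpy as np
import pandas as pd

def describe_close(generated: pd.DataFrame, target_desc: pd.DataFrame,
                   rtol: float = 0.05) -> pd.DataFrame:
    """Compare generated.describe() with target_desc entry by entry.

    Hypothetical helper: returns a boolean DataFrame that is True wherever
    the two statistics agree within the relative tolerance rtol.
    """
    # Align rows and columns with the target so the arrays line up
    got = generated.describe().reindex(index=target_desc.index,
                                       columns=target_desc.columns)
    close = np.isclose(got.to_numpy(dtype=float),
                       target_desc.to_numpy(dtype=float),
                       rtol=rtol)
    return pd.DataFrame(close, index=target_desc.index,
                        columns=target_desc.columns)
```

Calling describe_close(generated_df, desc).all().all() then gives a single pass/fail answer, while the full boolean frame shows which statistics are off.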

Answer 1

Score: 1

Building on the code from this answer, you can create a DataFrame out_df based on the count, mean, standard deviation, minimum, and maximum reported for df, although not the quartiles. The larger the number of data points (count), the more closely each column's statistics will match.

import pandas as pd
import scipy.stats

def my_distribution(min_val, max_val, mean, std):
    """Return a beta distribution rescaled to [min_val, max_val] with the given mean and std."""
    scale = max_val - min_val
    location = min_val
    # Mean and variance of the unscaled beta distribution on [0, 1]
    unscaled_mean = (mean - min_val) / scale
    unscaled_var = (std / scale) ** 2
    # alpha and beta follow from the beta mean and variance formulas:
    # with t = alpha / beta, mean = t / (1 + t)
    # and var = t / ((1 + t)**2 * (beta * (1 + t) + 1))
    t = unscaled_mean / (1 - unscaled_mean)
    beta = ((t / unscaled_var) - (t * t) - (2 * t) - 1) / ((t * t * t) + (3 * t * t) + (3 * t) + 1)
    alpha = beta * t
    # Not all parameters produce a valid distribution
    if alpha <= 0 or beta <= 0:
        raise ValueError('Cannot create distribution for the given parameters.')
    # Make scaled beta distribution with computed parameters
    return scipy.stats.beta(alpha, beta, scale=scale, loc=location)

# df is the original DataFrame; if only the describe() output is available,
# assign that output to desc directly instead
desc = df.describe()
columns = {}
for col in desc.columns:
    dist = my_distribution(desc.loc["min", col],
                           desc.loc["max", col],
                           desc.loc["mean", col],
                           desc.loc["std", col])
    # Draw one sample per non-null value reported by describe()
    columns[col] = pd.Series(dist.rvs(int(desc.loc["count", col])))
# Series of different lengths are aligned on the index and padded with NaN
out_df = pd.DataFrame(columns)

# Difference between the generated and the target statistics
out_df.describe().sub(desc)
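The beta-based approach above does not use the 25%, 50%, and 75% rows. One possible alternative is inverse-transform sampling from a piecewise-linear quantile function drawn through the five reported quantiles (min, 25%, 50%, 75%, max). This should roughly reproduce the quartiles, but in general not the mean or std. A sketch under that assumption; sample_from_quantiles and data_gen_quantiles are hypothetical names, not part of pandas or the original answer:

```python
import numpy as np
import pandas as pd

# Rows of describe() that are quantiles, with their probabilities
QUANTILE_ROWS = ["min", "25%", "50%", "75%", "max"]
QUANTILE_PROBS = [0.0, 0.25, 0.5, 0.75, 1.0]

def sample_from_quantiles(quantiles, n, rng):
    """Draw n values via inverse-transform sampling.

    The quantile function is assumed to be piecewise linear between
    the five given quantiles (an illustrative modelling choice).
    """
    u = rng.uniform(0.0, 1.0, size=n)
    # np.interp evaluates the piecewise-linear quantile function at u
    return np.interp(u, QUANTILE_PROBS, quantiles)

def data_gen_quantiles(desc: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Generate one column per column of desc, matching its quantile rows."""
    rng = np.random.default_rng(seed)
    columns = {}
    for col in desc.columns:
        q = desc.loc[QUANTILE_ROWS, col].to_numpy(dtype=float)
        n = int(desc.loc["count", col])
        columns[col] = pd.Series(sample_from_quantiles(q, n, rng))
    # Columns with different counts are padded with NaN
    return pd.DataFrame(columns)
```

Matching the mean and std together with the quartiles would need a more flexible distribution family than either sketch uses; this one trades the moments for the quantiles.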

huangapple
  • Published on May 29, 2023 at 01:13:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76352655.html