Question

I don't have access to the actual dataset, but I do have access to the output of the DataFrame's .describe(). I need to construct some data that has similar statistics. Is there a way to generate random data that matches those statistics? That is, I would ideally like a function data_gen(pd.DataFrame) -> pd.DataFrame such that

data_gen(df.describe()).describe() == df.describe()

I understand that this won't capture the dependencies between columns, but I will work with what I can get.

Answer 1

Score: 1

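Building on the code from this answer, you can create a DataFrame out_df that targets the count, mean, standard deviation, minimum and maximum reported for each column of df, although not the quartiles. Each column is drawn from a beta distribution scaled to the interval [min, max], so the sample statistics only approximate the targets; the larger the number of data points (count), the closer the match.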

import pandas as pd
import scipy.stats


def my_distribution(min_val, max_val, mean, std):
    """Return a beta distribution on [min_val, max_val] with the given mean and std."""
    scale = max_val - min_val
    location = min_val
    # Mean and variance of the unscaled beta distribution on [0, 1]
    unscaled_mean = (mean - min_val) / scale
    unscaled_var = (std / scale) ** 2
    # With t = alpha / beta = m / (1 - m), solving the beta variance formula
    # var = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1)) for beta gives:
    t = unscaled_mean / (1 - unscaled_mean)
    beta = ((t / unscaled_var) - (t * t) - (2 * t) - 1) / ((t * t * t) + (3 * t * t) + (3 * t) + 1)
    alpha = beta * t
    # Not all parameters may produce a valid distribution
    if alpha <= 0 or beta <= 0:
        raise ValueError('Cannot create distribution for the given parameters.')
    # Make scaled beta distribution with computed parameters
    return scipy.stats.beta(alpha, beta, scale=scale, loc=location)


# df is the original DataFrame; if only the summary is available, assign it to desc directly
desc = df.describe()
columns = {}
for col in desc.columns:
    dist = my_distribution(desc.loc["min", col],
                           desc.loc["max", col],
                           desc.loc["mean", col],
                           desc.loc["std", col])
    # Wrapping in a Series lets columns with different counts coexist (shorter ones are NaN-padded)
    columns[col] = pd.Series(dist.rvs(int(desc.loc["count", col])))
out_df = pd.DataFrame(columns)

# Difference between the generated and the target summary statistics
out_df.describe().sub(desc)
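
The beta-based approach above does not use the quartiles. If matching the 25%, 50% and 75% rows matters more than matching the mean and standard deviation, one possible alternative is inverse-transform sampling through a piecewise-linear CDF drawn through the five quantile points that .describe() reports. The following is only a sketch of that different technique, not part of the original answer: the names sample_from_quantiles and generate_from_describe, the fixed seed, and the hand-written example summary are all assumptions made for illustration. The mean and std rows are ignored here, so those statistics will generally not match.

```python
import numpy as np
import pandas as pd


def sample_from_quantiles(desc_col, size, rng):
    """Draw `size` values whose quartiles approximate those of one describe() column.

    Hypothetical helper: `desc_col` is a Series with the labels
    "min", "25%", "50%", "75%" and "max".
    """
    # Known points of the CDF: probabilities and the values reported for them.
    probs = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
    values = desc_col[["min", "25%", "50%", "75%", "max"]].to_numpy(dtype=float)
    # Inverse-transform sampling: push uniform draws through the
    # piecewise-linear inverse CDF defined by the five points.
    u = rng.uniform(0.0, 1.0, size)
    return np.interp(u, probs, values)


def generate_from_describe(desc, seed=0):
    """Build a DataFrame column by column from a describe() summary (assumed layout)."""
    rng = np.random.default_rng(seed)
    columns = {}
    for col in desc.columns:
        n = int(desc.loc["count", col])
        # A Series per column tolerates different counts across columns.
        columns[col] = pd.Series(sample_from_quantiles(desc[col], n, rng))
    return pd.DataFrame(columns)


# Example summary written by hand, standing in for a real df.describe() result.
desc = pd.DataFrame(
    {"x": [10000, 5.0, 2.0, 0.0, 3.5, 5.0, 6.5, 10.0]},
    index=["count", "mean", "std", "min", "25%", "50%", "75%", "max"],
)
fake = generate_from_describe(desc)
# Differences between generated and target statistics, row by row.
print(fake.describe().sub(desc))
```

Because the inverse CDF is linear between the reported points, the generated values always stay within [min, max] and the sample quartiles converge to the targets as count grows. The shape inside each quarter is assumed to be uniform, which is a simplification rather than anything the summary itself guarantees.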
