Generate data from the results of .describe() in pandas
Question
I don't have access to the actual dataset, but I do have access to the output of the DataFrame's .describe(). I need to construct some data that has similar statistics. Is there a way to generate random data that matches those statistics? Ideally, I would like a function data_gen(pd.DataFrame) -> pd.DataFrame such that
data_gen(df.describe()).describe() == df.describe()
I understand that this won't capture the dependencies between columns, but I will work with what I can get.
Answer 1
Score: 1
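Building on the code from this answer, you can create a dataframe out_df based on the characteristics of df. The code fits a rescaled beta distribution to each column's min, max, mean, and std; it does not reproduce the quartiles. The higher the number of data points (count), the more closely the statistics of the generated columns will match the originals.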
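For reference, the alpha and beta computation in my_distribution is a method-of-moments fit of a beta distribution on [0, 1]. The relations can be written out as follows, where m and v are shorthand introduced here for the rescaled mean and variance (unscaled_mean and unscaled_var in the code):

```latex
\begin{align*}
m &= \frac{\alpha}{\alpha+\beta}, \qquad
v = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}, \qquad
t = \frac{m}{1-m} = \frac{\alpha}{\beta} \\
v &= \frac{t}{(1+t)^2\,\bigl(\beta(1+t)+1\bigr)}
\quad\Longrightarrow\quad
\beta = \frac{t/v - (1+t)^2}{(1+t)^3}, \qquad \alpha = t\,\beta
\end{align*}
```

Expanding (1+t)^2 and (1+t)^3 gives the expression used in the code. Both parameters are positive only when 0 < m < 1 and v < m(1 - m), which is why some combinations of min, max, mean, and std raise the ValueError.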
import pandas as pd
import scipy.stats


def my_distribution(min_val, max_val, mean, std):
    """Return a beta distribution rescaled to [min_val, max_val] with the given mean and std."""
    scale = max_val - min_val
    location = min_val
    # Mean and variance of the unscaled beta distribution on [0, 1]
    unscaled_mean = (mean - min_val) / scale
    unscaled_var = (std / scale) ** 2
    # alpha and beta follow from the beta mean and variance formulas
    t = unscaled_mean / (1 - unscaled_mean)
    beta = ((t / unscaled_var) - (t * t) - (2 * t) - 1) / ((t * t * t) + (3 * t * t) + (3 * t) + 1)
    alpha = beta * t
    # Not all parameters produce a valid distribution
    if alpha <= 0 or beta <= 0:
        raise ValueError('Cannot create distribution for the given parameters.')
    # Make scaled beta distribution with computed parameters
    return scipy.stats.beta(alpha, beta, scale=scale, loc=location)


# df is the original DataFrame; if you only have the describe() output, assign it to desc directly
desc = df.describe()
samples = {}
for col in desc.columns:
    dist = my_distribution(desc.loc["min", col],
                           desc.loc["max", col],
                           desc.loc["mean", col],
                           desc.loc["std", col])
    # Draw as many values as the original column had non-null entries
    samples[col] = pd.Series(dist.rvs(int(desc.loc["count", col])))
# Building from Series lets columns with different counts coexist (shorter ones are padded with NaN)
out_df = pd.DataFrame(samples)

# Difference between the generated and the original statistics
out_df.describe().sub(desc)
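Since the beta-distribution approach does not reproduce the quartiles, here is a minimal sketch of a complementary idea: inverse-transform sampling through the five reported quantiles. The helper names sample_from_quartiles and data_gen_quartiles are hypothetical, not part of the answer above. The sketch assumes the quantile function is piecewise linear between min, 25%, 50%, 75%, and max, which is an arbitrary modelling choice, and it makes no attempt to match mean or std.

```python
import numpy as np
import pandas as pd


def sample_from_quartiles(col_desc, rng=None):
    """Draw values whose min, quartiles and max follow one describe() column.

    Hypothetical helper: assumes a piecewise-linear quantile function between
    the five reported points; mean and std are NOT matched.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = int(col_desc["count"])
    probs = [0.0, 0.25, 0.5, 0.75, 1.0]
    knots = [col_desc[k] for k in ("min", "25%", "50%", "75%", "max")]
    # Evenly spaced probabilities keep the sample quantiles close to the knots;
    # swap in rng.uniform(size=n) for independent random draws instead.
    u = np.linspace(0.0, 1.0, n)
    values = np.interp(u, probs, knots)
    rng.shuffle(values)  # remove the sorted order
    return values


def data_gen_quartiles(desc, seed=0):
    """Apply sample_from_quartiles to every column of a describe() table."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {col: pd.Series(sample_from_quartiles(desc[col], rng)) for col in desc.columns}
    )
```

Calling data_gen_quartiles(desc).describe().sub(desc) shows how far the generated statistics are from the targets. The quantile rows should be close to zero, while the mean and std rows generally will not be, so which approach fits better depends on which statistics matter most.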