How to create synthetic data based on real data?

# Question

I want to make synthetic data based on real data.

Data sample:

|   | session_id | session_date_time | session_status | mentor_domain_id | mentor_id | reg_date_mentor | region_id_mentor | mentee_id | reg_date_mentee | region_id_mentee |
|---|---|---|---|---|---|---|---|---|---|---|
| 5528 | 9165 | 2022-09-03 00:00:00 | finished | 5 | 20410 | 2022-04-28 00:00:00 | 6 | 11557 | 2021-05-15 00:00:00 | 3 |
| 2370 | 3891 | 2022-05-30 00:00:00 | canceled | 1 | 20879 | 2021-10-07 00:00:00 | 1 | 10154 | 2022-05-22 00:00:00 | 1 |
| 6473 | 10683 | 2022-09-15 00:00:00 | finished | 2 | 21457 | 2022-01-13 00:00:00 | 1 | 14505 | 2022-09-11 00:00:00 | 1 |
| 1671 | 2754 | 2022-04-22 00:00:00 | canceled | 6 | 21851 | 2021-08-24 00:00:00 | 1 | 13579 | 2021-09-12 00:00:00 | 2 |
| 324 | 527 | 2021-10-30 00:00:00 | finished | 1 | 22243 | 2021-07-04 00:00:00 | 1 | 14096 | 2021-10-10 00:00:00 | 10 |
| 4500 | 7453 | 2022-08-13 00:00:00 | finished | 4 | 22199 | 2021-12-02 00:00:00 | 5 | 11743 | 2021-11-01 00:00:00 | 8 |
| 2356 | 3875 | 2022-05-29 00:00:00 | finished | 2 | 21434 | 2022-04-29 00:00:00 | 4 | 14960 | 2021-12-12 00:00:00 | 0 |
| 2722 | 4491 | 2022-06-16 00:00:00 | finished | 2 | 21462 | 2022-06-05 00:00:00 | 7 | 12627 | 2021-02-23 00:00:00 | 2 |
| 6016 | 9929 | 2022-09-10 00:00:00 | finished | 1 | 20802 | 2021-08-07 00:00:00 | 1 | 10121 | 2022-07-30 00:00:00 | 1 |
| 4899 | 8121 | 2022-08-22 00:00:00 | finished | 1 | 24920 | 2021-10-19 00:00:00 | 5 | 12223 | 2022-07-04 00:00:00 | 4 |

This data comes from merged database tables, and I used it for my project. I already have many SQL queries, a few correlation matrices for this data, and one non-linear regression model.

First of all, I need to make new data with similar properties (I can't use the original data for my portfolio case). It would also be great if there were a way to generate data for a longer time period.

Where should I start? Can I solve this problem with sklearn.datasets?

P.S. I already tried Synthetic Data Vault and failed. I can't use Faker, because I need to keep the data structure.
# Answer 1
**Score:** 1

This is the best SDG (synthetic data generation) project out there, and it has a GUI: https://github.com/ydataai/ydata-synthetic/
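For reference, a rough sketch of training one of ydata-synthetic's tabular synthesizers, following the pattern in the project's quickstart; class names and parameters may differ between versions, and the column split below is an assumption based on the asker's sample, so treat this as a starting point rather than a definitive recipe:

```python
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
from ydata_synthetic.synthesizers.regular import RegularSynthesizer

# df is the asker's merged table; which columns are numerical vs.
# categorical is an assumption based on the sample above
num_cols = ['session_id', 'mentor_id', 'mentee_id']
cat_cols = ['session_status', 'mentor_domain_id',
            'region_id_mentor', 'region_id_mentee']

# CTGAN-style synthesizer, per the project's quickstart
synth = RegularSynthesizer(
    modelname='ctgan',
    model_parameters=ModelParameters(batch_size=500))
synth.fit(data=df,
          train_arguments=TrainParameters(epochs=300),
          num_cols=num_cols,
          cat_cols=cat_cols)

synthetic_df = synth.sample(1000)  # draw 1000 synthetic rows
```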
# Answer 2
**Score:** 0

I am not positive this is what you are looking for, but here is a way to use Faker to create sample data that conforms to specific criteria.
```python
from faker import Faker
import pandas as pd
import random as rnd

# instantiate Faker (the original snippet assumed this already existed)
fake = Faker()

dflen = 10
df1 = pd.DataFrame()
df1 = df1.assign(
    session_id=pd.Series(fake.unique.random_int(min=800, max=5000) for i in range(dflen)),
    session_date_time=pd.Series(fake.date_between_dates(pd.to_datetime('2022-01-01'), pd.to_datetime('2022-12-31')) for i in range(dflen)),
    session_status=pd.Series(rnd.choice(['Finished', 'Canceled']) for i in range(dflen)),
    mentor_domain_id=pd.Series(fake.unique.random_int(min=1, max=35) for i in range(dflen)),
    mentor_id=pd.Series(fake.unique.random_int(min=1000, max=3000) for i in range(dflen)),
    Reg_date_mentor=pd.Series(fake.date_between_dates(pd.to_datetime('2001-01-01'), pd.to_datetime('2013-12-31')) for i in range(dflen)),
    mentor_mentee_id=pd.Series(fake.unique.random_int(min=15, max=90) for i in range(dflen)))
df1
```
This will create a DataFrame of the form:

```
   session_id session_date_time session_status  mentor_domain_id  mentor_id Reg_date_mentor  mentor_mentee_id
0        2030        2022-04-27       Canceled                24       2546      2003-08-21                77
1        4721        2022-01-29       Canceled                26       1205      2003-09-11                60
2        4208        2022-11-15       Canceled                 5       1718      2010-08-10                38
3        1220        2022-02-11       Canceled                16       2864      2008-07-30                41
4        4268        2022-05-12       Canceled                30       2160      2009-08-20                67
5        3942        2022-06-02       Canceled                12       1776      2003-11-18                73
6        2229        2022-03-13       Canceled                20       2250      2003-12-28                37
7        1696        2022-06-07       Finished                31       2268      2010-06-04                44
8        3898        2022-11-03       Finished                 9       1331      2012-01-08                23
9        3761        2022-11-14       Canceled                29       1682      2008-09-09                47
```
You can further customize the data and create dependencies between one column and another, depending on your specific needs; a sketch of one such dependency follows below.
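For example, a natural dependency in this dataset is temporal: a session should not predate the mentor's registration. A minimal sketch, assuming the column names from the sample above (the date ranges are arbitrary):

```python
from faker import Faker
import pandas as pd

fake = Faker()
dflen = 10

# draw the mentor registration dates first
reg_dates = [fake.date_between_dates(pd.to_datetime('2021-01-01'),
                                     pd.to_datetime('2022-06-30'))
             for _ in range(dflen)]

# then draw each session date from the window between that row's
# registration date and the end of the observed period, so every
# session comes after the mentor registered
session_dates = [fake.date_between_dates(reg, pd.to_datetime('2022-12-31'))
                 for reg in reg_dates]

df2 = pd.DataFrame({'reg_date_mentor': reg_dates,
                    'session_date_time': session_dates})
```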
# Answer 3
**Score:** 0

I made the new data with GaussianCopulaSynthesizer from Synthetic Data Vault. I added some of its predefined constraint classes for some columns and ran conditional sampling to keep the properties of the original dataset.
```python
# Create metadata for the dataset (not a required step, since metadata
# is detected automatically); I updated the metadata for every column.
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)
metadata.update_column(
    column_name='session_id',
    sdtype='id',
    regex_format='[0-9]{6}')
metadata.validate()

# Create the synthesizer (this synthesizer works better for my data):
distributions = {
    'reg_date_mentee': 'uniform',
    'mentee_id': 'uniform'
}
synthesizer = GaussianCopulaSynthesizer(
    metadata,
    numerical_distributions=distributions)

# Add constraints to the synthesizer (rules that every row in the
# generated data must follow); I added constraints for most columns.
my_constraint_mentee_id = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'mentee_id',
        'low_value': 20001,
        'high_value': 21847,
        'strict_boundaries': False
    }
}
synthesizer.add_constraints(constraints=[
    my_constraint_mentee_id
])

# Fit the synthesizer on the merged table (df, as shown above):
synthesizer.fit(df)

# Build the conditions you need with Condition from sdv.sampling,
# collect them in a list, then sample with those conditions:
synthetic_data_with_conditions = synthesizer.sample_from_conditions(
    conditions=conditions)
```
I won't include the full code here, as it would take up too much space.
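To make the last step above reproducible, here is a minimal sketch of what the `conditions` list could look like, using `Condition` from `sdv.sampling`; the status values come from the sample data, while the row counts are arbitrary assumptions:

```python
from sdv.sampling import Condition

# example conditions: request a fixed mix of session outcomes
# (the 800/200 split is a hypothetical choice for illustration)
conditions = [
    Condition(num_rows=800, column_values={'session_status': 'finished'}),
    Condition(num_rows=200, column_values={'session_status': 'canceled'}),
]
```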