How to create synthetic data based on real data?
# Question
I want to make synthetic data based on real data.

Data sample:

| (index) | session_id | session_date_time | session_status | mentor_domain_id | mentor_id | reg_date_mentor | region_id_mentor | mentee_id | reg_date_mentee | region_id_mentee |
|---|---|---|---|---|---|---|---|---|---|---|
| 5528 | 9165 | 2022-09-03 00:00:00 | finished | 5 | 20410 | 2022-04-28 00:00:00 | 6 | 11557 | 2021-05-15 00:00:00 | 3 |
| 2370 | 3891 | 2022-05-30 00:00:00 | canceled | 1 | 20879 | 2021-10-07 00:00:00 | 1 | 10154 | 2022-05-22 00:00:00 | 1 |
| 6473 | 10683 | 2022-09-15 00:00:00 | finished | 2 | 21457 | 2022-01-13 00:00:00 | 1 | 14505 | 2022-09-11 00:00:00 | 1 |
| 1671 | 2754 | 2022-04-22 00:00:00 | canceled | 6 | 21851 | 2021-08-24 00:00:00 | 1 | 13579 | 2021-09-12 00:00:00 | 2 |
| 324 | 527 | 2021-10-30 00:00:00 | finished | 1 | 22243 | 2021-07-04 00:00:00 | 1 | 14096 | 2021-10-10 00:00:00 | 10 |
| 4500 | 7453 | 2022-08-13 00:00:00 | finished | 4 | 22199 | 2021-12-02 00:00:00 | 5 | 11743 | 2021-11-01 00:00:00 | 8 |
| 2356 | 3875 | 2022-05-29 00:00:00 | finished | 2 | 21434 | 2022-04-29 00:00:00 | 4 | 14960 | 2021-12-12 00:00:00 | 0 |
| 2722 | 4491 | 2022-06-16 00:00:00 | finished | 2 | 21462 | 2022-06-05 00:00:00 | 7 | 12627 | 2021-02-23 00:00:00 | 2 |
| 6016 | 9929 | 2022-09-10 00:00:00 | finished | 1 | 20802 | 2021-08-07 00:00:00 | 1 | 10121 | 2022-07-30 00:00:00 | 1 |
| 4899 | 8121 | 2022-08-22 00:00:00 | finished | 1 | 24920 | 2021-10-19 00:00:00 | 5 | 12223 | 2022-07-04 00:00:00 | 4 |

This data comes from merged database tables; I used it for my project.

I already have many SQL queries, a few correlation matrices for this data, and one non-linear regression model.

First of all, I need to make new data with similar properties (I can't use the original data for my portfolio case). It would also be great if there were a way to generate data for a longer time period.

Where should I start? Can I solve this problem with sklearn.datasets?

PS: I already tried Synthetic Data Vault and failed. I can't use Faker, because I need to keep the data structure.
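For context, one minimal baseline for "same structure, new values" is to bootstrap each column from its observed empirical values. The sketch below uses only the standard library and a hypothetical subset of the columns from the sample above; note that independent per-column resampling preserves marginal distributions but not cross-column correlations, so it is only a starting point, not a substitute for a real synthesizer.

```python
import random

# A small, hypothetical subset of the observed values per column.
observed = {
    "session_status": ["finished", "finished", "canceled", "finished"],
    "mentor_domain_id": [5, 1, 2, 6],
    "region_id_mentee": [3, 1, 1, 2],
}

def bootstrap_rows(columns, n_rows, seed=0):
    """Draw n_rows synthetic rows by resampling each column independently."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in columns.items()}
        for _ in range(n_rows)
    ]

rows = bootstrap_rows(observed, n_rows=5)
# Every synthetic value is one of the originally observed values.
assert all(row["session_status"] in {"finished", "canceled"} for row in rows)
```

Because each synthetic cell is drawn from the real column's values, the schema and the per-column value ranges are kept, which is the "keep data structure" requirement in its weakest form.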

# Answer 1

**Score**: 1

This is the best SDG project out there, and it has a GUI: https://github.com/ydataai/ydata-synthetic/

# Answer 2

**Score**: 0
I am not positive this is what you are looking for, but here is a way to use Faker to create sample data that conforms to specific criteria.

```python
import random as rnd

import pandas as pd
from faker import Faker

fake = Faker()  # the Faker instance was missing from the original snippet
dflen = 10

df1 = pd.DataFrame()
df1 = df1.assign(
    session_id=pd.Series(fake.unique.random_int(min=800, max=5000) for i in range(dflen)),
    session_date_time=pd.Series(
        fake.date_between_dates(pd.to_datetime('2022-01-01'), pd.to_datetime('2022-12-31'))
        for i in range(dflen)),
    session_status=pd.Series(rnd.choice(['Finished', 'Canceled']) for i in range(dflen)),
    mentor_domain_id=pd.Series(fake.unique.random_int(min=1, max=35) for i in range(dflen)),
    mentor_id=pd.Series(fake.unique.random_int(min=1000, max=3000) for i in range(dflen)),
    Reg_date_mentor=pd.Series(
        fake.date_between_dates(pd.to_datetime('2001-01-01'), pd.to_datetime('2013-12-31'))
        for i in range(dflen)),
    mentor_mentee_id=pd.Series(fake.unique.random_int(min=15, max=90) for i in range(dflen)),
)

df1
```

This will create a df of the form:

| | session_id | session_date_time | session_status | mentor_domain_id | mentor_id | Reg_date_mentor | mentor_mentee_id |
|---|---|---|---|---|---|---|---|
| 0 | 2030 | 2022-04-27 | Canceled | 24 | 2546 | 2003-08-21 | 77 |
| 1 | 4721 | 2022-01-29 | Canceled | 26 | 1205 | 2003-09-11 | 60 |
| 2 | 4208 | 2022-11-15 | Canceled | 5 | 1718 | 2010-08-10 | 38 |
| 3 | 1220 | 2022-02-11 | Canceled | 16 | 2864 | 2008-07-30 | 41 |
| 4 | 4268 | 2022-05-12 | Canceled | 30 | 2160 | 2009-08-20 | 67 |
| 5 | 3942 | 2022-06-02 | Canceled | 12 | 1776 | 2003-11-18 | 73 |
| 6 | 2229 | 2022-03-13 | Canceled | 20 | 2250 | 2003-12-28 | 37 |
| 7 | 1696 | 2022-06-07 | Finished | 31 | 2268 | 2010-06-04 | 44 |
| 8 | 3898 | 2022-11-03 | Finished | 9 | 1331 | 2012-01-08 | 23 |
| 9 | 3761 | 2022-11-14 | Canceled | 29 | 1682 | 2008-09-09 | 47 |

You can further customize the data and create dependencies between one column and another, depending on your specific needs.
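That last point, making one column depend on another, can be sketched with only the standard library. The rule below (a mentor's registration date must fall strictly before the session date) and the date ranges are hypothetical illustrations, not taken from the answer above:

```python
import random
from datetime import date, timedelta

def random_date(rng, start, end):
    """Uniformly random date in the inclusive range [start, end]."""
    return start + timedelta(days=rng.randrange((end - start).days + 1))

def make_row(rng):
    # Draw the session date first, then force the registration date
    # to fall strictly before it: this is the column dependency.
    session_date = random_date(rng, date(2022, 1, 1), date(2022, 12, 31))
    reg_date_mentor = random_date(rng, date(2021, 1, 1), session_date - timedelta(days=1))
    return {"session_date_time": session_date, "reg_date_mentor": reg_date_mentor}

rng = random.Random(42)
rows = [make_row(rng) for _ in range(10)]
assert all(r["reg_date_mentor"] < r["session_date_time"] for r in rows)
```

The same pattern (draw the independent column first, then constrain the dependent one to a range derived from it) applies equally when the values come from Faker providers.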



# Answer 3

**Score**: 0

I make new data with the GaussianCopulaSynthesizer from the Synthetic Data Vault.
I add some Predefined Constraint Classes for some columns and run conditional sampling to keep the properties of the original dataset.

```python
# Create metadata for the dataset (not a required step, since metadata
# is detected automatically). I updated the metadata for every column.
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

metadata.update_column(
    column_name='session_id',
    sdtype='id',
    regex_format='[0-9]{6}')
metadata.validate()

# Create the synthesizer (this synthesizer works better for my data):
distributions = {
    'reg_date_mentee': 'uniform',
    'mentee_id': 'uniform'
}

synthesizer = GaussianCopulaSynthesizer(
    metadata,
    numerical_distributions=distributions)

# Add constraints to the synthesizer (rules that every row in the data
# must follow). I added constraints for most columns.
my_constraint_mentee_id = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'mentee_id',
        'low_value': 20001,
        'high_value': 21847,
        'strict_boundaries': False
    }
}

synthesizer.add_constraints(constraints=[
    my_constraint_mentee_id
])

# Fit the synthesizer.
synthesizer.fit(sessions_and_users1)

# Build the conditions you need with Condition from sdv.sampling;
# all conditions are kept in a list.

# Sample data that satisfies the conditions.
synthetic_data_with_conditions = synthesizer.sample_from_conditions(
    conditions=conditions)
```

I won't add the full code, as it would take up too much space.

huangapple
  • Published 2023-06-26 18:06:43
  • Please keep this link when reposting: https://go.coder-hub.com/76555652.html