How to create synthetic data based on real data?
# Question
I want to make synthetic data based on real data.

Data sample:

| (index) | session_id | session_date_time | session_status | mentor_domain_id | mentor_id | reg_date_mentor | region_id_mentor | mentee_id | reg_date_mentee | region_id_mentee |
|---|---|---|---|---|---|---|---|---|---|---|
| 5528 | 9165 | 2022-09-03 00:00:00 | finished | 5 | 20410 | 2022-04-28 00:00:00 | 6 | 11557 | 2021-05-15 00:00:00 | 3 |
| 2370 | 3891 | 2022-05-30 00:00:00 | canceled | 1 | 20879 | 2021-10-07 00:00:00 | 1 | 10154 | 2022-05-22 00:00:00 | 1 |
| 6473 | 10683 | 2022-09-15 00:00:00 | finished | 2 | 21457 | 2022-01-13 00:00:00 | 1 | 14505 | 2022-09-11 00:00:00 | 1 |
| 1671 | 2754 | 2022-04-22 00:00:00 | canceled | 6 | 21851 | 2021-08-24 00:00:00 | 1 | 13579 | 2021-09-12 00:00:00 | 2 |
| 324 | 527 | 2021-10-30 00:00:00 | finished | 1 | 22243 | 2021-07-04 00:00:00 | 1 | 14096 | 2021-10-10 00:00:00 | 10 |
| 4500 | 7453 | 2022-08-13 00:00:00 | finished | 4 | 22199 | 2021-12-02 00:00:00 | 5 | 11743 | 2021-11-01 00:00:00 | 8 |
| 2356 | 3875 | 2022-05-29 00:00:00 | finished | 2 | 21434 | 2022-04-29 00:00:00 | 4 | 14960 | 2021-12-12 00:00:00 | 0 |
| 2722 | 4491 | 2022-06-16 00:00:00 | finished | 2 | 21462 | 2022-06-05 00:00:00 | 7 | 12627 | 2021-02-23 00:00:00 | 2 |
| 6016 | 9929 | 2022-09-10 00:00:00 | finished | 1 | 20802 | 2021-08-07 00:00:00 | 1 | 10121 | 2022-07-30 00:00:00 | 1 |
| 4899 | 8121 | 2022-08-22 00:00:00 | finished | 1 | 24920 | 2021-10-19 00:00:00 | 5 | 12223 | 2022-07-04 00:00:00 | 4 |

This data comes from merged database tables; I used it for my project.

I already have many SQL queries, a few correlation matrices for this data, and one non-linear regression model.

First of all, I need to make new data with similar properties (I can't use the original data for my portfolio case). It would also be great if there were a way to generate data for a longer time period.

Where should I start? Can I solve this problem with sklearn.datasets?

PS: I already tried Synthetic Data Vault and failed. I can't use Faker, because I need to keep the data structure.
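For context, one minimal baseline for "same structure, new values" is to bootstrap each column from its observed empirical values. The sketch below uses only the standard library and a hypothetical subset of the columns from the sample above; note that independent per-column resampling preserves marginal distributions but not cross-column correlations, so it is only a starting point, not a substitute for a real synthesizer.

```python
import random

# A small, hypothetical subset of the observed values per column.
observed = {
    "session_status": ["finished", "finished", "canceled", "finished"],
    "mentor_domain_id": [5, 1, 2, 6],
    "region_id_mentee": [3, 1, 1, 2],
}

def bootstrap_rows(columns, n_rows, seed=0):
    """Draw n_rows synthetic rows by resampling each column independently."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in columns.items()}
        for _ in range(n_rows)
    ]

rows = bootstrap_rows(observed, n_rows=5)
# Every synthetic value is one of the originally observed values.
assert all(row["session_status"] in {"finished", "canceled"} for row in rows)
```

Because each synthetic cell is drawn from the real column's values, the schema and the per-column value ranges are kept, which is the "keep data structure" requirement in its weakest form.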

# Answer 1

**Score**: 1

This is the best SDG project out there, and it has a GUI: https://github.com/ydataai/ydata-synthetic/

# Answer 2

**Score**: 0
I am not positive this is what you are looking for, but here is a way to use Faker to create sample data that conforms to specific criteria.

```python
import random as rnd

import pandas as pd
from faker import Faker

fake = Faker()  # the Faker instance was missing from the original snippet
dflen = 10

df1 = pd.DataFrame()
df1 = df1.assign(
    session_id=pd.Series(fake.unique.random_int(min=800, max=5000) for i in range(dflen)),
    session_date_time=pd.Series(
        fake.date_between_dates(pd.to_datetime('2022-01-01'), pd.to_datetime('2022-12-31'))
        for i in range(dflen)),
    session_status=pd.Series(rnd.choice(['Finished', 'Canceled']) for i in range(dflen)),
    mentor_domain_id=pd.Series(fake.unique.random_int(min=1, max=35) for i in range(dflen)),
    mentor_id=pd.Series(fake.unique.random_int(min=1000, max=3000) for i in range(dflen)),
    Reg_date_mentor=pd.Series(
        fake.date_between_dates(pd.to_datetime('2001-01-01'), pd.to_datetime('2013-12-31'))
        for i in range(dflen)),
    mentor_mentee_id=pd.Series(fake.unique.random_int(min=15, max=90) for i in range(dflen)),
)

df1
```

This will create a df of the form:

| | session_id | session_date_time | session_status | mentor_domain_id | mentor_id | Reg_date_mentor | mentor_mentee_id |
|---|---|---|---|---|---|---|---|
| 0 | 2030 | 2022-04-27 | Canceled | 24 | 2546 | 2003-08-21 | 77 |
| 1 | 4721 | 2022-01-29 | Canceled | 26 | 1205 | 2003-09-11 | 60 |
| 2 | 4208 | 2022-11-15 | Canceled | 5 | 1718 | 2010-08-10 | 38 |
| 3 | 1220 | 2022-02-11 | Canceled | 16 | 2864 | 2008-07-30 | 41 |
| 4 | 4268 | 2022-05-12 | Canceled | 30 | 2160 | 2009-08-20 | 67 |
| 5 | 3942 | 2022-06-02 | Canceled | 12 | 1776 | 2003-11-18 | 73 |
| 6 | 2229 | 2022-03-13 | Canceled | 20 | 2250 | 2003-12-28 | 37 |
| 7 | 1696 | 2022-06-07 | Finished | 31 | 2268 | 2010-06-04 | 44 |
| 8 | 3898 | 2022-11-03 | Finished | 9 | 1331 | 2012-01-08 | 23 |
| 9 | 3761 | 2022-11-14 | Canceled | 29 | 1682 | 2008-09-09 | 47 |

You can further customize the data and create dependencies between one column and another, depending on your specific needs.
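That last point, making one column depend on another, can be sketched with only the standard library. The rule below (a mentor's registration date must fall strictly before the session date) and the date ranges are hypothetical illustrations, not taken from the answer above:

```python
import random
from datetime import date, timedelta

def random_date(rng, start, end):
    """Uniformly random date in the inclusive range [start, end]."""
    return start + timedelta(days=rng.randrange((end - start).days + 1))

def make_row(rng):
    # Draw the session date first, then force the registration date
    # to fall strictly before it: this is the column dependency.
    session_date = random_date(rng, date(2022, 1, 1), date(2022, 12, 31))
    reg_date_mentor = random_date(rng, date(2021, 1, 1), session_date - timedelta(days=1))
    return {"session_date_time": session_date, "reg_date_mentor": reg_date_mentor}

rng = random.Random(42)
rows = [make_row(rng) for _ in range(10)]
assert all(r["reg_date_mentor"] < r["session_date_time"] for r in rows)
```

The same pattern (draw the independent column first, then constrain the dependent one to a range derived from it) applies equally when the values come from Faker providers.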



# Answer 3

**Score**: 0

I make new data with the GaussianCopulaSynthesizer from the Synthetic Data Vault.
I add some Predefined Constraint Classes for some columns and run conditional sampling to keep the properties of the original dataset.

```python
# Create metadata for the dataset (not a required step, since metadata
# is detected automatically). I updated the metadata for every column.
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

metadata.update_column(
    column_name='session_id',
    sdtype='id',
    regex_format='[0-9]{6}')
metadata.validate()

# Create the synthesizer (this synthesizer works better for my data):
distributions = {
    'reg_date_mentee': 'uniform',
    'mentee_id': 'uniform'
}

synthesizer = GaussianCopulaSynthesizer(
    metadata,
    numerical_distributions=distributions)

# Add constraints to the synthesizer (rules that every row in the data
# must follow). I added constraints for most columns.
my_constraint_mentee_id = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'mentee_id',
        'low_value': 20001,
        'high_value': 21847,
        'strict_boundaries': False
    }
}

synthesizer.add_constraints(constraints=[
    my_constraint_mentee_id
])

# Fit the synthesizer.
synthesizer.fit(sessions_and_users1)

# Build the conditions you need with Condition from sdv.sampling;
# all conditions are kept in a list.

# Sample data that satisfies the conditions.
synthetic_data_with_conditions = synthesizer.sample_from_conditions(
    conditions=conditions)
```

I won't add the full code, as it would take up too much space.

huangapple
  • Published 2023-06-26 18:06:43
  • Please keep this link when reposting: https://go.coder-hub.com/76555652.html