March 7, 2023 02:47:25

Create pandas dataframe column with random conditional numbers

Question


I have created the following pandas dataframe.

import pandas as pd
import numpy as np

ds = {'col1' : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
      'col2' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

data = pd.DataFrame(data=ds)

which looks like this:

print(data)

    col1  col2
0      1     0
1      1     0
2      1     0
3      1     0
4      1     0
5      1     0
6      1     0
7      1     0
8      1     0
9      1     0
10     1     0
11     1     0
12     1     0
13     1     0
14     2     1
15     2     1
16     2     1
17     2     1
18     2     1
19     2     1
20     2     1
21     2     1
22     2     1
23     2     1
24     2     1
25     2     1
26     2     1
27     2     1

I need to create a new column (called col3) subject to the following conditions:

when col1 = 1, there are 14 records for which col2 = 0. The new column (i.e. col3), needs to have 50% (of exactly those 14 records) of the values equal to col2 (randomly distributed across the 14 records) and the remaining 50% equal to 1.
when col1 = 2, there are 14 records for which col2 = 1. The new column (i.e. col3), needs to have 50% (of exactly those 14 records) of the values equal to col2 (randomly distributed across the 14 records) and the remaining 50% equal to 0.

So, the resulting dataset would look like this (bear in mind that the location - or record - of the values in col3 is randomly assigned):

Does anyone know the Python code to produce such a dataframe?

Answer 1

Score: 2


`groupby` + `sample`

# take a sample of 50% from col2 per unique value in col1
data['col3'] = data.groupby('col1')['col2'].sample(frac=.5)

# fill the remaining 50% using a predefined mapping of col1 value
# (astype(int) stands in for fillna's downcast='infer' argument, which recent pandas releases deprecate)
data['col3'] = data['col3'].fillna(data['col1'].map({1: 1, 2: 0})).astype(int)

Result

    col1  col2  col3
0      1     0     1
1      1     0     0
2      1     0     0
3      1     0     0
4      1     0     0
5      1     0     0
6      1     0     0
7      1     0     1
8      1     0     0
9      1     0     1
10     1     0     1
11     1     0     1
12     1     0     1
13     1     0     1
14     2     1     1
15     2     1     0
16     2     1     1
17     2     1     0
18     2     1     0
19     2     1     1
20     2     1     1
21     2     1     0
22     2     1     1
23     2     1     0
24     2     1     0
25     2     1     1
26     2     1     1
27     2     1     0
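
As a quick check on the approach above, the sketch below reruns it with a fixed seed and counts, per col1 group, how many col3 values equal col2. The `random_state=0` argument and the `matches` name are assumptions added for illustration; the answer itself does not set a seed.

```python
import pandas as pd

data = pd.DataFrame({'col1': [1] * 14 + [2] * 14,
                     'col2': [0] * 14 + [1] * 14})

# random_state is an addition for reproducibility; the answer above does not set it
sampled = data.groupby('col1')['col2'].sample(frac=.5, random_state=0)

# assignment aligns on the index, so rows outside the sample become NaN
data['col3'] = sampled
data['col3'] = data['col3'].fillna(data['col1'].map({1: 1, 2: 0})).astype(int)

# per col1 group, count the rows where col3 equals col2
matches = (data['col3'] == data['col2']).groupby(data['col1']).sum()
print(matches.tolist())  # [7, 7]
```

Each group has 14 rows, so exactly 7 matches per group corresponds to the 50% requirement in the question.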

Answer 2

Score: 1


I would use the df.sample() method to randomly split the rows matching each condition into subgroups, and then assign the values.

import pandas as pd

ds = {'col1' : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
      'col2' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

data = pd.DataFrame(data=ds)
data['col_3'] = 0

This sets up the dataframe and creates a new column, initialized to 0.

Now let's select by condition

cond_1 = data.loc[(data.col1 == 1) & (data.col2 == 0)]
cond_2 = data.loc[(data.col1 == 2) & (data.col2 == 1)]

Get a random 50% of the rows matching the first condition, and then get the remaining 50%

cond_1_A = cond_1.sample(frac=.5)
cond_1_B = cond_1.loc[cond_1.index.difference(cond_1_A.index)]
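
To see that this split really partitions the rows, here is a small sketch that checks the two halves are disjoint and together cover the original selection. The `random_state=1` seed is an assumption added only so the check is repeatable.

```python
import pandas as pd

data = pd.DataFrame({'col1': [1] * 14 + [2] * 14,
                     'col2': [0] * 14 + [1] * 14})

cond_1 = data.loc[(data.col1 == 1) & (data.col2 == 0)]

# seed added for repeatability; it is not part of the answer
cond_1_A = cond_1.sample(frac=.5, random_state=1)
# Index.difference returns the labels of cond_1 that were not sampled
cond_1_B = cond_1.loc[cond_1.index.difference(cond_1_A.index)]

overlap = cond_1_A.index.intersection(cond_1_B.index)
combined = cond_1_A.index.union(cond_1_B.index)
print(len(cond_1_A), len(cond_1_B), len(overlap))  # 7 7 0
```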

For each sub-group, set the value to 0 or 1

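# a single .loc call avoids chained assignment (data.col_3.loc[...] = ...),
# which may not update data when pandas copy-on-write is enabled
data.loc[cond_1_A.index, 'col_3'] = 0
data.loc[cond_1_B.index, 'col_3'] = 1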

Second Condition - Same Thing

cond_2_A = cond_2.sample(frac=.5)
cond_2_B = cond_2.loc[cond_2.index.difference(cond_2_A.index)]
data.loc[cond_2_A.index, 'col_3'] = 0
data.loc[cond_2_B.index, 'col_3'] = 1

That should do it.
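
If there are more than two groups, repeating the steps above by hand gets tedious. One possible generalization, sketched here under the assumption that col2 only ever holds 0 or 1, is to let groupby apply the same split to every group. The helper name `half_match` and the seed are hypothetical, not something from this answer.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'col1': [1] * 14 + [2] * 14,
                     'col2': [0] * 14 + [1] * 14})

rng = np.random.default_rng(0)  # seed is an assumption, used only for reproducibility

def half_match(group):
    """Hypothetical helper: keep col2 in a random half of the rows, flip it in the rest."""
    values = group.to_numpy()
    out = 1 - values  # assumes col2 holds only 0 or 1
    # pick half of the positions at random, without repeats, to keep the col2 value
    keep = rng.choice(len(values), size=len(values) // 2, replace=False)
    out[keep] = values[keep]
    return pd.Series(out, index=group.index)

data['col3'] = data.groupby('col1')['col2'].transform(half_match)

# per col1 group, count the rows where col3 equals col2
matches = (data['col3'] == data['col2']).groupby(data['col1']).sum()
print(matches.tolist())  # [7, 7]
```

With an odd group size, `len(values) // 2` rounds down, so the kept half would be one row smaller than the flipped half.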

Run 1

data
    col1  col2  col_3
0      1     0      0
1      1     0      1
2      1     0      1
3      1     0      0
4      1     0      1
5      1     0      1
6      1     0      0
7      1     0      0
8      1     0      0
9      1     0      1
10     1     0      0
11     1     0      0
12     1     0      1
13     1     0      1
14     2     1      1
15     2     1      1
16     2     1      0
17     2     1      0
18     2     1      0
19     2     1      0
20     2     1      1
21     2     1      0
22     2     1      0
23     2     1      1
24     2     1      1
25     2     1      1
26     2     1      0
27     2     1      1

Run 2

data
    col1  col2  col_3
0      1     0      1
1      1     0      0
2      1     0      1
3      1     0      1
4      1     0      1
5      1     0      0
6      1     0      1
7      1     0      0
8      1     0      0
9      1     0      0
10     1     0      1
11     1     0      0
12     1     0      0
13     1     0      1
14     2     1      1
15     2     1      0
16     2     1      0
17     2     1      1
18     2     1      1
19     2     1      1
20     2     1      1
21     2     1      0
22     2     1      1
23     2     1      0
24     2     1      0
25     2     1      1
26     2     1      0
27     2     1      0
