Shuffle pandas column while avoiding a condition
Question
import pandas as pd

data = {'Text1': ["All Vegetables are Plants",
                  "Cows are happy",
                  "Butterflies are really beautiful",
                  "I enjoy Mangoes",
                  "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables',
                  'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me',
                  'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}
df = pd.DataFrame(data)
# Shuffle data while ensuring Text1 and Text2 don't have similar relationships
# (attempt: shuffle all rows, then shuffle again within each Relationship group)
shuffled_df = df.sample(frac=1).reset_index(drop=True)
shuffled_df = shuffled_df.groupby('Relationship').apply(lambda x: x.sample(frac=1)).reset_index(drop=True)
print(shuffled_df)
I have a dataframe in which each row holds two similar sentences. A third column, Relationship, contains a string code that shows how the two texts are related. For instance:
P for Plant, V for Vegetables and F for Fruits. Also,
A for Animal, I for Insects and M for Mammals.
data = {'Text1': ["All Vegetables are Plants",
                  "Cows are happy",
                  "Butterflies are really beautiful",
                  "I enjoy Mangoes",
                  "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables',
                  'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me',
                  'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}
df = pd.DataFrame(data)
print(df)
>>>
| | Text1 | Text2 | Relationship |
|---|---|---|---|
| 0 | All Vegetables are Plants | Some Plants are good Vegetables | PV123 |
| 1 | Cows are happy | Cows are enjoying | AM4355 |
| 2 | Butterflies are really beautiful | Beautiful butterflies are delightful to watch | AI784 |
| 3 | I enjoy Mangoes | Mango pleases me | PF897 |
| 4 | Vegetables are green | Spinach is green | PV776 |
I want to train a BERT model on this data. However, I also need to create examples of dissimilar sentences. My plan is to give the dataset as it stands a label of 1, then shuffle Text2 and give the shuffled copy a label of 0. The problem is that random shuffling alone does not produce good dissimilar examples; I need to make use of the "Relationship" column.
How can I shuffle my data so that texts like "All Vegetables are Plants" and "Spinach is green" do not end up on the same row in Text1 and Text2 respectively?
Answer 1
Score: 1
There may be a more efficient method, but this works. The idea is to stack both text columns into a single column, take two random samples of it, and concatenate them side by side. Pairs whose relationship letters match exactly are dropped, and the intersection of the two letter codes becomes the new relationship (note that this intersection may be incomplete, since it can miss characteristics that were not captured by the codes in the initial dataframe).
# Assumes pandas is imported as pd and df is the dataframe from the question
# 1 column of all text, rather than two
df1 = pd.concat([df[["Text1", "Relationship"]].rename(columns={"Text1": "Text"}),
                 df[["Text2", "Relationship"]].rename(columns={"Text2": "Text"})],
                ignore_index=True, axis=0)
# Get letters only for relationships, e.g. "PV123" -> "PV"
df1["Relationship"] = df1["Relationship"].str.extract('^([a-zA-Z]+)', expand=False)
# take 2 random samples (each a full permutation of the 10 rows) and concatenate
out = pd.concat([df1.sample(10).reset_index(drop=True),
                 df1.sample(10).reset_index(drop=True)],
                axis=1, ignore_index=True)
# filter for not equal characteristics only; copy to avoid SettingWithCopyWarning
out = out.loc[out[1].ne(out[3])].copy()
# get similar characteristics (based on intersection of letters)
out["Relationship"] = out.apply(lambda row: "".join(sorted(set(row[1]) & set(row[3]))), axis=1)
# required columns only
out = out[[0, 2, "Relationship"]].rename(columns={0: "Text1", 2: "Text2"})
Example output:
out.to_dict()
# Out[]:
# {'Text1': {0: 'Cows are enjoying',
# 2: 'Cows are happy',
# 3: 'Spinach is green',
# 4: 'Vegetables are green',
# 5: 'I enjoy Mangoes',
# 6: 'Beautiful butterflies are delightful to watch',
# 7: 'Mango pleases me'},
# 'Text2': {0: 'Vegetables are green',
# 2: 'I enjoy Mangoes',
# 3: 'Beautiful butterflies are delightful to watch',
# 4: 'Mango pleases me',
# 5: 'Cows are enjoying',
# 6: 'Some Plants are good Vegetables',
# 7: 'Cows are happy'},
# 'Relationship': {0: '', 2: '', 3: '', 4: "'P'", 5: '', 6: '', 7: ''}}
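The filter above only drops pairs whose letter codes are exactly equal, so a pair such as PV with PF, which still shares P, is kept. If such partial overlaps should also be excluded, one possible variation (a sketch, not part of the original answer) is to build every Text1/Text2 combination and keep only the pairs whose letter sets are disjoint. The column names `Letters` and `label` and the variable `negatives` are assumptions introduced for illustration, and `how="cross"` requires pandas 1.2 or later.

```python
import pandas as pd

data = {'Text1': ["All Vegetables are Plants", "Cows are happy",
                  "Butterflies are really beautiful", "I enjoy Mangoes",
                  "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables', 'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me', 'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}
df = pd.DataFrame(data)

# Letters-only prefix of each code, e.g. "PV123" -> "PV" (hypothetical helper column)
df["Letters"] = df["Relationship"].str.extract(r"^([A-Za-z]+)", expand=False)

# Pair every Text1 with every Text2; the shared "Letters" column gets suffixes
pairs = df[["Text1", "Letters"]].merge(df[["Text2", "Letters"]],
                                       how="cross", suffixes=("_1", "_2"))

# Keep only pairs that share no relationship letter at all
is_disjoint = pairs.apply(
    lambda r: set(r["Letters_1"]).isdisjoint(r["Letters_2"]), axis=1)
negatives = (pairs.loc[is_disjoint, ["Text1", "Text2"]]
             .assign(label=0)
             .reset_index(drop=True))
print(negatives)
```

From `negatives`, a random subset can then be drawn with `negatives.sample(n=...)` to balance the positive examples. With a large dataframe the cross join grows quadratically, so this is only practical for small or pre-sampled data.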
Answer 2
Score: 0
I resolved this by:
- Creating a new column with the first 2 letters from the relationship column.
- Using this new column to create a multi-index. A groupby on this new column should work here as well.
- Populating Text2 in each group with texts from other groups.
- Concatenating all the modified groups back together.
With this, I was able to create pairs that are genuinely semantically dissimilar.
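The answer does not include code, so the following is only a rough sketch of how the steps above might look using the groupby variant it mentions. The column names `Group` and `label`, the variable `negatives`, the fixed `random_state`, and the choice to sample with replacement are all assumptions made for illustration.

```python
import pandas as pd

data = {'Text1': ["All Vegetables are Plants", "Cows are happy",
                  "Butterflies are really beautiful", "I enjoy Mangoes",
                  "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables', 'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me', 'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}
df = pd.DataFrame(data)

# Step 1: new column with the first 2 letters of the relationship code
df["Group"] = df["Relationship"].str[:2]

parts = []
for name, group in df.groupby("Group"):
    # Step 3: candidate Text2 values come only from the other groups
    others = df.loc[df["Group"] != name, "Text2"]
    # Sample with replacement so a small pool of candidates is still enough
    sampled = others.sample(n=len(group), replace=True, random_state=0)
    parts.append(group.assign(Text2=sampled.to_numpy(), label=0))

# Step 4: concatenate the modified groups back together
negatives = pd.concat(parts).sort_index()
print(negatives[["Text1", "Text2", "label"]])
```

Because the grouping uses the full two-letter prefix, a PV row can still receive a PF text; grouping on the first letter only (`str[:1]`) would be a stricter alternative. The Relationship column kept in `negatives` still describes Text1 only and no longer applies to the pair.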