May 22, 2023, 22:05:06

Shuffle pandas column while avoiding a condition

Question

import pandas as pd

data = {'Text1': ["All Vegetables are Plants",
                   "Cows are happy",
                   "Butterflies are really beautiful",
                   "I enjoy Mangoes",
                   "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables',
                  'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me',
                  'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}
df = pd.DataFrame(data)
# Shuffle data while ensuring Text1 and Text2 don't have similar relationships
shuffled_df = df.sample(frac=1).reset_index(drop=True)
shuffled_df = shuffled_df.groupby('Relationship').apply(lambda x: x.sample(frac=1)).reset_index(drop=True)
print(shuffled_df)

I have a dataframe that shows 2 sentences are similar. This dataframe has a 3rd relationship column which also contains some strings. This 3rd column shows how similar the texts are. For instance:
P for Plant, V for Vegetables and F for Fruits. Also,
A for Animal, I for Insects and M for Mammals.

data = {'Text1': ["All Vegetables are Plants",
                   "Cows are happy",
                   "Butterflies are really beautiful",
                   "I enjoy Mangoes",
                   "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables',
                  'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me',
                  'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}
df = pd.DataFrame(data)
print(df)
>>>

	Text1	Text2	Relationship
0	All Vegetables are Plants	Some Plants are good Vegetables	PV123
1	Cows are happy	Cows are enjoying	AM4355
2	Butterflies are really beautiful	Beautiful butterflies are delightful to watch	AI784
3	I enjoy Mangoes	Mango pleases me	PF897
4	Vegetables are green	Spinach is green	PV776

I want to train a BERT model on this data. However, I also need to create examples of dissimilar sentences. My solution is to give a label of 1 to the dataset as it is, then shuffle Text2 and give the shuffled copy a label of 0. The problem is that I can't create good dissimilar examples just by random shuffling without making use of the "Relationship" column.

How can I shuffle my data so that texts like "All Vegetables are Plants" and "Spinach is green" do not end up on the same row in Text1 and Text2 respectively?

Answer 1

Score: 1

There may be a more efficient method, but this works. The logic is to build a single column of text, take two random samples of it, and concatenate them side by side. Pairs whose relationship letters are identical are dropped, and the intersection of the two letter codes becomes the new relationship. Note that this may not capture every relationship: two texts can share characteristics that the letter codes in the initial dataframe do not record.

# 1 column of all text, rather than two
df1 = pd.concat([df[["Text1", "Relationship"]].rename(columns={"Text1": "Text"}),
           df[["Text2", "Relationship"]].rename(columns={"Text2": "Text"})],
          ignore_index=True, axis=0)
# Get letters only for relationships
df1.Relationship = df1["Relationship"].str.extract('^([a-zA-Z]+)', expand=False)
# take 2 random samples and concatenate
out = pd.concat([df1.sample(10).reset_index(drop=True),
                 df1.sample(10).reset_index(drop=True)],
                axis=1, ignore_index=True)
# filter for not equal characteristics only
out = out.loc[out[1].ne(out[3])]
# get similar characteristics (based on intersection of letters)
out["Relationship"] = out.apply(lambda row: "".join(list(set(row[1]) & set(row[3]))), axis=1)
# required columns only
out = out[[0, 2, "Relationship"]].rename(columns={0: "Text1", 2: "Text2"})

Example output:

out.to_dict()
# Out[]:
# {'Text1': {0: 'Cows are enjoying',
#   2: 'Cows are happy',
#   3: 'Spinach is green',
#   4: 'Vegetables are green',
#   5: 'I enjoy Mangoes',
#   6: 'Beautiful butterflies are delightful to watch',
#   7: 'Mango pleases me'},
#  'Text2': {0: 'Vegetables are green',
#   2: 'I enjoy Mangoes',
#   3: 'Beautiful butterflies are delightful to watch',
#   4: 'Mango pleases me',
#   5: 'Cows are enjoying',
#   6: 'Some Plants are good Vegetables',
#   7: 'Cows are happy'},
#  'Relationship': {0: '', 2: '', 3: '', 4: "'P'", 5: '', 6: '', 7: ''}}
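
In the example output, one pair still carries a non-empty relationship because the filter only drops pairs whose letter codes are identical. If strictly dissimilar pairs are wanted for the label-0 set, one option is to keep only the rows whose intersection is empty. The following is a minimal sketch under assumptions: the `pairs` frame is a small illustrative stand-in with the same columns as `out`, and the `label` column name is hypothetical.

```python
import pandas as pd

# Illustrative stand-in with the same columns as `out` above (values are assumptions).
pairs = pd.DataFrame({
    "Text1": ["Vegetables are green", "Cows are happy", "Spinach is green"],
    "Text2": ["Mango pleases me", "I enjoy Mangoes", "Cows are enjoying"],
    "Relationship": ["P", "", ""],
})

# An empty Relationship means the two texts share no category letter,
# so only those rows are kept as negative (label 0) examples.
negatives = (
    pairs.loc[pairs["Relationship"].eq("")]
    .drop(columns="Relationship")
    .assign(label=0)
    .reset_index(drop=True)
)
print(negatives)
```

This trades quantity for cleanliness: fewer negative pairs survive, but none of them share a recorded characteristic.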

Answer 2

Score: 0

I resolved this by:

Creating a new column with the first 2 letters from the relationship column.
Using this new column to create a multi-index. A groupby on this new column should work here as well.
Populating Text2, for each group, with texts from other groups.
Concatenating all the newly modified groups back together.

With this, I was able to really create semantically dissimilar pairs.
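
The answer gives no code, so the following is only a rough sketch of how those steps might look, using the groupby variant rather than a multi-index. The `Group` and `label` column names, the helper `negatives_for`, the fixed seed, and sampling with replacement are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

data = {'Text1': ["All Vegetables are Plants", "Cows are happy",
                  "Butterflies are really beautiful", "I enjoy Mangoes",
                  "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables', 'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me', 'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}
df = pd.DataFrame(data)

# Step 1: group key made of the first two letters of Relationship.
df["Group"] = df["Relationship"].str[:2]

rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable

def negatives_for(group_key, group):
    """Replace Text2 in one group with texts drawn from all other groups."""
    pool = df.loc[df["Group"] != group_key, "Text2"].to_numpy()
    picked = rng.choice(pool, size=len(group), replace=True)
    return group.assign(Text2=picked, label=0)

# Steps 2-4: process each group, then concatenate the modified groups.
negatives = pd.concat(
    [negatives_for(key, grp) for key, grp in df.groupby("Group")],
    ignore_index=True,
)

# Original rows are the similar (label 1) examples.
train = pd.concat([df.assign(label=1), negatives], ignore_index=True)
print(train[["Text1", "Text2", "label"]])
```

Because the key is the full two-letter prefix, PV and PF count as different groups even though both start with P. If that is too permissive, the pool could instead be restricted to groups that share no letter with the current one.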
