Shuffle pandas column while avoiding a condition.

Question

I have a dataframe in which each row holds two similar sentences. A third column, Relationship, contains a string code that shows how the texts are similar. For instance:
P for Plant, V for Vegetables and F for Fruits. Also,
A for Animal, I for Insects and M for Mammals.

import pandas as pd

data = {'Text1': ["All Vegetables are Plants",
                  "Cows are happy",
                  "Butterflies are really beautiful",
                  "I enjoy Mangoes",
                  "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables',
                  'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me',
                  'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}

df = pd.DataFrame(data)

print(df)

>>>
   Text1                             Text2                                          Relationship
0  All Vegetables are Plants         Some Plants are good Vegetables                PV123
1  Cows are happy                    Cows are enjoying                              AM4355
2  Butterflies are really beautiful  Beautiful butterflies are delightful to watch  AI784
3  I enjoy Mangoes                   Mango pleases me                               PF897
4  Vegetables are green              Spinach is green                               PV776

I want to train a BERT model on this data. However, I also need to create examples of dissimilar sentences. My plan is to give the dataset as it is a label of 1, then shuffle Text2 and give the shuffled pairs a label of 0. The problem is that random shuffling alone, without making use of the "Relationship" column, does not reliably produce good dissimilar examples.

How can I shuffle my data so that texts like "All Vegetables are Plants" and "Spinach is green" do not end up on the same row in Text1 and Text2 respectively?

My attempt so far:

# Intended to shuffle so that Text1 and Text2 don't have similar relationships.
# In practice this only reorders whole rows, so each Text1 keeps its original Text2.
shuffled_df = df.sample(frac=1).reset_index(drop=True)
shuffled_df = shuffled_df.groupby('Relationship').apply(lambda x: x.sample(frac=1)).reset_index(drop=True)

print(shuffled_df)

Answer 1

Score: 1

There may be a more efficient method, but this works. The logic is to stack both text columns into a single column, take 2 random samples of it, and concatenate them side by side. Pairs whose relationship codes match (letters only) are dropped, and the intersection of the two letter codes becomes the new relationship. Note that this will not capture every relationship, since it can miss shared characteristics that were not recorded in the initial dataframe.

import pandas as pd  # df is the dataframe defined in the question

# 1 column of all text, rather than two
df1 = pd.concat([df[["Text1", "Relationship"]].rename(columns={"Text1": "Text"}),
                 df[["Text2", "Relationship"]].rename(columns={"Text2": "Text"})],
                ignore_index=True, axis=0)

# Get letters only for relationships
df1["Relationship"] = df1["Relationship"].str.extract('^([a-zA-Z]+)', expand=False)

# take 2 random samples (each a full shuffle of all rows) and concatenate side by side;
# the resulting columns are 0: text, 1: letters, 2: text, 3: letters
out = pd.concat([df1.sample(frac=1).reset_index(drop=True),
                 df1.sample(frac=1).reset_index(drop=True)],
                axis=1, ignore_index=True)

# filter for not equal characteristics only (copy to avoid chained-assignment warnings)
out = out.loc[out[1].ne(out[3])].copy()

# get similar characteristics (based on intersection of letters)
out["Relationship"] = out.apply(lambda row: "".join(set(row[1]) & set(row[3])), axis=1)

# required columns only
out = out[[0, 2, "Relationship"]].rename(columns={0: "Text1", 2: "Text2"})

Example output (rows vary between runs, since the sampling is random):

out.to_dict()
# Out[]:
# {'Text1': {0: 'Cows are enjoying',
#   2: 'Cows are happy',
#   3: 'Spinach is green',
#   4: 'Vegetables are green',
#   5: 'I enjoy Mangoes',
#   6: 'Beautiful butterflies are delightful to watch',
#   7: 'Mango pleases me'},
#  'Text2': {0: 'Vegetables are green',
#   2: 'I enjoy Mangoes',
#   3: 'Beautiful butterflies are delightful to watch',
#   4: 'Mango pleases me',
#   5: 'Cows are enjoying',
#   6: 'Some Plants are good Vegetables',
#   7: 'Cows are happy'},
#  'Relationship': {0: '', 2: '', 3: '', 4: 'P', 5: '', 6: '', 7: ''}}
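
The filter in this answer only drops pairs whose letter codes are identical, so a PV text can still be paired with a PF text (row 4 in the example output shares the letter P). If any shared letter should disqualify a pair, a stricter variant might look like the sketch below. It is a self-contained illustration, not part of the answer's code: the helper name `share_letter`, the column names `Rel1` and `Rel2`, the reduced sample frame, and the fixed `random_state` values are all assumptions made for the example.

```python
import pandas as pd

# Long-format frame in the same shape as df1: one text per row plus its letter code.
df1 = pd.DataFrame({
    "Text": ["All Vegetables are Plants", "Cows are happy", "I enjoy Mangoes",
             "Spinach is green", "Butterflies are really beautiful"],
    "Relationship": ["PV", "AM", "PF", "PV", "AI"],
})


def share_letter(a: str, b: str) -> bool:
    """Return True when two letter codes have at least one letter in common."""
    return bool(set(a) & set(b))


# Two independent shuffles placed side by side (seeds fixed only for repeatability).
left = df1.sample(frac=1, random_state=0).reset_index(drop=True)
right = df1.sample(frac=1, random_state=1).reset_index(drop=True)
pairs = pd.DataFrame({
    "Text1": left["Text"], "Rel1": left["Relationship"],
    "Text2": right["Text"], "Rel2": right["Relationship"],
})

# Keep only pairs whose codes are fully disjoint, e.g. PV vs AM but not PV vs PF.
mask = [not share_letter(a, b) for a, b in zip(pairs["Rel1"], pairs["Rel2"])]
strict = pairs.loc[mask, ["Text1", "Text2"]].reset_index(drop=True)
print(strict)
```

The stricter rule discards more candidate pairs, so on a small dataset it may leave very few rows; repeating the sampling step and concatenating the surviving pairs is one way to get more.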

Answer 2

Score: 0

I resolved this as follows:

  1. I created a new column containing the first 2 letters of the Relationship column.
  2. I used this new column to create a multi-index. A groupby on this new column should work here as well.
  3. For each group, I populated Text2 using texts from other groups.
  4. I concatenated all the modified groups back together.

With this, I was able to create semantically dissimilar pairs.
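
The steps above could be sketched roughly as follows, using the groupby variant mentioned in step 2 instead of a multi-index. This is a minimal sketch under stated assumptions, not the poster's actual code: the column names `Group` and `label`, the NumPy random generator with a fixed seed, and the final step that combines positive and negative pairs (taken from the labelling scheme described in the question) are all illustrative choices.

```python
import numpy as np
import pandas as pd

data = {'Text1': ["All Vegetables are Plants", "Cows are happy",
                  "Butterflies are really beautiful", "I enjoy Mangoes",
                  "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables', 'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me', 'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}
df = pd.DataFrame(data)

# Step 1: the first two letters of the relationship code form the group key.
df["Group"] = df["Relationship"].str[:2]

rng = np.random.default_rng(0)  # fixed seed only so the example is repeatable

negatives = []
# Step 2: iterate over the groups defined by the new column.
for key, group in df.groupby("Group"):
    # Step 3: candidate Text2 values come only from the other groups.
    others = df.loc[df["Group"] != key, "Text2"].to_numpy()
    group = group.copy()
    # Sample without replacement when enough candidates exist.
    group["Text2"] = rng.choice(others, size=len(group),
                                replace=len(group) > len(others))
    negatives.append(group)

# Step 4: concatenate the modified groups back together, in the original row order.
negative_df = pd.concat(negatives).sort_index()
negative_df["label"] = 0

# Original pairs are the positive examples, as described in the question.
positive_df = df.assign(label=1)
train_df = pd.concat([positive_df, negative_df], ignore_index=True).drop(columns="Group")
print(train_df)
```

Because the key is the exact two-letter prefix, PV and PF count as different groups even though both share P; comparing individual letters instead would be stricter. The sketch also assumes at least two groups exist, since `rng.choice` raises an error when there are no texts from other groups to draw from.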

huangapple
  • Posted on May 22, 2023, 22:05:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76307021.html