Creating Superset and subset according to a threshold value

Question

I am trying to find supersets and subsets among the values in a column (here, the letter column) of an Excel file. The data looks like this:

id  letter
1   A, B, D, E, F
2   B, C
3   B
4   D, B
5   B, D, A
6   X, Y, Z
7   X, Y
8   E, D
9   G
10  G

I would like to show and store that relation (the subsets and their supersets) in the file, subject to a threshold value (for example, a score of at least 50%). For example, 'B' is a subset of 'A, B, D, E, F', but its score is 1/5 = 0.2, so it won't be included as a subset. The id of 'A, B, D, E, F' itself also won't be listed as its own subset.
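The score described above can be read as the size of the candidate subset divided by the size of the superset. A minimal sketch of that reading follows; the helper names `parse_letters` and `subset_score` are hypothetical and not part of the original post.

```python
def parse_letters(cell: str) -> frozenset:
    """Split a cell such as 'A, B, D' into a set of stripped tokens."""
    return frozenset(token.strip() for token in cell.split(",") if token.strip())


def subset_score(candidate: str, container: str) -> float:
    """Return len(candidate) / len(container) if candidate is a subset, else 0.0."""
    sub, sup = parse_letters(candidate), parse_letters(container)
    if not sub or not sub <= sup:   # `<=` on sets means "is a subset of"
        return 0.0
    return len(sub) / len(sup)


print(subset_score("B", "A, B, D, E, F"))        # 0.2, below a 0.5 threshold
print(subset_score("B, D, A", "A, B, D, E, F"))  # 0.6, at or above 0.5
```

With a threshold of 0.5, row 3 ('B') is therefore rejected as a subset of row 1, while row 5 ('B, D, A') is accepted.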

Output file:

id  letter         sub-ids
1   A, B, D, E, F  5
2   B, C           3
4   D, B           3
6   X, Y, Z        7
9   G              10

The code I tried is:

    import itertools
    import pandas as pd

    df = pd.DataFrame({
        'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'letter': ['A, B, D, E, F','B, C','B','D, B','B, D, A','X, Y, Z','X, Y','E, D','G','G']})
    lst = df['letter'].values.tolist()
    # turn each cell into a tuple of stripped letters
    lst = list(tuple(item.strip() for item in x.split(',')) for x in lst)
    out = lst[:]
    for tup1, tup2 in itertools.combinations(lst, 2):
        # a is the shorter tuple, b the longer one
        a, b = (tup1, tup2) if len(tup1) < len(tup2) else (tup2, tup1)
        if all(elem in b for elem in a) and a in out:
            out.remove(a)
        if all(elem in b for elem in a) and 0.5 <= len(a)/len(b) <= 1.0:
            out.append(b)
    filt = list(map(', '.join, out))
    df2 = df.loc[df['letter'].isin(filt), :].drop_duplicates(subset='letter')
    print(df2)

Answer 1

Score: 1

One possibility (admittedly not the fastest) could be:

    import pandas as pd

    df: pd.DataFrame = pd.DataFrame([
        ["A, B, D, E, F"], ["B, C"], ["B"], ["D, B"], ["B, D, A"], ["X, Y, Z"], ["X, Y"],
        ["E, D"], ["G"], ["G"]
    ], columns=["letters"])
    threshold = 0.5

    if __name__ == "__main__":
        sub_ids = []
        for i in range(len(df)):
            temp = []
            curr_letters_i = df.iloc[i]["letters"].replace(" ", "").split(",")
            for j in range(len(df)):
                if i == j:
                    continue  # a row is never its own subset
                curr_letters_j = df.iloc[j]["letters"].replace(" ", "").split(",")
                # skip row j unless all of its letters appear in row i
                if not all([letter in curr_letters_i for letter in curr_letters_j]):
                    continue
                # keep row j only if it covers enough of row i
                if len(curr_letters_j)/len(curr_letters_i) >= threshold:
                    temp.append(f"{j}")
            sub_ids.append(",".join(temp))
        df["sub-ids"] = sub_ids
        print(df)

This returns

             letters  sub-ids
    0  A, B, D, E, F        4
    1           B, C        2
    2              B
    3           D, B        2
    4        B, D, A        3
    5        X, Y, Z        6
    6           X, Y
    7           E, D
    8              G        9
    9              G        8
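Since the answer describes its loop as not the fastest, one possible tightening is to parse each cell once into a `frozenset` and use the set subset operator `<=`. This is only a sketch under assumptions: `find_sub_ids` is a hypothetical helper name, it works on a plain list of cell strings, and it still compares every pair of rows, so it mainly removes repeated string splitting rather than changing the overall complexity. Unlike the list-based check, sets also ignore letters repeated within a single cell.

```python
from typing import List


def find_sub_ids(letters: List[str], threshold: float = 0.5) -> List[str]:
    """For each row, list the positions of other rows that are subsets of it
    and cover at least `threshold` of its items."""
    # Parse every cell exactly once instead of once per pair.
    parsed = [frozenset(t.strip() for t in cell.split(",")) for cell in letters]
    result = []
    for i, sup in enumerate(parsed):
        hits = [
            str(j)
            for j, sub in enumerate(parsed)
            if i != j and sub <= sup and len(sub) / len(sup) >= threshold
        ]
        result.append(",".join(hits))
    return result


rows = ["A, B, D, E, F", "B, C", "B", "D, B", "B, D, A",
        "X, Y, Z", "X, Y", "E, D", "G", "G"]
print(find_sub_ids(rows))
# ['4', '2', '', '2', '3', '6', '', '', '9', '8']
```

The returned list can be assigned directly to a column, for example `df["sub-ids"] = find_sub_ids(df["letters"].tolist())`.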

Then, to filter out the rows that are not supersets of any other row, we can add new_df = df.loc[df["sub-ids"] != ""] before the print. In this case, printing new_df gives

             letters  sub-ids
    0  A, B, D, E, F        4
    1           B, C        2
    3           D, B        2
    4        B, D, A        3
    5        X, Y, Z        6
    8              G        9
    9              G        8

To account for IDs that may not be numerical or in order, we may replace temp.append(f"{j}") with temp.append(f"{df.iloc[j]['id']}"). This assumes df has an id column, which the example frame above does not define. With IDs such as ID1 to ID10, this gives:

         id        letters  sub-ids
    0   ID1  A, B, D, E, F      ID5
    1   ID2           B, C      ID3
    3   ID4           D, B      ID3
    4   ID5        B, D, A      ID4
    5   ID6        X, Y, Z      ID7
    8   ID9              G     ID10
    9  ID10              G      ID9
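The answer's example frame only has a letters column, so the id lookup needs an id column to exist first. The sketch below puts the pieces together under that assumption: the labels ID1 to ID10 are taken from the output shown above, and the loop otherwise follows the answer's list-based check, reading the id by position from a pre-extracted list.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [f"ID{n}" for n in range(1, 11)],   # assumed labels, as in the output above
    "letters": ["A, B, D, E, F", "B, C", "B", "D, B", "B, D, A",
                "X, Y, Z", "X, Y", "E, D", "G", "G"],
})
threshold = 0.5

ids = df["id"].tolist()
cells = [cell.replace(" ", "").split(",") for cell in df["letters"]]

sub_ids = []
for i, letters_i in enumerate(cells):
    temp = []
    for j, letters_j in enumerate(cells):
        if i == j:
            continue
        is_subset = all(letter in letters_i for letter in letters_j)
        if is_subset and len(letters_j) / len(letters_i) >= threshold:
            temp.append(str(ids[j]))   # label from the id column, not the position
    sub_ids.append(",".join(temp))

df["sub-ids"] = sub_ids
new_df = df.loc[df["sub-ids"] != ""]   # keep only rows that have at least one subset
print(new_df)
```

Because the ids are read from the column, the result stays correct if the frame is re-sorted or the ids are arbitrary strings.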

Then, if we wanted to add the similarity score as a column, we could introduce a sub_scores variable:

    if __name__ == "__main__":
        sub_ids = []
        sub_scores = []
        for i in range(len(df)):
            temp_sub_ids = []
            temp_sub_scores = []
            curr_letters_i = df.iloc[i]["letters"].replace(" ", "").split(",")
            for j in range(len(df)):
                if i == j:
                    continue
                curr_letters_j = df.iloc[j]["letters"].replace(" ", "").split(",")
                if not all([letter in curr_letters_i for letter in curr_letters_j]):
                    continue
                if len(curr_letters_j)/len(curr_letters_i) >= threshold:
                    temp_sub_ids.append(f"{df.iloc[j]['id']}")
                    temp_sub_scores.append(f"{len(curr_letters_j)/len(curr_letters_i):.2f}")
            sub_ids.append(",".join(temp_sub_ids))
            sub_scores.append(temp_sub_scores)
        df["sub-ids"] = sub_ids
        df["sub-scores"] = sub_scores
        new_df = df.loc[df["sub-ids"] != ""]
        print(new_df)

This produces

         id        letters  sub-ids  sub-scores
    0   ID1  A, B, D, E, F      ID5      [0.60]
    1   ID2           B, C      ID3      [0.50]
    3   ID4           D, B      ID3      [0.50]
    4   ID5        B, D, A      ID4      [0.67]
    5   ID6        X, Y, Z      ID7      [0.67]
    8   ID9              G     ID10      [1.00]
    9  ID10              G      ID9      [1.00]
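The question also asks to store the result in a file, which the answer does not cover. A rough sketch of one way to do that, assuming CSV output is acceptable: the two-row `new_df` here is a hypothetical stand-in shaped like the output above, and an in-memory buffer stands in for a real file path. The list-valued sub-scores column is joined into a plain string first so it round-trips cleanly.

```python
import io

import pandas as pd

# Hypothetical result frame shaped like the output above (two rows only).
new_df = pd.DataFrame({
    "id": ["ID1", "ID2"],
    "letters": ["A, B, D, E, F", "B, C"],
    "sub-ids": ["ID5", "ID3"],
    "sub-scores": [["0.60"], ["0.50"]],
})

# Lists do not serialise cleanly, so join each one into a plain string first.
export_df = new_df.copy()
export_df["sub-scores"] = export_df["sub-scores"].map(",".join)

buffer = io.StringIO()                 # stand-in for a real path such as "result.csv"
export_df.to_csv(buffer, index=False)  # index=False drops the positional index column
csv_text = buffer.getvalue()
print(csv_text)
```

For an Excel target, `DataFrame.to_excel` plays the same role, provided an Excel writer engine such as openpyxl is installed.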

huangapple
  • Posted on March 9, 2023, 20:11:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75684444.html