Creating supersets and subsets according to a threshold value
Question
I am trying to find supersets and subsets among the values in one column (here, the letter column) of an Excel file. The data looks like this:
id | letter |
---|---|
1 | A, B, D, E, F |
2 | B, C |
3 | B |
4 | D, B |
5 | B, D, A |
6 | X, Y, Z |
7 | X, Y |
8 | E, D |
9 | G |
10 | G |
I would like to show and store that relation in the file (each superset together with its subsets), subject to a threshold value (for example, a score of at least 50%). For example:
'B' is a subset of 'A, B, D, E, F', but its score is 1/5 = 0.2, which is below the threshold, so it won't be listed as a subset. The id of 'A, B, D, E, F' itself also won't be listed as one of its own subsets.
Output file:
id | letter | sub-ids |
---|---|---|
1 | A, B, D, E, F | 5 |
2 | B, C | 3 |
4 | D, B | 3 |
6 | X, Y, Z | 7 |
9 | G | 10 |
The code I tried is:
import itertools
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'letter': ['A, B, D, E, F','B, C','B','D, B','B, D, A','X, Y, Z','X, Y','E, D','G','G']})
lst = df['letter'].values.tolist()
lst = list(tuple(item.strip() for item in x.split(',')) for x in lst)
out = lst[:]
for tup1, tup2 in itertools.combinations(lst, 2):
    a, b = (tup1, tup2) if len(tup1) < len(tup2) else (tup2, tup1)
    if all(elem in b for elem in a) and a in out:
        out.remove(a)
    if all(elem in b for elem in a) and 0.5 <= len(a)/len(b) <= 1.0:
        out.append(b)
filt = list(map(', '.join, out))
df2 = df.loc[df['letter'].isin(filt), :].drop_duplicates(subset='letter')
print(df2)
Answer 1
Score: 1
One possibility (admittedly not the fastest) could be:
import pandas as pd

df: pd.DataFrame = pd.DataFrame([
    ["A, B, D, E, F"], ["B, C"], ["B"], ["D, B"], ["B, D, A"], ["X, Y, Z"], ["X, Y"],
    ["E, D"], ["G"], ["G"]
], columns=["letters"])
threshold = 0.5

if __name__ == "__main__":
    sub_ids = []
    for i in range(len(df)):
        temp = []
        # Candidate superset: the letters of row i
        curr_letters_i = df.iloc[i]["letters"].replace(" ", "").split(",")
        for j in range(len(df)):
            if i == j:
                continue  # a row is never its own subset
            curr_letters_j = df.iloc[j]["letters"].replace(" ", "").split(",")
            # Skip row j unless all of its letters appear in row i
            if not all([letter in curr_letters_i for letter in curr_letters_j]):
                continue
            # Keep row j only if it covers enough of row i
            if len(curr_letters_j)/len(curr_letters_i) >= threshold:
                temp.append(f"{j}")
        sub_ids.append(",".join(temp))
    df["sub-ids"] = sub_ids
    print(df)
This returns
letters sub-ids
0 A, B, D, E, F 4
1 B, C 2
2 B
3 D, B 2
4 B, D, A 3
5 X, Y, Z 6
6 X, Y
7 E, D
8 G 9
9 G 8
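The same check can also be written with Python sets, parsing each cell once instead of on every comparison. This is only a sketch of one possible variant, not the code above: find_sub_ids is a hypothetical helper name, and using sets assumes that no letter is repeated within a cell (a repeated letter would change the lengths and therefore the score).

```python
import pandas as pd


def find_sub_ids(letters, threshold=0.5):
    """Return, per row, the comma-joined positions of its qualifying subsets.

    Hypothetical helper: assumes letters are not repeated within a cell.
    """
    # Parse every cell once into a set of stripped letters.
    sets = [{part.strip() for part in cell.split(",")} for cell in letters]
    result = []
    for i, sup in enumerate(sets):
        matches = [
            str(j)
            for j, sub in enumerate(sets)
            # sub <= sup is the set "is subset of" test
            if i != j and sub <= sup and len(sub) / len(sup) >= threshold
        ]
        result.append(",".join(matches))
    return result


df = pd.DataFrame({"letters": [
    "A, B, D, E, F", "B, C", "B", "D, B", "B, D, A",
    "X, Y, Z", "X, Y", "E, D", "G", "G",
]})
df["sub-ids"] = find_sub_ids(df["letters"], threshold=0.5)
print(df)
```

On this data the sub-ids column comes out the same as in the table above; the loop is still quadratic in the number of rows.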
Then, if we want to filter out the rows that are not supersets of any other row, we can add new_df = df.loc[df["sub-ids"] != ""] before the print. In this case, printing new_df gives
letters sub-ids
0 A, B, D, E, F 4
1 B, C 2
3 D, B 2
4 B, D, A 3
5 X, Y, Z 6
8 G 9
9 G 8
To account for IDs that may not be numerical or in order, we may replace temp.append(f"{j}") by temp.append(f"{df.iloc[j]['id']}"). With an id column present in df, this gives:
id letters sub-ids
0 ID1 A, B, D, E, F ID5
1 ID2 B, C ID3
3 ID4 D, B ID3
4 ID5 B, D, A ID4
5 ID6 X, Y, Z ID7
8 ID9 G ID10
9 ID10 G ID9
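The output above shows an id column that the earlier snippet does not construct. A minimal sketch of how it might be set up, assuming labels of the form ID1 to ID10 (taken from the printed output) and the same 0.5 threshold; the ids and sets variables are illustrative names only:

```python
import pandas as pd

df = pd.DataFrame({
    # Assumed labels, chosen to match the printed output.
    "id": [f"ID{k}" for k in range(1, 11)],
    "letters": [
        "A, B, D, E, F", "B, C", "B", "D, B", "B, D, A",
        "X, Y, Z", "X, Y", "E, D", "G", "G",
    ],
})
threshold = 0.5

ids = df["id"].tolist()
sets = [{part.strip() for part in cell.split(",")} for cell in df["letters"]]

sub_ids = []
for i, sup in enumerate(sets):
    # Report the id value of each qualifying subset rather than its position.
    sub_ids.append(",".join(
        ids[j]
        for j, sub in enumerate(sets)
        if i != j and sub <= sup and len(sub) / len(sup) >= threshold
    ))

df["sub-ids"] = sub_ids
new_df = df.loc[df["sub-ids"] != ""]
print(new_df)
```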
Then, if we wanted to add the similarity score as a column, we could introduce a sub_scores variable:
if __name__ == "__main__":
    sub_ids = []
    sub_scores = []
    for i in range(len(df)):
        temp_sub_ids = []
        temp_sub_scores = []
        curr_letters_i = df.iloc[i]["letters"].replace(" ", "").split(",")
        for j in range(len(df)):
            if i == j:
                continue
            curr_letters_j = df.iloc[j]["letters"].replace(" ", "").split(",")
            if not all([letter in curr_letters_i for letter in curr_letters_j]):
                continue
            if len(curr_letters_j)/len(curr_letters_i) >= threshold:
                temp_sub_ids.append(f"{df.iloc[j]['id']}")
                temp_sub_scores.append(f"{len(curr_letters_j)/len(curr_letters_i):.2f}")
        sub_ids.append(",".join(temp_sub_ids))
        sub_scores.append(temp_sub_scores)
    df["sub-ids"] = sub_ids
    df["sub-scores"] = sub_scores
    new_df = df.loc[df["sub-ids"] != ""]
    print(new_df)
This produces
id letters sub-ids sub-scores
0 ID1 A, B, D, E, F ID5 [0.60]
1 ID2 B, C ID3 [0.50]
3 ID4 D, B ID3 [0.50]
4 ID5 B, D, A ID4 [0.67]
5 ID6 X, Y, Z ID7 [0.67]
8 ID9 G ID10 [1.00]
9 ID10 G ID9 [1.00]
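If the relation is meant to be stored, a long table with one row per (superset, subset) pair may be easier to work with than comma-joined strings and lists of formatted scores. The following is a sketch of that alternative layout, not part of the approach above; it assumes unique ids labelled ID1 to ID10 as in the printed output, and pairs_df, parsed and the rounding to two decimals are illustrative choices.

```python
import itertools

import pandas as pd

df = pd.DataFrame({
    # Assumed labels, chosen to match the printed output.
    "id": [f"ID{k}" for k in range(1, 11)],
    "letters": [
        "A, B, D, E, F", "B, C", "B", "D, B", "B, D, A",
        "X, Y, Z", "X, Y", "E, D", "G", "G",
    ],
})
threshold = 0.5

# Map each id to its set of letters (requires the ids to be unique).
parsed = {
    row_id: {part.strip() for part in cell.split(",")}
    for row_id, cell in zip(df["id"], df["letters"])
}

# permutations(..., 2) yields every ordered pair of distinct rows,
# so each row is tested as a superset of every other row.
pairs = [
    {"id": sup_id, "sub-id": sub_id, "score": round(len(sub) / len(sup), 2)}
    for (sup_id, sup), (sub_id, sub) in itertools.permutations(parsed.items(), 2)
    if sub <= sup and len(sub) / len(sup) >= threshold
]
pairs_df = pd.DataFrame(pairs)
print(pairs_df)
```

Here the score stays numeric, so it can be sorted or filtered later, and the table could be written out with pairs_df.to_csv and a file name of your choice.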