Creating Superset and subset according to a threshold value

Question

I am trying to find supersets and subsets among the values in a column (here, the letter column) of an Excel file. The data looks like this:

id  letter
1   A, B, D, E, F
2   B, C
3   B
4   D, B
5   B, D, A
6   X, Y, Z
7   X, Y
8   E, D
9   G
10  G

I would like to show and store that relation in the file (each superset together with its subsets) according to a threshold value (for example, a score of at least 50%). For example:
'B' is a subset of 'A, B, D, E, F', but its score is 1/5 = 0.2, which is below the threshold, so it won't be listed as a subset. Also, the id of 'A, B, D, E, F' won't be listed among its own subsets.

Output file:

id  letter         sub-ids
1   A, B, D, E, F  5
2   B, C           3
4   D, B           3
6   X, Y, Z        7
9   G              10
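
To make the threshold rule concrete, here is a minimal sketch of how the score could be computed for one pair of rows. The helper name subset_score is hypothetical, and treating each cell as a set of comma-separated items is an assumption about the data.

```python
def subset_score(sub: str, sup: str) -> float:
    """Return len(sub) / len(sup) if every item of `sub` is in `sup`, else 0.0.

    `subset_score` is a hypothetical helper; cells are assumed to be
    comma-separated items such as "A, B, D".
    """
    sub_items = {item.strip() for item in sub.split(",")}
    sup_items = {item.strip() for item in sup.split(",")}
    if not sub_items <= sup_items:  # `<=` on sets means "is a subset of"
        return 0.0
    return len(sub_items) / len(sup_items)


print(subset_score("B", "A, B, D, E, F"))        # 0.2, below a 0.5 threshold
print(subset_score("B, D, A", "A, B, D, E, F"))  # 0.6, kept
```

With a threshold of 0.5, the first pair would be dropped and the second kept, which matches row 1 of the expected output.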

The code I tried is:

import itertools
import pandas as pd
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'letter': ['A, B, D, E, F','B, C','B','D, B','B, D, A','X, Y, Z','X, Y','E, D','G','G']})

lst = df['letter'].values.tolist()
lst = list(tuple(item.strip() for item in x.split(',')) for x in lst)
out = lst[:] 

for tup1,tup2 in itertools.combinations(lst, 2):
    a, b = (tup1, tup2) if len(tup1) < len(tup2) else (tup2, tup1)
    if all(elem in b for elem in a) and a in out:
        out.remove(a)
    if all(elem in b for elem in a) and 0.5<=len(a)/len(b)<=1.0:
        out.append(b)
filt = list(map(', '.join, out))
df2 = df.loc[df['letter'].isin(filt), :].drop_duplicates(subset='letter')
print(df2)

Answer 1

Score: 1

One possibility (admittedly not the fastest) could be:

import pandas as pd

df: pd.DataFrame = pd.DataFrame([
    ["A, B, D, E, F"], ["B, C"], ["B"], ["D, B"], ["B, D, A"], ["X, Y, Z"], ["X, Y"],
    ["E, D"], ["G"], ["G"]
], columns=["letters"])

threshold = 0.5


if __name__ == "__main__":
    sub_ids = []
    for i in range(len(df)):
        temp = []
        # Row i is the candidate superset.
        curr_letters_i = df.iloc[i]["letters"].replace(" ", "").split(",")
        for j in range(len(df)):
            if i == j:
                continue
            # Row j is the candidate subset.
            curr_letters_j = df.iloc[j]["letters"].replace(" ", "").split(",")
            if not all([letter in curr_letters_i for letter in curr_letters_j]):
                continue
            # Keep row j only if it covers at least `threshold` of row i.
            if len(curr_letters_j)/len(curr_letters_i) >= threshold:
                temp.append(f"{j}")
        sub_ids.append(",".join(temp))
    df["sub-ids"] = sub_ids
    print(df)

This returns (the sub-ids here are 0-based row positions, not the id values from the question):

         letters sub-ids
0  A, B, D, E, F       4
1           B, C       2
2              B        
3           D, B       2
4        B, D, A       3
5        X, Y, Z       6
6           X, Y        
7           E, D        
8              G       9
9              G       8

Then, if we want to filter out the rows that are not supersets of any other row, we can add new_df = df.loc[df["sub-ids"] != ""] before the print. In this case, printing new_df gives

         letters sub-ids
0  A, B, D, E, F       4
1           B, C       2
3           D, B       2
4        B, D, A       3
5        X, Y, Z       6
8              G       9
9              G       8

To account for IDs that may not be numerical or in order, we may replace temp.append(f"{j}") by temp.append(f"{df.iloc[j]['id']}"). This assumes the DataFrame also has an id column (here with values ID1 to ID10), and gives:

     id        letters sub-ids
0   ID1  A, B, D, E, F     ID5
1   ID2           B, C     ID3
3   ID4           D, B     ID3
4   ID5        B, D, A     ID4
5   ID6        X, Y, Z     ID7
8   ID9              G    ID10
9  ID10              G     ID9
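
Since the DataFrame in the first snippet has no id column, the following is a minimal sketch of how the ID-based variant might look end to end. The ID1 to ID10 values are taken from the output above; parsing every row once up front (the parsed and ids lists) is a small restructuring of my own, not a change in logic.

```python
import pandas as pd

# Same data as before, plus an explicit `id` column (values as in the output above).
df = pd.DataFrame({
    "id": [f"ID{n}" for n in range(1, 11)],
    "letters": ["A, B, D, E, F", "B, C", "B", "D, B", "B, D, A",
                "X, Y, Z", "X, Y", "E, D", "G", "G"],
})
threshold = 0.5

# Parse each row once instead of re-splitting inside the inner loop.
parsed = [s.replace(" ", "").split(",") for s in df["letters"]]
ids = df["id"].tolist()

sub_ids = []
for i, letters_i in enumerate(parsed):
    matches = [
        ids[j]
        for j, letters_j in enumerate(parsed)
        if i != j
        and all(letter in letters_i for letter in letters_j)  # row j is a subset of row i
        and len(letters_j) / len(letters_i) >= threshold      # and covers enough of it
    ]
    sub_ids.append(",".join(matches))

df["sub-ids"] = sub_ids
new_df = df.loc[df["sub-ids"] != ""]  # keep only rows that have at least one subset
print(new_df)
```

Because the IDs are read from the column rather than derived from the row position, the result does not depend on the rows being in any particular order.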

Then, if we wanted to add the similarity score as a column, we could introduce a sub_scores variable:

if __name__ == "__main__":
    sub_ids = []
    sub_scores = []
    for i in range(len(df)):
        temp_sub_ids = []
        temp_sub_scores = []
        curr_letters_i = df.iloc[i]["letters"].replace(" ", "").split(",")
        for j in range(len(df)):
            if i == j:
                continue
            curr_letters_j = df.iloc[j]["letters"].replace(" ", "").split(",")
            if not all([letter in curr_letters_i for letter in curr_letters_j]):
                continue
            if len(curr_letters_j)/len(curr_letters_i) >= threshold:
                temp_sub_ids.append(f"{df.iloc[j]['id']}")
                temp_sub_scores.append(f"{len(curr_letters_j)/len(curr_letters_i):.2f}")
        sub_ids.append(",".join(temp_sub_ids))
        sub_scores.append(temp_sub_scores)
    df["sub-ids"] = sub_ids
    df["sub-scores"] = sub_scores
    new_df = df.loc[df["sub-ids"] != ""]
    print(new_df)

This produces

     id        letters sub-ids sub-scores
0   ID1  A, B, D, E, F     ID5     [0.60]
1   ID2           B, C     ID3     [0.50]
3   ID4           D, B     ID3     [0.50]
4   ID5        B, D, A     ID4     [0.67]
5   ID6        X, Y, Z     ID7     [0.67]
8   ID9              G    ID10     [1.00]
9  ID10              G     ID9     [1.00]
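
As the loops above are admittedly not the fastest, here is a hedged sketch of a set-based variant that may scale somewhat better: each row is parsed once into a frozenset and the subset test uses the built-in `<=` operator. The function name find_subsets, its parameters, and the long output layout (one row per superset/subset pair) are my own choices, not part of the approach above. Note also that sets drop repeated letters within a cell, which the list-based version does not.

```python
import pandas as pd


def find_subsets(df, threshold=0.5, id_col="id", letters_col="letters"):
    """Return one row per (superset id, subset id, score) pair.

    Hypothetical helper: assumes `letters_col` holds comma-separated items
    and `id_col` holds the row identifiers.
    """
    # Parse every cell once; frozenset makes the subset test a single operator.
    sets = [frozenset(s.replace(" ", "").split(",")) for s in df[letters_col]]
    ids = df[id_col].tolist()
    rows = []
    for i, sup in enumerate(sets):
        for j, sub in enumerate(sets):
            if i == j or not sub <= sup:  # skip self-pairs and non-subsets
                continue
            score = len(sub) / len(sup)
            if score >= threshold:
                rows.append({"id": ids[i], "sub-id": ids[j], "score": round(score, 2)})
    return pd.DataFrame(rows, columns=["id", "sub-id", "score"])


data = pd.DataFrame({
    "id": [f"ID{n}" for n in range(1, 11)],
    "letters": ["A, B, D, E, F", "B, C", "B", "D, B", "B, D, A",
                "X, Y, Z", "X, Y", "E, D", "G", "G"],
})
pairs = find_subsets(data, threshold=0.5)
print(pairs)
```

The comparison is still quadratic in the number of rows; the gain comes only from avoiding repeated string splitting and list membership checks. The resulting pairs frame could then be written out with a standard pandas writer such as to_csv if the relation needs to be stored in a file.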
