March 9, 2023, 20:11:28

Creating Superset and subset according to a threshold value

Question

I am trying to find supersets and subsets among the values in a column (here, the letter column) of an Excel file. The data looks like this:

id	letter
1	A, B, D, E, F
2	B, C
3	B
4	D, B
5	B, D, A
6	X, Y, Z
7	X, Y
8	E, D
9	G
10	G

I would like to show and store that relation in the file (each superset together with its subsets), subject to a threshold value (for example, a score of at least 50%). For example:
'B' is a subset of 'A,B,D,E,F', but its score is 1/5 = 0.2, which is below the threshold, so it won't be listed as a subset. Also, the id of 'A,B,D,E,F' won't be listed as a subset of itself.
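
To make the threshold rule concrete, here is a minimal sketch of the score described above, assuming the score is the size of the subset divided by the size of the superset. The helper name subset_score is hypothetical and not part of the original code.

```python
def subset_score(sub, sup):
    """Return len(sub) / len(sup) if sub is a subset of sup, else None.

    Both arguments are comma-separated strings such as "B, D, A".
    """
    sub_set = {item.strip() for item in sub.split(",")}
    sup_set = {item.strip() for item in sup.split(",")}
    if not sub_set <= sup_set:  # not a subset at all
        return None
    return len(sub_set) / len(sup_set)


print(subset_score("B", "A, B, D, E, F"))        # 0.2, below a 0.5 threshold
print(subset_score("B, D, A", "A, B, D, E, F"))  # 0.6, meets a 0.5 threshold
print(subset_score("B, C", "A, B, D, E, F"))     # None, not a subset
```

With a threshold of 0.5, only the second pair would be recorded, which matches id 5 being listed under id 1 in the expected output.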

Output file:

id	letter	sub-ids
1	A, B, D, E, F	5
2	B, C	3
4	D, B	3
6	X, Y, Z	7
9	G	10

The code I tried is:

import itertools
import pandas as pd
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'letter': ['A, B, D, E, F','B, C','B','D, B','B, D, A','X, Y, Z','X, Y','E, D','G','G']})
lst = df['letter'].values.tolist()
lst = list(tuple(item.strip() for item in x.split(',')) for x in lst)
out = lst[:] 
for tup1,tup2 in itertools.combinations(lst, 2):
    a, b = (tup1, tup2) if len(tup1) < len(tup2) else (tup2, tup1)
    if all(elem in b for elem in a) and a in out:
        out.remove(a)
    if all(elem in b for elem in a) and 0.5<=len(a)/len(b)<=1.0:
        out.append(b)
filt = list(map(', '.join, out))
df2 = df.loc[df['letter'].isin(filt), :].drop_duplicates(subset='letter')
print(df2)

Answer 1

Score: 1

One possibility (admittedly not the fastest) could be:

import pandas as pd
df: pd.DataFrame = pd.DataFrame([
    ["A, B, D, E, F"], ["B, C"], ["B"], ["D, B"], ["B, D, A"], ["X, Y, Z"], ["X, Y"],
    ["E, D"], ["G"], ["G"]
], columns=["letters"])
threshold = 0.5
if __name__ == "__main__":
    sub_ids = []
    for i in range(len(df)):
        temp = []
        curr_letters_i = df.iloc[i]["letters"].replace(" ", "").split(",")
        for j in range(len(df)):
            if i == j:
                continue
            curr_letters_j = df.iloc[j]["letters"].replace(" ", "").split(",")
            if not all([letter in curr_letters_i for letter in curr_letters_j]):
                continue
            if len(curr_letters_j)/len(curr_letters_i) >= threshold:
                temp.append(f"{j}")
        sub_ids.append(",".join(temp))
    df["sub-ids"] = sub_ids
    print(df)

This returns

         letters sub-ids
0  A, B, D, E, F       4
1           B, C       2
2              B        
3           D, B       2
4        B, D, A       3
5        X, Y, Z       6
6           X, Y        
7           E, D        
8              G       9
9              G       8

Then, if we want to filter out the rows that are not supersets of any other row, we can add new_df = df.loc[df["sub-ids"] != ""] before the print. In this case, printing new_df gives

         letters sub-ids
0  A, B, D, E, F       4
1           B, C       2
3           D, B       2
4        B, D, A       3
5        X, Y, Z       6
8              G       9
9              G       8

To account for IDs that may not be numeric or in order, we can replace temp.append(f"{j}") with temp.append(f"{df.iloc[j]['id']}"), assuming df also has an id column. This gives:

     id        letters sub-ids
0   ID1  A, B, D, E, F     ID5
1   ID2           B, C     ID3
3   ID4           D, B     ID3
4   ID5        B, D, A     ID4
5   ID6        X, Y, Z     ID7
8   ID9              G    ID10
9  ID10              G     ID9

Then, if we wanted to add the similarity score as a column, we could introduce a sub_scores variable:

if __name__ == "__main__":
    sub_ids = []
    sub_scores = []
    for i in range(len(df)):
        temp_sub_ids = []
        temp_sub_scores = []
        curr_letters_i = df.iloc[i]["letters"].replace(" ", "").split(",")
        for j in range(len(df)):
            if i == j:
                continue
            curr_letters_j = df.iloc[j]["letters"].replace(" ", "").split(",")
            if not all([letter in curr_letters_i for letter in curr_letters_j]):
                continue
            if len(curr_letters_j)/len(curr_letters_i) >= threshold:
                temp_sub_ids.append(f"{df.iloc[j]['id']}")
                temp_sub_scores.append(f"{len(curr_letters_j)/len(curr_letters_i):.2f}")
        sub_ids.append(",".join(temp_sub_ids))
        sub_scores.append(temp_sub_scores)
    df["sub-ids"] = sub_ids
    df["sub-scores"] = sub_scores
    new_df = df.loc[df["sub-ids"] != ""]
    print(new_df)

This produces

     id        letters sub-ids sub-scores
0   ID1  A, B, D, E, F     ID5     [0.60]
1   ID2           B, C     ID3     [0.50]
3   ID4           D, B     ID3     [0.50]
4   ID5        B, D, A     ID4     [0.67]
5   ID6        X, Y, Z     ID7     [0.67]
8   ID9              G    ID10     [1.00]
9  ID10              G     ID9     [1.00]
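
Since the approach above is admittedly not the fastest, one possible variation is to parse each row into a frozenset once and use the built-in subset operator, instead of re-splitting the strings and scanning lists for every pair. The sketch below is plain Python; the function name find_subsets and the ID1 to ID10 labels are assumptions chosen to mirror the tables above. It still compares every pair of rows, so only the per-pair work gets cheaper.

```python
def find_subsets(ids, letters, threshold=0.5):
    """For each row, collect the ids of the other rows that are subsets of it
    with len(subset) / len(superset) >= threshold, along with their scores.

    Returns one (comma-joined sub-ids, list of scores) tuple per row.
    """
    # Parse every row once instead of re-splitting inside the inner loop.
    sets = [frozenset(item.strip() for item in text.split(",")) for text in letters]
    results = []
    for i, sup in enumerate(sets):
        row_ids, row_scores = [], []
        for j, sub in enumerate(sets):
            if i == j or not sub <= sup:  # skip self and non-subsets
                continue
            score = len(sub) / len(sup)
            if score >= threshold:
                row_ids.append(str(ids[j]))
                row_scores.append(round(score, 2))
        results.append((",".join(row_ids), row_scores))
    return results


# Hypothetical ids, labelled to match the tables above.
ids = [f"ID{n}" for n in range(1, 11)]
letters = ["A, B, D, E, F", "B, C", "B", "D, B", "B, D, A",
           "X, Y, Z", "X, Y", "E, D", "G", "G"]

for row_id, (sub_ids, scores) in zip(ids, find_subsets(ids, letters)):
    if sub_ids:  # keep only rows that have at least one qualifying subset
        print(row_id, sub_ids, scores)
```

With a DataFrame, the two parts of each tuple could be assigned back as the sub-ids and sub-scores columns, after which the same df["sub-ids"] != "" filter applies.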
