使用两个数据框基于关键词生成最终数据框。

huangapple go评论72阅读模式
英文:

How to use two dataframes to get to final dataframe based on keywords?

问题

df1['ColD1'] = df1['ColE'].apply(lambda x: ','.join([keyword.strip() for keyword in df2['ColD'].sum() if keyword in x]))
df1['ColA0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColA']))
df1['ColB0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColB']))
df1['ColC0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColC']))

df1['ColD1'] = df1['ColD1'].apply(lambda x: ','.join(set(x.split(','))))  # Remove duplicate values
df1['ColA0'] = df1['ColA0'].apply(lambda x: ','.join(set(x.split(','))))  # Remove duplicate values
df1['ColB0'] = df1['ColB0'].apply(lambda x: ','.join(set(x.split(','))))  # Remove duplicate values
df1['ColC0'] = df1['ColC0'].apply(lambda x: ','.join(set(x.split(','))))  # Remove duplicate values

df01 = df1
英文:

There are two dataframes -

df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
                    'ColF': ['F1', 'F1'],
                    'ColG': ['G1', 'G1'],
                    'ColH': ['H1', 'H2'],
                    'ColJ': ['J1', 'J2']})

df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
                    'ColB': ['B1', 'B2', 'B3'],
                    'ColC': ['C1', 'C2', 'C3'],
                    'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})

how to get dataframe df01 from above two dataframes such that if any of the keywords from df2['ColD'] is present in df1['ColE'] it will subsequently return values from ColA, ColB, ColC, and ColD separated with comma and put the same in ColA0, ColB0, ColC0, ColD1 in df1 like -

df01 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
                    'ColF': ['F1', 'F1'],
                    'ColG': ['G1', 'G1'],
                    'ColH': ['H1', 'H2'],
                    'ColJ': ['J1', 'J2'],
                    'ColD1': ['d4', 'd10,d15'],
                    'ColA0': ['A1', 'A2,A3'],
                    'ColB0': ['B1', 'B2,B3'],
                    'ColC0': ['C1', 'C2,C3'],
                   })

The code that worked on -

import pandas as pd

df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry\'s standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.',
                             'Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
                    'ColF': ['F1', 'F1'],
                    'ColG': ['G1', 'G1'],
                    'ColH': ['H1', 'H2'],
                    'ColJ': ['J1', 'J2']})

df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
                    'ColB': ['B1', 'B2', 'B3'],
                    'ColC': ['C1', 'C2', 'C3'],
                    'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})

df2['ColD'] = df2['ColD'].str.split(',')

df1['ColD1'] = df1['ColE'].apply(lambda x: ','.join([keyword.strip() for keyword in df2['ColD'].sum() if keyword in x]))
df1['ColA0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColA']))
df1['ColB0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColB']))
df1['ColC0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColC']))

df01 = df1

This gives the following output -

ColE ColF ColG ColH ColJ ColD1 ColA0 ColB0 ColC0
Lorem Ipsum is simply dummy text of the printi.. F1 G1 H1 J1 d4 A1 B1 C1
Contrary to popular belief, Lorem Ipsum is not.. F1 G1 H2 J2 d1,d10,d15 A1,A2,A3 B1,B2,B3 C1,C2,C3

The expected output is -

ColE ColF ColG ColH ColJ ColD1 ColA0 ColB0 ColC0
Lorem Ipsum is simply dummy text of the printi.. F1 G1 H1 J1 d4 A1 B1 C1
Contrary to popular belief, Lorem Ipsum is not.. F1 G1 H2 J2 d10,d15 A2,A3 B2,B3 C2,C3

How to make it like this?

答案1

得分: 1

你的问题是keyword in x对于d1为真,因为d1d10d15的一部分。你需要使用正则表达式搜索来确保只匹配完整的单词。注意,为了提高效率,你应该只计算一次ColD的总和,并将ColD1保持为列表,直到处理完毕。以下是代码的翻译部分:

df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
                    'ColF': ['F1', 'F1'],
                    'ColG': ['G1', 'G1'],
                    'ColH': ['H1', 'H2'],
                    'ColJ': ['J1', 'J2']})

df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
                    'ColB': ['B1', 'B2', 'B3'],
                    'ColC': ['C1', 'C2', 'C3'],
                    'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})

all_keywords = df2['ColD'].str.split(',').sum()

df1['ColD1'] = df1['ColE'].apply(lambda x: [keyword.strip() for keyword in all_keywords if re.search(fr'\b{keyword}\b', x)])
df1['ColA0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColA']))
df1['ColB0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColB']))
df1['ColC0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColC']))
df1['ColD1'] = df1['ColD1'].apply(','.join)

输出:

                                                ColE ColF ColG ColH ColJ    ColD1  ColA0  ColB0  ColC0
0  Lorem Ipsum is simply dummy text of the printi...   F1   G1   H1   J1       d4     A1     B1     C1
1  Contrary to popular belief, Lorem Ipsum is not...   F1   G1   H2   J2  d10,d15  A2,A3  B2,B3  C2,C3

请注意,这只是代码的翻译部分,不包括问题或其他内容。

英文:

Your issue is that keyword in x is true for d1 since d1 is part of d10 and d15. You need to use a regex search instead to ensure you only match the complete word. Note for efficiency you should compute the ColD sum only once, and keep ColD1 as a list until you are finished with it:

df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
                    'ColF': ['F1', 'F1'],
                    'ColG': ['G1', 'G1'],
                    'ColH': ['H1', 'H2'],
                    'ColJ': ['J1', 'J2']})

df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
                    'ColB': ['B1', 'B2', 'B3'],
                    'ColC': ['C1', 'C2', 'C3'],
                    'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})

all_keywords = df2['ColD'].str.split(',').sum()

df1['ColD1'] = df1['ColE'].apply(lambda x: [keyword.strip() for keyword in all_keywords if re.search(fr'\b{keyword}\b', x)])
df1['ColA0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColA']))
df1['ColB0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColB']))
df1['ColC0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColC']))
df1['ColD1'] = df1['ColD1'].apply(','.join)

Output:

                                                ColE ColF ColG ColH ColJ    ColD1  ColA0  ColB0  ColC0
0  Lorem Ipsum is simply dummy text of the printi...   F1   G1   H1   J1       d4     A1     B1     C1
1  Contrary to popular belief, Lorem Ipsum is not...   F1   G1   H2   J2  d10,d15  A2,A3  B2,B3  C2,C3

huangapple
  • 本文由 发表于 2023年5月18日 04:05:13
  • 转载请务必保留本文链接:https://go.coder-hub.com/76275836.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定