英文:
How to use two dataframes to get to final dataframe based on keywords?
问题
df1['ColD1'] = df1['ColE'].apply(lambda x: ','.join([keyword.strip() for keyword in df2['ColD'].sum() if keyword in x]))
df1['ColA0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColA']))
df1['ColB0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColB']))
df1['ColC0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColC']))
df1['ColD1'] = df1['ColD1'].apply(lambda x: ','.join(set(x.split(',')))) # Remove duplicate values
df1['ColA0'] = df1['ColA0'].apply(lambda x: ','.join(set(x.split(',')))) # Remove duplicate values
df1['ColB0'] = df1['ColB0'].apply(lambda x: ','.join(set(x.split(',')))) # Remove duplicate values
df1['ColC0'] = df1['ColC0'].apply(lambda x: ','.join(set(x.split(',')))) # Remove duplicate values
df01 = df1
英文:
There are two dataframes -
df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
'ColF': ['F1', 'F1'],
'ColG': ['G1', 'G1'],
'ColH': ['H1', 'H2'],
'ColJ': ['J1', 'J2']})
df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
'ColB': ['B1', 'B2', 'B3'],
'ColC': ['C1', 'C2', 'C3'],
'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})
how to get dataframe df01 from above two dataframes such that if any of the keywords from df2['ColD'] is present in df1['ColE'] it will subsequently return values from ColA, ColB, ColC, and ColD separated with comma and put the same in ColA0, ColB0, ColC0, ColD1 in df1 like -
df01 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
'ColF': ['F1', 'F1'],
'ColG': ['G1', 'G1'],
'ColH': ['H1', 'H2'],
'ColJ': ['J1', 'J2'],
'ColD1': ['d4', 'd10,d15'],
'ColA0': ['A1', 'A2,A3'],
'ColB0': ['B1', 'B2,B3'],
'ColC0': ['C1', 'C2,C3'],
})
The code that worked on -
import pandas as pd
df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry\'s standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.',
'Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
'ColF': ['F1', 'F1'],
'ColG': ['G1', 'G1'],
'ColH': ['H1', 'H2'],
'ColJ': ['J1', 'J2']})
df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
'ColB': ['B1', 'B2', 'B3'],
'ColC': ['C1', 'C2', 'C3'],
'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})
df2['ColD'] = df2['ColD'].str.split(',')
df1['ColD1'] = df1['ColE'].apply(lambda x: ','.join([keyword.strip() for keyword in df2['ColD'].sum() if keyword in x]))
df1['ColA0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColA']))
df1['ColB0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColB']))
df1['ColC0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColC']))
df01 = df1
This gives the following output -
ColE | ColF | ColG | ColH | ColJ | ColD1 | ColA0 | ColB0 | ColC0 |
---|---|---|---|---|---|---|---|---|
Lorem Ipsum is simply dummy text of the printi.. | F1 | G1 | H1 | J1 | d4 | A1 | B1 | C1 |
Contrary to popular belief, Lorem Ipsum is not.. | F1 | G1 | H2 | J2 | d1,d10,d15 | A1,A2,A3 | B1,B2,B3 | C1,C2,C3 |
The expected output is -
ColE | ColF | ColG | ColH | ColJ | ColD1 | ColA0 | ColB0 | ColC0 |
---|---|---|---|---|---|---|---|---|
Lorem Ipsum is simply dummy text of the printi.. | F1 | G1 | H1 | J1 | d4 | A1 | B1 | C1 |
Contrary to popular belief, Lorem Ipsum is not.. | F1 | G1 | H2 | J2 | d10,d15 | A2,A3 | B2,B3 | C2,C3 |
How to make it like this?
答案1
得分: 1
你的问题是keyword in x
对于d1
为真,因为d1
是d10
和d15
的一部分。你需要使用正则表达式搜索来确保只匹配完整的单词。注意,为了提高效率,你应该只计算一次ColD
的总和,并将ColD1
保持为列表,直到处理完毕。以下是代码的翻译部分:
df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
'ColF': ['F1', 'F1'],
'ColG': ['G1', 'G1'],
'ColH': ['H1', 'H2'],
'ColJ': ['J1', 'J2']})
df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
'ColB': ['B1', 'B2', 'B3'],
'ColC': ['C1', 'C2', 'C3'],
'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})
all_keywords = df2['ColD'].str.split(',').sum()
df1['ColD1'] = df1['ColE'].apply(lambda x: [keyword.strip() for keyword in all_keywords if re.search(fr'\b{keyword}\b', x)])
df1['ColA0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColA']))
df1['ColB0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColB']))
df1['ColC0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColC']))
df1['ColD1'] = df1['ColD1'].apply(','.join)
输出:
ColE ColF ColG ColH ColJ ColD1 ColA0 ColB0 ColC0
0 Lorem Ipsum is simply dummy text of the printi... F1 G1 H1 J1 d4 A1 B1 C1
1 Contrary to popular belief, Lorem Ipsum is not... F1 G1 H2 J2 d10,d15 A2,A3 B2,B3 C2,C3
请注意,这只是代码的翻译部分,不包括问题或其他内容。
英文:
Your issue is that keyword in x
is true for d1
since d1
is part of d10
and d15
. You need to use a regex search instead to ensure you only match the complete word. Note for efficiency you should compute the ColD
sum only once, and keep ColD1
as a list until you are finished with it:
df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
'ColF': ['F1', 'F1'],
'ColG': ['G1', 'G1'],
'ColH': ['H1', 'H2'],
'ColJ': ['J1', 'J2']})
df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
'ColB': ['B1', 'B2', 'B3'],
'ColC': ['C1', 'C2', 'C3'],
'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})
all_keywords = df2['ColD'].str.split(',').sum()
df1['ColD1'] = df1['ColE'].apply(lambda x: [keyword.strip() for keyword in all_keywords if re.search(fr'\b{keyword}\b', x)])
df1['ColA0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColA']))
df1['ColB0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColB']))
df1['ColC0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColC']))
df1['ColD1'] = df1['ColD1'].apply(','.join)
Output:
ColE ColF ColG ColH ColJ ColD1 ColA0 ColB0 ColC0
0 Lorem Ipsum is simply dummy text of the printi... F1 G1 H1 J1 d4 A1 B1 C1
1 Contrary to popular belief, Lorem Ipsum is not... F1 G1 H2 J2 d10,d15 A2,A3 B2,B3 C2,C3
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论