2023年5月18日 04:05:13go评论97阅读模式

英文:

How to use two dataframes to get to final dataframe based on keywords?

问题

df1['ColD1'] = df1['ColE'].apply(lambda x: ','.join([keyword.strip() for keyword in df2['ColD'].sum() if keyword in x]))
df1['ColA0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColA']))
df1['ColB0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColB']))
df1['ColC0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColC']))
df1['ColD1'] = df1['ColD1'].apply(lambda x: ','.join(set(x.split(','))))  # Remove duplicate values
df1['ColA0'] = df1['ColA0'].apply(lambda x: ','.join(set(x.split(','))))  # Remove duplicate values
df1['ColB0'] = df1['ColB0'].apply(lambda x: ','.join(set(x.split(','))))  # Remove duplicate values
df1['ColC0'] = df1['ColC0'].apply(lambda x: ','.join(set(x.split(','))))  # Remove duplicate values
df01 = df1

英文:

There are two dataframes -

df1 = pd.DataFrame({&#39;ColE&#39;: [&#39;Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.&#39;,&#39;Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.&#39;],
                    &#39;ColF&#39;: [&#39;F1&#39;, &#39;F1&#39;],
                    &#39;ColG&#39;: [&#39;G1&#39;, &#39;G1&#39;],
                    &#39;ColH&#39;: [&#39;H1&#39;, &#39;H2&#39;],
                    &#39;ColJ&#39;: [&#39;J1&#39;, &#39;J2&#39;]})
df2 = pd.DataFrame({&#39;ColA&#39;: [&#39;A1&#39;, &#39;A2&#39;, &#39;A3&#39;],
                    &#39;ColB&#39;: [&#39;B1&#39;, &#39;B2&#39;, &#39;B3&#39;],
                    &#39;ColC&#39;: [&#39;C1&#39;, &#39;C2&#39;, &#39;C3&#39;],
                    &#39;ColD&#39;: [&#39;d1,d2,d3,d4,d5,d6&#39;, &#39;d7,d8,d9,d10,d11,d12&#39;, &#39;d13,d14,d15,d16,d17,d18&#39;]})

how to get dataframe df01 from above two dataframes such that if any of the keywords from df2['ColD'] is present in df1['ColE'] it will subsequently return values from ColA, ColB, ColC, and ColD separated with comma and put the same in ColA0, ColB0, ColC0, ColD1 in df1 like -

df01 = pd.DataFrame({&#39;ColE&#39;: [&#39;Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.&#39;,&#39;Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.&#39;],
                    &#39;ColF&#39;: [&#39;F1&#39;, &#39;F1&#39;],
                    &#39;ColG&#39;: [&#39;G1&#39;, &#39;G1&#39;],
                    &#39;ColH&#39;: [&#39;H1&#39;, &#39;H2&#39;],
                    &#39;ColJ&#39;: [&#39;J1&#39;, &#39;J2&#39;],
                    &#39;ColD1&#39;: [&#39;d4&#39;, &#39;d10,d15&#39;],
                    &#39;ColA0&#39;: [&#39;A1&#39;, &#39;A2,A3&#39;],
                    &#39;ColB0&#39;: [&#39;B1&#39;, &#39;B2,B3&#39;],
                    &#39;ColC0&#39;: [&#39;C1&#39;, &#39;C2,C3&#39;],
                   })

The code that worked on -

import pandas as pd
df1 = pd.DataFrame({&#39;ColE&#39;: [&#39;Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry\&#39;s standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.&#39;,
                             &#39;Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.&#39;],
                    &#39;ColF&#39;: [&#39;F1&#39;, &#39;F1&#39;],
                    &#39;ColG&#39;: [&#39;G1&#39;, &#39;G1&#39;],
                    &#39;ColH&#39;: [&#39;H1&#39;, &#39;H2&#39;],
                    &#39;ColJ&#39;: [&#39;J1&#39;, &#39;J2&#39;]})
df2 = pd.DataFrame({&#39;ColA&#39;: [&#39;A1&#39;, &#39;A2&#39;, &#39;A3&#39;],
                    &#39;ColB&#39;: [&#39;B1&#39;, &#39;B2&#39;, &#39;B3&#39;],
                    &#39;ColC&#39;: [&#39;C1&#39;, &#39;C2&#39;, &#39;C3&#39;],
                    &#39;ColD&#39;: [&#39;d1,d2,d3,d4,d5,d6&#39;, &#39;d7,d8,d9,d10,d11,d12&#39;, &#39;d13,d14,d15,d16,d17,d18&#39;]})
df2[&#39;ColD&#39;] = df2[&#39;ColD&#39;].str.split(&#39;,&#39;)
df1[&#39;ColD1&#39;] = df1[&#39;ColE&#39;].apply(lambda x: &#39;,&#39;.join([keyword.strip() for keyword in df2[&#39;ColD&#39;].sum() if keyword in x]))
df1[&#39;ColA0&#39;] = df1[&#39;ColD1&#39;].apply(lambda x: &#39;,&#39;.join(df2.loc[df2[&#39;ColD&#39;].apply(lambda y: any(keyword in y for keyword in x.split(&#39;,&#39;))), &#39;ColA&#39;]))
df1[&#39;ColB0&#39;] = df1[&#39;ColD1&#39;].apply(lambda x: &#39;,&#39;.join(df2.loc[df2[&#39;ColD&#39;].apply(lambda y: any(keyword in y for keyword in x.split(&#39;,&#39;))), &#39;ColB&#39;]))
df1[&#39;ColC0&#39;] = df1[&#39;ColD1&#39;].apply(lambda x: &#39;,&#39;.join(df2.loc[df2[&#39;ColD&#39;].apply(lambda y: any(keyword in y for keyword in x.split(&#39;,&#39;))), &#39;ColC&#39;]))
df01 = df1

This gives the following output -

ColE	ColF	ColG	ColH	ColJ	ColD1	ColA0	ColB0	ColC0
Lorem Ipsum is simply dummy text of the printi..	F1	G1	H1	J1	d4	A1	B1	C1
Contrary to popular belief, Lorem Ipsum is not..	F1	G1	H2	J2	d1,d10,d15	A1,A2,A3	B1,B2,B3	C1,C2,C3

The expected output is -

ColE	ColF	ColG	ColH	ColJ	ColD1	ColA0	ColB0	ColC0
Lorem Ipsum is simply dummy text of the printi..	F1	G1	H1	J1	d4	A1	B1	C1
Contrary to popular belief, Lorem Ipsum is not..	F1	G1	H2	J2	d10,d15	A2,A3	B2,B3	C2,C3

How to make it like this?

答案1

得分: 1

你的问题是keyword in x对于d1为真，因为d1是d10和d15的一部分。你需要使用正则表达式搜索来确保只匹配完整的单词。注意，为了提高效率，你应该只计算一次ColD的总和，并将ColD1保持为列表，直到处理完毕。以下是代码的翻译部分：

df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
                    'ColF': ['F1', 'F1'],
                    'ColG': ['G1', 'G1'],
                    'ColH': ['H1', 'H2'],
                    'ColJ': ['J1', 'J2']})
df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
                    'ColB': ['B1', 'B2', 'B3'],
                    'ColC': ['C1', 'C2', 'C3'],
                    'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})
all_keywords = df2['ColD'].str.split(',').sum()
df1['ColD1'] = df1['ColE'].apply(lambda x: [keyword.strip() for keyword in all_keywords if re.search(fr'\b{keyword}\b', x)])
df1['ColA0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColA']))
df1['ColB0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColB']))
df1['ColC0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColC']))
df1['ColD1'] = df1['ColD1'].apply(','.join)

输出：

                                                ColE ColF ColG ColH ColJ    ColD1  ColA0  ColB0  ColC0
0  Lorem Ipsum is simply dummy text of the printi...   F1   G1   H1   J1       d4     A1     B1     C1
1  Contrary to popular belief, Lorem Ipsum is not...   F1   G1   H2   J2  d10,d15  A2,A3  B2,B3  C2,C3

请注意，这只是代码的翻译部分，不包括问题或其他内容。

英文:

Your issue is that keyword in x is true for d1 since d1 is part of d10 and d15. You need to use a regex search instead to ensure you only match the complete word. Note for efficiency you should compute the ColD sum only once, and keep ColD1 as a list until you are finished with it:

df1 = pd.DataFrame({&#39;ColE&#39;: [&#39;Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.&#39;,&#39;Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.&#39;],
                    &#39;ColF&#39;: [&#39;F1&#39;, &#39;F1&#39;],
                    &#39;ColG&#39;: [&#39;G1&#39;, &#39;G1&#39;],
                    &#39;ColH&#39;: [&#39;H1&#39;, &#39;H2&#39;],
                    &#39;ColJ&#39;: [&#39;J1&#39;, &#39;J2&#39;]})
df2 = pd.DataFrame({&#39;ColA&#39;: [&#39;A1&#39;, &#39;A2&#39;, &#39;A3&#39;],
                    &#39;ColB&#39;: [&#39;B1&#39;, &#39;B2&#39;, &#39;B3&#39;],
                    &#39;ColC&#39;: [&#39;C1&#39;, &#39;C2&#39;, &#39;C3&#39;],
                    &#39;ColD&#39;: [&#39;d1,d2,d3,d4,d5,d6&#39;, &#39;d7,d8,d9,d10,d11,d12&#39;, &#39;d13,d14,d15,d16,d17,d18&#39;]})
all_keywords = df2[&#39;ColD&#39;].str.split(&#39;,&#39;).sum()
df1[&#39;ColD1&#39;] = df1[&#39;ColE&#39;].apply(lambda x: [keyword.strip() for keyword in all_keywords if re.search(fr&#39;\b{keyword}\b&#39;, x)])
df1[&#39;ColA0&#39;] = df1[&#39;ColD1&#39;].apply(lambda x: &#39;,&#39;.join(df2.loc[df2[&#39;ColD&#39;].apply(lambda y: any(keyword in y for keyword in x)), &#39;ColA&#39;]))
df1[&#39;ColB0&#39;] = df1[&#39;ColD1&#39;].apply(lambda x: &#39;,&#39;.join(df2.loc[df2[&#39;ColD&#39;].apply(lambda y: any(keyword in y for keyword in x)), &#39;ColB&#39;]))
df1[&#39;ColC0&#39;] = df1[&#39;ColD1&#39;].apply(lambda x: &#39;,&#39;.join(df2.loc[df2[&#39;ColD&#39;].apply(lambda y: any(keyword in y for keyword in x)), &#39;ColC&#39;]))
df1[&#39;ColD1&#39;] = df1[&#39;ColD1&#39;].apply(&#39;,&#39;.join)

Output:

                                                ColE ColF ColG ColH ColJ    ColD1  ColA0  ColB0  ColC0
0  Lorem Ipsum is simply dummy text of the printi...   F1   G1   H1   J1       d4     A1     B1     C1
1  Contrary to popular belief, Lorem Ipsum is not...   F1   G1   H2   J2  d10,d15  A2,A3  B2,B3  C2,C3

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

使用两个数据框基于关键词生成最终数据框。

问题

答案1

A faulty keyboard where 0 and 1 sometime doesn't work is replaced by "o" and "i" respectively find the mistakes in the input and correct it

Pandas根据两列中的分隔符拆分对应的行，并复制其他所有内容。

plotnine:: 在 after_stat 中引用 facet_wrap 的参数是否有更好的方式？

鸭子类型注解在Python3中

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。