July 6, 2023 14:43:16


Extract relevant rows from pandas dataframe when duplicate column values are present

Question

I have a pandas DataFrame as follows:

id	left	top	width	height	Text
1	12	34	12	34	commercial
2	99	42	99	42	general
3	1	47	9	4	liability
4	10	69	32	67	commercial
5	99	72	79	88	available

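I want to extract specific rows based on the value in the Text column. Specifically, I want to search the Text column with re.search for a key phrase such as liability commercial and, when it matches, extract the corresponding rows, here the 3rd and 4th. So for the input liability commercial, the output should be the following rows: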

id	left	top	width	height	Text
3	1	47	9	4	liability
4	10	69	32	67	commercial

Keep in mind that the Text column may contain duplicate values: in the example above, two rows contain the word commercial.


Answer 1

Score: 1

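Use:
phrase = 'liability commercial'
# option 1: match by substring, using the space-separated words as regex alternatives
m = df['Text'].str.contains(phrase.replace(' ','|'))
# option 2: match whole cell values against the space-separated words (overrides option 1)
m = df['Text'].isin(phrase.split())
# filter rows by the mask and keep only the last occurrence of each duplicated Text value
df = df[m].drop_duplicates(['Text'], keep='last')
print (df)
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial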
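The two masks above can be compared side by side on the sample data. This is a minimal sketch, not part of the original answer: the frame is rebuilt from the table in the question, and the names `words`, `pattern`, `substring_mask`, `exact_mask` and `deduped` are illustrative. Escaping each word with `re.escape` is an added precaution for phrases that contain regex metacharacters.

```python
import re

import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "left": [12, 99, 1, 10, 99],
    "top": [34, 42, 47, 69, 72],
    "width": [12, 99, 9, 32, 79],
    "height": [34, 42, 4, 67, 88],
    "Text": ["commercial", "general", "liability", "commercial", "available"],
})

phrase = "liability commercial"
words = phrase.split()

# Escape each word so that regex metacharacters are matched literally
# (an added precaution, not in the original answer)
pattern = "|".join(re.escape(w) for w in words)

# True where Text contains any of the words as a substring
substring_mask = df["Text"].str.contains(pattern, na=False)
# True where Text is exactly equal to one of the words
exact_mask = df["Text"].isin(words)

# Keep the last occurrence of each matched Text value
deduped = df[exact_mask].drop_duplicates("Text", keep="last")
print(deduped)
```

On this sample both masks select the same rows. Note that `keep='last'` returns rows 3 and 4 only because the wanted commercial row happens to be the last duplicate; it does not check that the words are adjacent.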
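Or, if you need groups of matched rows, change the mask as follows; here the position of the split words and possible duplicates are not taken into account: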
phrase = 'liability commercial'
m = ~df['Text'].str.contains(phrase.replace(' ','|'))
#m = ~df['Text'].isin(phrase.split())
df = df[m.cumsum().duplicated(keep=False) & ~m]
print (df)
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial
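Tracing the cumsum-based mask by hand suggests that it drops a lone match only when that match comes before the first non-matching row; a lone match that directly follows a non-matching row shares that row's cumulative sum and appears to be kept. If the goal is to keep only runs of consecutive matches, a run-length filter is one alternative. The following is a sketch under that assumption, not the original answer's method; `hit`, `run_id`, `run_len` and `min_run` are illustrative names, and only the id and Text columns of the sample are reproduced.

```python
import pandas as pd

# Reduced version of the sample frame (id and Text only)
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Text": ["commercial", "general", "liability", "commercial", "available"],
})

phrase = "liability commercial"
hit = df["Text"].isin(phrase.split())

# Give every run of consecutive equal mask values its own label
run_id = hit.ne(hit.shift()).cumsum()
# Length of the run that each row belongs to
run_len = hit.groupby(run_id).transform("size")

# Assumption: a "group" means at least two matching rows in a row
min_run = 2
result = df[hit & (run_len >= min_run)]
print(result)
```

Like the mask above, this ignores the order of the words inside a run, so commercial followed by liability would also be kept.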
If you need an exact match on the sequence of split words, you can adapt [this solution](https://stackoverflow.com/a/49005205/2901002):
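import numpy as np
phrase = 'liability commercial'
# https://stackoverflow.com/a/49005205/2901002
pat = np.asarray(phrase.split())
N = len(pat)
def rolling_window(a, window):
    # view of all length-`window` slices of `a`, built without copying data
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c
arr = df['Text'].values
# True where N consecutive values equal the phrase words, in order
b = np.all(rolling_window(arr, N) == pat, axis=1)
# start positions of the matching windows
c = np.mgrid[0:len(b)][b]
# every row position covered by a matching window
d = [i  for x in c for i in range(x, x+N)]
# np.isin replaces np.in1d, which is deprecated as of NumPy 2.0
df = df[np.isin(np.arange(len(arr)), d)]
print (df)
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial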
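On NumPy 1.20 or later, the same windowed comparison can be written with `sliding_window_view`, which avoids computing strides by hand. This is a sketch of that variant rather than the linked solution itself; `keep`, `windows` and `starts` are illustrative names, and only the id and Text columns of the sample are reproduced.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Reduced version of the sample frame (id and Text only)
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Text": ["commercial", "general", "liability", "commercial", "available"],
})

phrase = "liability commercial"
pat = np.array(phrase.split())
n = len(pat)

# Convert the column to a fixed-width string array for elementwise comparison
arr = np.asarray(df["Text"], dtype=str)

keep = np.zeros(len(arr), dtype=bool)
if len(arr) >= n:
    # One row per window of n consecutive values
    windows = sliding_window_view(arr, n)
    # Start positions where the window equals the phrase words in order
    starts = np.flatnonzero((windows == pat).all(axis=1))
    for s in starts:
        keep[s:s + n] = True

result = df[keep]
print(result)
```

The length check guards against frames shorter than the phrase, for which no window exists.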

