英文:
Extract relevant rows from pandas dataframe when duplicate column values are present
问题
我有一个如下的pandas数据框架:
id | left | top | width | height | Text |
---|---|---|---|---|---|
1 | 12 | 34 | 12 | 34 | commercial |
2 | 99 | 42 | 99 | 42 | general |
3 | 1 | 47 | 9 | 4 | liability |
4 | 10 | 69 | 32 | 67 | commercial |
5 | 99 | 72 | 79 | 88 | available |
我想要根据列值Text
提取特定行。所以我想要在列Text
中使用re.search
搜索某些关键词短语,例如liability commercial
,如果匹配成功,则提取相应的行,即第3行和第4行。因此,如果输入是liability commercial
,那么输出应该是以下提取的行:
id | left | top | width | height | Text |
---|---|---|---|---|---|
3 | 1 | 47 | 9 | 4 | liability |
4 | 10 | 69 | 32 | 67 | commercial |
请注意,列Text
可能包含重复值。所以在上面的情况中,有2行包含单词commercial
。
英文:
I have a pandas data frame as follows:
id | left | top | width | height | Text |
---|---|---|---|---|---|
1 | 12 | 34 | 12 | 34 | commercial |
2 | 99 | 42 | 99 | 42 | general |
3 | 1 | 47 | 9 | 4 | liability |
4 | 10 | 69 | 32 | 67 | commercial |
5 | 99 | 72 | 79 | 88 | available |
I want to extract specific rows based on the column value Text
. So I want to search for certain keyphrases like liability commercial
using re.search
in the column Text
and if I get a match then extract the rows i.e. 3rd and 4th row. So if the input is liability commercial
then the output should be the following rows extracted:
id | left | top | width | height | Text |
---|---|---|---|---|---|
3 | 1 | 47 | 9 | 4 | liability |
4 | 10 | 69 | 32 | 67 | commercial |
Keep in mind that the column Text
may contain duplicate values. So in the above case, there are 2 rows with the word commerial
present.
Thanks in advance!
答案1
得分: 1
以下是翻译好的代码部分:
使用:
phrase = 'liability commercial'
#按子字符串匹配 - 使用空格分隔的值
m = df['Text'].str.contains(phrase.replace(' ','|'))
#按使用空格分隔的值匹配
m = df['Text'].isin(phrase.split())
#根据掩码筛选行并获取Text列中的最后重复值
df = df[m].drop_duplicates(['Text'], keep='last')
print (df)
id left top width height Text
2 3 1 47 9 4 liability
3 4 10 69 32 67 commercial
或者,如果需要根据匹配行分组,可以更改条件掩码,这里不考虑分割值的位置和可能的重复项:
phrase = 'liability commercial'
m = ~df['Text'].str.contains(phrase.replace(' ','|'))
#m = ~df['Text'].isin(phrase.split())
df = df[m.cumsum().duplicated(keep=False) & ~m]
print (df)
id left top width height Text
2 3 1 47 9 4 liability
3 4 10 69 32 67 commercial
如果需要根据分割值的确切匹配进行匹配,可以修改[此解决方案](https://stackoverflow.com/a/49005205/2901002):
phrase = 'liability commercial'
#https://stackoverflow.com/a/49005205/2901002
pat = np.asarray(phrase.split())
N = len(pat)
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
return c
arr = df['Text'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x+N)]
df = df[np.in1d(np.arange(len(arr)), d)]
print (df)
id left top width height Text
2 3 1 47 9 4 liability
3 4 10 69 32 67 commercial
希望这对你有所帮助!
英文:
Use:
phrase = 'liability commercial'
#match by substrings - splitted values by spaces
m = df['Text'].str.contains(phrase.replace(' ','|'))
#match by splitted values by spaces
m = df['Text'].isin(phrase.split())
#filter rows by mask and get last duplicated values in Text column
df = df[m].drop_duplicates(['Text'], keep='last')
print (df)
id left top width height Text
2 3 1 47 9 4 liability
3 4 10 69 32 67 commercial
Or if need groups by matched rows by conditions change mask, here position of splitted values ad possible duplicates not counts:
phrase = 'liability commercial'
m = ~df['Text'].str.contains(phrase.replace(' ','|'))
#m = ~df['Text'].isin(phrase.split())
df = df[m.cumsum().duplicated(keep=False) & ~m]
print (df)
id left top width height Text
2 3 1 47 9 4 liability
3 4 10 69 32 67 commercial
If need match by exactly matched by splitted values is possible modify this solution:
phrase = 'liability commercial'
#https://stackoverflow.com/a/49005205/2901002
pat = np.asarray(phrase.split())
N = len(pat)
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
return c
arr = df['Text'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x+N)]
df = df[np.in1d(np.arange(len(arr)), d)]
print (df)
id left top width height Text
2 3 1 47 9 4 liability
3 4 10 69 32 67 commercial
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论