从 pandas 数据框中提取相关行,当存在重复列数值时。

huangapple go评论75阅读模式
英文:

Extract relevant rows from pandas dataframe when duplicate column values are present

问题

我有一个如下的pandas数据框架:

id left top width height Text
1 12 34 12 34 commercial
2 99 42 99 42 general
3 1 47 9 4 liability
4 10 69 32 67 commercial
5 99 72 79 88 available

我想要根据列值Text 提取特定行。所以我想要在列Text中使用re.search搜索某些关键词短语,例如liability commercial,如果匹配成功,则提取相应的行,即第3行和第4行。因此,如果输入是liability commercial,那么输出应该是以下提取的行:

id left top width height Text
3 1 47 9 4 liability
4 10 69 32 67 commercial

请注意,列Text 可能包含重复值。所以在上面的情况中,有2行包含单词commercial

英文:

I have a pandas data frame as follows:

id left top width height Text
1 12 34 12 34 commercial
2 99 42 99 42 general
3 1 47 9 4 liability
4 10 69 32 67 commercial
5 99 72 79 88 available

I want to extract specific rows based on the column value Text. So I want to search for certain keyphrases like liability commercial using re.search in the column Text and if I get a match then extract the rows i.e. 3rd and 4th row. So if the input is liability commercial then the output should be the following rows extracted:

id left top width height Text
3 1 47 9 4 liability
4 10 69 32 67 commercial

Keep in mind that the column Text may contain duplicate values. So in the above case, there are 2 rows with the word commerial present.

Thanks in advance!

答案1

得分: 1

以下是翻译好的代码部分:

使用

phrase = 'liability commercial'

#按子字符串匹配 - 使用空格分隔的值
m = df['Text'].str.contains(phrase.replace(' ','|'))
#按使用空格分隔的值匹配
m = df['Text'].isin(phrase.split())

#根据掩码筛选行并获取Text列中的最后重复值
df = df[m].drop_duplicates(['Text'], keep='last')
print (df)
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial

或者如果需要根据匹配行分组可以更改条件掩码这里不考虑分割值的位置和可能的重复项

phrase = 'liability commercial'
m = ~df['Text'].str.contains(phrase.replace(' ','|'))
#m = ~df['Text'].isin(phrase.split())

df = df[m.cumsum().duplicated(keep=False) & ~m]
print (df)
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial

如果需要根据分割值的确切匹配进行匹配可以修改[此解决方案](https://stackoverflow.com/a/49005205/2901002)

phrase = 'liability commercial'

#https://stackoverflow.com/a/49005205/2901002
pat = np.asarray(phrase.split())
N = len(pat)

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

arr = df['Text'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]

d = [i  for x in c for i in range(x, x+N)]
df = df[np.in1d(np.arange(len(arr)), d)]
print (df)
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial

希望这对你有所帮助!

英文:

Use:

phrase = 'liability commercial'
#match by substrings - splitted values by spaces
m = df['Text'].str.contains(phrase.replace(' ','|'))
#match by splitted values by spaces
m = df['Text'].isin(phrase.split())
#filter rows by mask and get last duplicated values in Text column
df = df[m].drop_duplicates(['Text'], keep='last')
print (df)
id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial

Or if need groups by matched rows by conditions change mask, here position of splitted values ad possible duplicates not counts:

phrase = 'liability commercial'
m = ~df['Text'].str.contains(phrase.replace(' ','|'))
#m = ~df['Text'].isin(phrase.split())
df = df[m.cumsum().duplicated(keep=False) & ~m]
print (df)
id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial

If need match by exactly matched by splitted values is possible modify this solution:

phrase = 'liability commercial'
#https://stackoverflow.com/a/49005205/2901002
pat = np.asarray(phrase.split())
N = len(pat)
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
return c
arr = df['Text'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i  for x in c for i in range(x, x+N)]
df = df[np.in1d(np.arange(len(arr)), d)]
print (df)
id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial

huangapple
  • 本文由 发表于 2023年7月6日 14:43:16
  • 转载请务必保留本文链接:https://go.coder-hub.com/76626134.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定