How to extract sentences from a pandas dataframe which are given in a column as word-per-row

Question

What is the most efficient way to loop over a large pandas DataFrame in which sentences are given one word per row, with punctuation in another column? For example:

import pandas as pd

d = {'col1': ['This', 'is', 'a', 'simple', 'sentence',
              'This', 'is', 'another', 'sentence',
              'This', 'is', 'the', 'third', 'sentence', 
              'Is', 'this', 'a', 'sentence', 'too'],
     'col2': ['', '', '', '', '!',
              '', '', '', '.',
              '', '', '', '', '...',
              '', '', '', '', '?']}

df = pd.DataFrame(data=d)
df
        col1 col2
0       This     
1         is     
2          a     
3     simple     
4   sentence    !
5       This     
6         is     
7    another     
8   sentence    .
9       This     
10        is     
11       the     
12     third     
13  sentence  ...
14        Is     
15      this     
16         a     
17  sentence     
18       too    ?

Then, I would like to extract and work on individual sentences, like this:

df[0:5]
       col1 col2
0      This     
1        is     
2         a     
3    simple     
4  sentence    !

or

df[14:19]
        col1 col2
14        Is     
15      this     
16         a     
17  sentence     
18       too    ?

Thanks!

Answer 1

Score: 1

I would compute the position of each sentence/word pair and build a MultiIndex from it:

grp = df["col2"].ne("").shift(fill_value=False).cumsum().add(1).rename("sentence")

out = (df.set_index(grp).assign(word=lambda x: x.groupby(level=0).cumcount().add(1))
           .set_index("word", append=True))

Output:

print(out)

                   col1 col2
sentence word               
1        1         This     
         2           is     
         3            a     
         4       simple     
         5     sentence    !
2        1         This     
         2           is     
         3      another     
         4     sentence    .
3        1         This     
         2           is     
         3          the     
         4        third     
         5     sentence  ...
4        1           Is     
         2         this     
         3            a     
         4     sentence     
         5          too    ?
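
For readers wondering how `grp` arrives at the sentence number, here is a minimal sketch that splits the chain into its three steps on a shortened copy of the question's data. The intermediate names `is_end`, `starts_new` and `sentence_id` are illustrative only and do not appear in the answer.

```python
import pandas as pd

# Shortened copy of the question's data: two sentences.
df = pd.DataFrame({
    "col1": ["This", "is", "a", "simple", "sentence",
             "This", "is", "another", "sentence"],
    "col2": ["", "", "", "", "!",
             "", "", "", "."],
})

# Step 1: True on every row that closes a sentence (non-empty punctuation).
is_end = df["col2"].ne("")

# Step 2: move each flag down one row, so it marks the first word of the
# next sentence instead; the very first row is filled with False.
starts_new = is_end.shift(fill_value=False)

# Step 3: a running count of the flags is a 0-based sentence number;
# add 1 to make it 1-based.
sentence_id = starts_new.cumsum().add(1)

print(pd.DataFrame({"is_end": is_end,
                    "starts_new": starts_new,
                    "sentence": sentence_id}))
```

The shift is what keeps the punctuation row inside the sentence it ends; without it, the closing word would already be counted as part of the next sentence.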

If you need the 2nd sentence as a DataFrame:

print(out.loc[2])

          col1 col2
word               
1         This     
2           is     
3      another     
4     sentence    .
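
Since the question asks about looping over the sentences, one possible way to visit every sentence in turn is to group by the sentence number instead of slicing by hand. This is a sketch on a shortened copy of the data; the per-sentence work (here just counting words into a hypothetical `lengths` dict) is a placeholder, and whether this is the fastest option depends on what is actually done with each sentence.

```python
import pandas as pd

# Shortened copy of the question's data: two sentences.
df = pd.DataFrame({
    "col1": ["This", "is", "a", "simple", "sentence",
             "This", "is", "another", "sentence"],
    "col2": ["", "", "", "", "!",
             "", "", "", "."],
})

# Same sentence numbering as in the answer.
sentence_id = (df["col2"].ne("").shift(fill_value=False)
               .cumsum().add(1).rename("sentence"))

# groupby yields (sentence number, sub-DataFrame) pairs, one per sentence.
lengths = {}
for number, sentence_df in df.groupby(sentence_id):
    # Placeholder for real per-sentence processing.
    lengths[int(number)] = len(sentence_df)

print(lengths)
```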

If you need the 2nd sentence as a string:

s = " ".join(out.loc[2].agg("".join, axis=1))

print(s)

'This is another sentence.'
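
If all sentences are needed as strings, the same idea can be applied to every group at once. The following is a sketch, not part of the answer: it assumes `col2` holds empty strings (not NaN) wherever there is no punctuation, and the names `tokens` and `sentences` are illustrative.

```python
import pandas as pd

# Shortened copy of the question's data: two sentences.
df = pd.DataFrame({
    "col1": ["This", "is", "a", "simple", "sentence",
             "This", "is", "another", "sentence"],
    "col2": ["", "", "", "", "!",
             "", "", "", "."],
})

sentence_id = (df["col2"].ne("").shift(fill_value=False)
               .cumsum().add(1).rename("sentence"))

# Attach the punctuation to its word, then join the words of each sentence.
tokens = df["col1"] + df["col2"]
sentences = tokens.groupby(sentence_id).agg(" ".join)

print(sentences)
```

The result is a Series indexed by sentence number, so `sentences.loc[2]` gives the same string as `s` above.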

If you need the 3rd word of the 2nd sentence:

w = "".join(out.loc[pd.IndexSlice[2, 3], :].tolist())

print(w)

"another"

Update:

> A follow-up question: as col2 actually contains a lot of garbage, can
> empty cell .ne("") be replaced by a list .ne([".", "!", "?", "..."])?

m = df["col2"].isin([".", "!", "?", "..."])

grp = m.shift(fill_value=False).cumsum().add(1).rename("sentence")

out = (df.assign(col2=df["col2"].where(m)).set_index(grp)
         .assign(word=lambda x: x.groupby(level=0).cumcount().add(1))
         .set_index("word", append=True))
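
One thing to be aware of: `where(m)` without a second argument replaces the garbage with NaN, so joining the cells into a string as shown earlier would fail on those rows. A possible variant, sketched here with made-up garbage values (`"x"` and `"#"` are assumptions, not data from the question), passes `""` as the replacement so the column stays string-only:

```python
import pandas as pd

# Hypothetical data: col2 holds junk ("x", "#") next to real punctuation.
df = pd.DataFrame({
    "col1": ["This", "is", "fine", "So", "is", "this"],
    "col2": ["", "x", ".", "#", "", "?"],
})

# Only these values count as sentence-ending punctuation.
m = df["col2"].isin([".", "!", "?", "..."])
sentence_id = m.shift(fill_value=False).cumsum().add(1).rename("sentence")

# Keep real punctuation, replace everything else with an empty string.
clean = df.assign(col2=df["col2"].where(m, ""))

sentences = (clean["col1"] + clean["col2"]).groupby(sentence_id).agg(" ".join)
print(sentences)
```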
