How to extract sentences from a pandas dataframe which are given in a column as word-per-row

Question

What is the most efficient way to loop over a large pandas DataFrame in which sentences are stored one word per row, with the punctuation in a separate column? For example:

    import pandas as pd

    d = {'col1': ['This', 'is', 'a', 'simple', 'sentence',
                  'This', 'is', 'another', 'sentence',
                  'This', 'is', 'the', 'third', 'sentence',
                  'Is', 'this', 'a', 'sentence', 'too'],
         'col2': ['', '', '', '', '!',
                  '', '', '', '.',
                  '', '', '', '', '...',
                  '', '', '', '', '?']}
    df = pd.DataFrame(data=d)
    df

            col1 col2
    0       This
    1         is
    2          a
    3     simple
    4   sentence    !
    5       This
    6         is
    7    another
    8   sentence    .
    9       This
    10        is
    11       the
    12     third
    13  sentence  ...
    14        Is
    15      this
    16         a
    17  sentence
    18       too    ?

Then, I would like to extract and work on individual sentences, like this:

    df[0:5]

           col1 col2
    0      This
    1        is
    2         a
    3    simple
    4  sentence    !

or

    df[14:19]

            col1 col2
    14        Is
    15      this
    16         a
    17  sentence
    18       too    ?

Thanks!

Answer 1

Score: 1

I would compute the position of each sentence/word pair and build a MultiIndex from it:

    grp = df["col2"].ne("").shift(fill_value=False).cumsum().add(1).rename("sentence")

    out = (df.set_index(grp)
             .assign(word=lambda x: x.groupby(level=0).cumcount().add(1))
             .set_index("word", append=True))
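
To make the grouping key less opaque, here is a minimal sketch that splits the `grp` one-liner into its intermediate steps. It uses only the first two sentences of the question's data; the names `ends`, `starts_new` and `steps` are illustrative and do not appear in the original answer.

```python
import pandas as pd

# First two sentences of the question's toy data.
df = pd.DataFrame({
    "col1": ["This", "is", "a", "simple", "sentence",
             "This", "is", "another", "sentence"],
    "col2": ["", "", "", "", "!",
             "", "", "", "."],
})

# True on the last word of each sentence (the row that carries punctuation).
ends = df["col2"].ne("")

# Shift down by one row: True now marks the first word of the *next* sentence.
starts_new = ends.shift(fill_value=False)

# A running count of sentence starts gives a 0-based sentence id; add 1 for 1-based.
grp = starts_new.cumsum().add(1).rename("sentence")

# Side-by-side view of the three steps (names are illustrative only).
steps = pd.DataFrame({"ends": ends, "shifted": starts_new, "sentence": grp})
print(steps)
```

The shift is what keeps the punctuated word inside its own sentence: without it, the counter would increase on the last word of a sentence rather than on the first word of the following one.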

Output:

    print(out)

                       col1 col2
    sentence word
    1        1         This
             2           is
             3            a
             4       simple
             5     sentence    !
    2        1         This
             2           is
             3      another
             4     sentence    .
    3        1         This
             2           is
             3          the
             4        third
             5     sentence  ...
    4        1           Is
             2         this
             3            a
             4     sentence
             5          too    ?

If you need the 2nd sentence as a DataFrame:

    print(out.loc[2])

              col1 col2
    word
    1         This
    2           is
    3      another
    4     sentence    .

Or, if you need it as a string:

    s = " ".join(out.loc[2].agg("".join, axis=1))
    print(s)

    This is another sentence.
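
Since the question asks about looping over every sentence, the same grouping key can also be passed straight to `groupby`, without building the MultiIndex first. The following is only a sketch of that variation, not part of the original answer; `tokens` and `sentences` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["This", "is", "a", "simple", "sentence",
             "This", "is", "another", "sentence",
             "This", "is", "the", "third", "sentence",
             "Is", "this", "a", "sentence", "too"],
    "col2": ["", "", "", "", "!",
             "", "", "", ".",
             "", "", "", "", "...",
             "", "", "", "", "?"],
})

# Same sentence key as in the answer.
grp = df["col2"].ne("").shift(fill_value=False).cumsum().add(1).rename("sentence")

# Glue each word to its punctuation, then join the tokens of every sentence.
tokens = df["col1"] + df["col2"]
sentences = tokens.groupby(grp).agg(" ".join)

for number, text in sentences.items():
    print(number, text)

# To work on each sentence as its own DataFrame instead:
for number, sentence_df in df.groupby(grp):
    print(number, len(sentence_df))
```

This builds all sentences in one pass, which is usually preferable to slicing the frame once per sentence on a large DataFrame.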

If you need the 3rd word of the 2nd sentence:

    w = "".join(out.loc[pd.IndexSlice[2, 3], :].tolist())
    print(w)

    another
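
If only the word itself is wanted, without its punctuation column, a possible shortcut is to address the cell directly with a `(sentence, word)` tuple and the column name. This is a sketch under the assumption that the MultiIndex is unique, as it is for the data in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["This", "is", "a", "simple", "sentence",
             "This", "is", "another", "sentence"],
    "col2": ["", "", "", "", "!",
             "", "", "", "."],
})

grp = df["col2"].ne("").shift(fill_value=False).cumsum().add(1).rename("sentence")
out = (df.set_index(grp)
         .assign(word=lambda x: x.groupby(level=0).cumcount().add(1))
         .set_index("word", append=True))

# (sentence, word) tuple selects the row; "col1" selects the column.
w = out.loc[(2, 3), "col1"]
print(w)
```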

Update:

> A follow-up question: as col2 actually contains a lot of garbage, can
> empty cell .ne("") be replaced by a list .ne([".", "!", "?", "..."])?

    m = df["col2"].isin([".", "!", "?", "..."])
    grp = m.shift(fill_value=False).cumsum().add(1).rename("sentence")

    out = (df.assign(col2=df["col2"].where(m)).set_index(grp)
             .assign(word=lambda x: x.groupby(level=0).cumcount().add(1))
             .set_index("word", append=True))
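
One thing to keep in mind with this variant: `where(m)` replaces every non-terminator value in `col2` with NaN, so string joins on that column need the gaps filled first. Below is a small sketch on made-up data (the "x" and "junk" values are hypothetical stand-ins for the garbage mentioned in the follow-up); `clean` and `sentences` are illustrative names.

```python
import pandas as pd

# Hypothetical data: col2 holds stray values besides real sentence terminators.
df = pd.DataFrame({
    "col1": ["This", "is", "short", "So", "is", "this"],
    "col2": ["", "x", ".", "junk", "", "?"],
})

# Only rows whose col2 is a real terminator end a sentence.
m = df["col2"].isin([".", "!", "?", "..."])
grp = m.shift(fill_value=False).cumsum().add(1).rename("sentence")

# where(m) leaves NaN in non-terminator rows, so fill with "" before concatenating.
clean = df["col2"].where(m).fillna("")
sentences = (df["col1"] + clean).groupby(grp).agg(" ".join)
print(sentences.tolist())
```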

huangapple
  • Posted on 2023-05-14 17:45:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76246812.html