How to extract sentences from a pandas dataframe which are given in a column as word-per-row
Question
What is the most efficient way to loop over a large pandas DataFrame in which sentences are stored one word per row, with punctuation in a separate column? For example:
import pandas as pd

d = {'col1': ['This', 'is', 'a', 'simple', 'sentence',
              'This', 'is', 'another', 'sentence',
              'This', 'is', 'the', 'third', 'sentence',
              'Is', 'this', 'a', 'sentence', 'too'],
     'col2': ['', '', '', '', '!',
              '', '', '', '.',
              '', '', '', '', '...',
              '', '', '', '', '?']}
df = pd.DataFrame(data=d)
df
        col1 col2
0       This
1         is
2          a
3     simple
4   sentence    !
5       This
6         is
7    another
8   sentence    .
9       This
10        is
11       the
12     third
13  sentence  ...
14        Is
15      this
16         a
17  sentence
18       too    ?
Then, I would like to extract and work on individual sentences, like this:
df[0:5]
       col1 col2
0      This
1        is
2         a
3    simple
4  sentence    !
or
df[14:19]
        col1 col2
14        Is
15      this
16         a
17  sentence
18       too    ?
Thanks!
Answer 1
Score: 1
I would compute the position of each sentence/word pair and build a MultiIndex:
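# Number the sentences: the counter increases on the row after each non-empty punctuation cell.
grp = df["col2"].ne("").shift(fill_value=False).cumsum().add(1).rename("sentence")
# Number the words within each sentence and add both levels to the index.
out = (df.set_index(grp)
         .assign(word=lambda x: x.groupby(level=0).cumcount().add(1))
         .set_index("word", append=True))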
Output:
print(out)
                   col1 col2
sentence word
1        1         This
         2           is
         3            a
         4       simple
         5     sentence    !
2        1         This
         2           is
         3      another
         4     sentence    .
3        1         This
         2           is
         3          the
         4        third
         5     sentence  ...
4        1           Is
         2         this
         3            a
         4     sentence
         5          too    ?
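To see why the grouping key works, it can help to print its intermediate steps. The sketch below uses a smaller, made-up sample rather than the data from the question, and the names `ends`, `starts` and `steps` are illustrative only:

```python
import pandas as pd

# Made-up sample in the same shape as the question's data (two short sentences).
df = pd.DataFrame({
    "col1": ["This", "is", "fine", "Is", "it"],
    "col2": ["", "", "!", "", "?"],
})

ends = df["col2"].ne("")               # True on the row that closes a sentence
starts = ends.shift(fill_value=False)  # True on the row right after a sentence end
sentence = starts.cumsum().add(1)      # count of finished sentences so far, made 1-based

steps = pd.DataFrame({"ends": ends, "starts": starts, "sentence": sentence})
print(steps)
```

The `shift` matters here: without it, the counter would already increase on the row that carries the punctuation, so the last word of every sentence would be assigned to the next one.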
If you need the 2nd sentence as a DataFrame:
print(out.loc[2])
          col1 col2
word
1         This
2           is
3      another
4     sentence    .
Or, if you need it as a string:
s = " ".join(out.loc[2].agg("".join, axis=1))
print(s)
This is another sentence.
If you need the 3rd word of the 2nd sentence:
w = "".join(out.loc[pd.IndexSlice[2, 3], :].tolist())
print(w)
another
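Since the question asks about looping over all sentences, one possible way is to skip the MultiIndex and group directly on the same key. This is only a sketch on made-up sample data; `tokens`, `sentences` and `lengths` are names introduced for the illustration:

```python
import pandas as pd

# Made-up sample in the same shape as the question's data.
df = pd.DataFrame({
    "col1": ["This", "is", "fine", "Is", "it"],
    "col2": ["", "", "!", "", "?"],
})

# Same grouping key as in the answer: the counter increases after each sentence end.
grp = df["col2"].ne("").shift(fill_value=False).cumsum().add(1).rename("sentence")

# Glue each word to its punctuation, then join the tokens of every sentence.
tokens = df["col1"] + df["col2"]
sentences = tokens.groupby(grp).agg(" ".join)

for number, text in sentences.items():
    print(number, text)

# Each group is an ordinary sub-DataFrame, so per-sentence work can also go here.
lengths = {number: len(frame) for number, frame in df.groupby(grp)}
print(lengths)
```

Grouping once and iterating over the groups avoids slicing the frame by hand for every sentence, which is usually the slower part on large data.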
Update:
> A follow-up question: as col2 actually contains a lot of garbage, can empty cell .ne("") be replaced by a list .ne([".", "!", "?", "..."])?
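# Only real sentence terminators end a sentence; every other value in col2 is ignored.
m = df["col2"].isin([".", "!", "?", "..."])
grp = m.shift(fill_value=False).cumsum().add(1).rename("sentence")
# Blank out the garbage in col2 (it becomes NaN), then build the MultiIndex as before.
out = (df.assign(col2=df["col2"].where(m))
         .set_index(grp)
         .assign(word=lambda x: x.groupby(level=0).cumcount().add(1))
         .set_index("word", append=True))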
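One thing to watch with this variant: `where(m)` leaves NaN in col2 on every row that is not a terminator, so joining col1 and col2 as strings would fail on those rows. A minimal sketch of one way around it, assuming made-up data in which "x" and "#" stand in for the garbage:

```python
import pandas as pd

# Hypothetical data: "x" and "#" play the role of the garbage values in col2.
df = pd.DataFrame({
    "col1": ["This", "is", "fine", "Is", "it"],
    "col2": ["x", "", "!", "#", "?"],
})

m = df["col2"].isin([".", "!", "?", "..."])
grp = m.shift(fill_value=False).cumsum().add(1).rename("sentence")

# Passing "" as the replacement keeps col2 string-only instead of introducing NaN.
punct = df["col2"].where(m, "")
sentences = (df["col1"] + punct).groupby(grp).agg(" ".join)
print(sentences.tolist())
```

Whether to keep NaN or an empty string depends on what happens next: NaN is convenient for counting or dropping rows, while an empty string is easier when the sentences are rebuilt as text.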