May 14, 2023, 17:45:38


How to extract sentences from a pandas dataframe which are given in a column as word-per-row

Question

What is the most efficient way to loop over a large pandas dataframe in which sentences are given one word per row, with the punctuation in another column? For example:

import pandas as pd

d = {'col1': ['This', 'is', 'a', 'simple', 'sentence',
              'This', 'is', 'another', 'sentence',
              'This', 'is', 'the', 'third', 'sentence',
              'Is', 'this', 'a', 'sentence', 'too'],
     'col2': ['', '', '', '', '!',
              '', '', '', '.',
              '', '', '', '', '...',
              '', '', '', '', '?']}
df = pd.DataFrame(data=d)
df
        col1 col2
0       This     
1         is     
2          a     
3     simple     
4   sentence    !
5       This     
6         is     
7    another     
8   sentence    .
9       This     
10        is     
11       the     
12     third     
13  sentence  ...
14        Is     
15      this     
16         a     
17  sentence     
18       too    ?

Then, I would like to extract and work on individual sentences, like this:

df[0:5]
       col1 col2
0      This     
1        is     
2         a     
3    simple     
4  sentence    !

df[14:19]
        col1 col2
14        Is     
15      this     
16         a     
17  sentence     
18       too    ?

Thanks!

Answer 1

Score: 1

I would compute the position of each word within its sentence and build a MultiIndex from the (sentence, word) pairs:

grp = df["col2"].ne("").shift(fill_value=False).cumsum().add(1).rename("sentence")
out = (df.set_index(grp).assign(word=lambda x: x.groupby(level=0).cumcount().add(1))
           .set_index("word", append=True))

Output:

print(out)
                   col1 col2
sentence word               
1        1         This     
         2           is     
         3            a     
         4       simple     
         5     sentence    !
2        1         This     
         2           is     
         3      another     
         4     sentence    .
3        1         This     
         2           is     
         3          the     
         4        third     
         5     sentence  ...
4        1           Is     
         2         this     
         3            a     
         4     sentence     
         5          too    ?
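To see why the grouping key works, it can help to look at the intermediate Series one step at a time. The sketch below is only an illustration on the question's sample data; the names `ends`, `shifted` and `steps` are introduced here and are not part of the original answer.

```python
import pandas as pd

d = {"col1": ["This", "is", "a", "simple", "sentence",
              "This", "is", "another", "sentence",
              "This", "is", "the", "third", "sentence",
              "Is", "this", "a", "sentence", "too"],
     "col2": ["", "", "", "", "!",
              "", "", "", ".",
              "", "", "", "", "...",
              "", "", "", "", "?"]}
df = pd.DataFrame(d)

# True on the last word of each sentence (the row that carries punctuation).
ends = df["col2"].ne("")

# Move each True one row down, so it marks the first word of the NEXT sentence.
shifted = ends.shift(fill_value=False)

# A running count of those marks gives 0, 1, 2, ...; add 1 for 1-based sentence ids.
grp = shifted.cumsum().add(1).rename("sentence")

# Side-by-side view of every step, for inspection only.
steps = pd.DataFrame({"col1": df["col1"], "ends": ends,
                      "shifted": shifted, "sentence": grp})
print(steps)
```

Without the `shift`, the counter would increase on the punctuated row itself, so the last word of every sentence would end up grouped with the following sentence.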

If you need the 2nd sentence as a DataFrame:

print(out.loc[2])
          col1 col2
word               
1         This     
2           is     
3      another     
4     sentence    .
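Since the question asks about looping over all sentences, one possible approach is to let `groupby` hand out one sub-DataFrame per sentence instead of slicing by position. This is a minimal sketch under the same sample data; the dictionary `sentences` is a hypothetical container used only for the illustration.

```python
import pandas as pd

d = {"col1": ["This", "is", "a", "simple", "sentence",
              "This", "is", "another", "sentence",
              "This", "is", "the", "third", "sentence",
              "Is", "this", "a", "sentence", "too"],
     "col2": ["", "", "", "", "!",
              "", "", "", ".",
              "", "", "", "", "...",
              "", "", "", "", "?"]}
df = pd.DataFrame(d)

# Same sentence id as in the answer: one integer per row.
grp = df["col2"].ne("").shift(fill_value=False).cumsum().add(1).rename("sentence")

# groupby yields (sentence id, sub-DataFrame) pairs; each chunk keeps
# the original row labels, so it matches what df[0:5], df[14:19] etc. return.
sentences = {}
for sentence_id, chunk in df.groupby(grp):
    sentences[sentence_id] = chunk

print(sentences[4])
```

Each chunk is an ordinary DataFrame, so any per-sentence processing can be applied inside the loop.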

Or, if you need it as a string:

s = " ".join(out.loc[2].agg("".join, axis=1))
print(s)
This is another sentence.
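If every sentence is needed as a string, a loop may not be necessary at all. The following sketch, which is a variation and not part of the original answer, glues each word to its punctuation and then joins the words per sentence in a single `groupby` call; the variable names `tokens` and `sentences` are assumptions made for the example.

```python
import pandas as pd

d = {"col1": ["This", "is", "a", "simple", "sentence",
              "This", "is", "another", "sentence",
              "This", "is", "the", "third", "sentence",
              "Is", "this", "a", "sentence", "too"],
     "col2": ["", "", "", "", "!",
              "", "", "", ".",
              "", "", "", "", "...",
              "", "", "", "", "?"]}
df = pd.DataFrame(d)

grp = df["col2"].ne("").shift(fill_value=False).cumsum().add(1).rename("sentence")

# Attach the punctuation (possibly an empty string) directly to its word.
tokens = df["col1"] + df["col2"]

# Join the tokens of each sentence with single spaces.
sentences = tokens.groupby(grp).agg(" ".join)
print(sentences.tolist())
```

The result is a Series indexed by sentence number, so `sentences[2]` gives the same string as `s` above.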

If you need the 3rd word of the 2nd sentence:

w = "".join(out.loc[pd.IndexSlice[2, 3], :].tolist())
print(w)
another
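When only the word itself is wanted, without its punctuation, a plain tuple lookup on the MultiIndex is an alternative to `pd.IndexSlice`. The sketch below rebuilds `out` as in the answer and reads a single cell; `word_only` is a name chosen here for illustration.

```python
import pandas as pd

d = {"col1": ["This", "is", "a", "simple", "sentence",
              "This", "is", "another", "sentence",
              "This", "is", "the", "third", "sentence",
              "Is", "this", "a", "sentence", "too"],
     "col2": ["", "", "", "", "!",
              "", "", "", ".",
              "", "", "", "", "...",
              "", "", "", "", "?"]}
df = pd.DataFrame(d)

grp = df["col2"].ne("").shift(fill_value=False).cumsum().add(1).rename("sentence")
out = (df.set_index(grp)
         .assign(word=lambda x: x.groupby(level=0).cumcount().add(1))
         .set_index("word", append=True))

# (sentence, word) tuple selects the row, "col1" selects the column:
# the result is a single Python string rather than a Series.
word_only = out.loc[(2, 3), "col1"]
print(word_only)
```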

Update:

> A follow-up question: as col2 actually contains a lot of garbage, can
> empty cell .ne("") be replaced by a list .ne([".", "!", "?", "..."]) ?

m = df["col2"].isin([".", "!", "?", "..."])
grp = m.shift(fill_value=False).cumsum().add(1).rename("sentence")
out = (df.assign(col2=df["col2"].where(m)).set_index(grp)
         .assign(word=lambda x: x.groupby(level=0).cumcount().add(1))
         .set_index("word", append=True))
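Note that `where(m)` with no second argument replaces the non-matching cells with NaN, which would break a later string join. If the sentences are to be turned back into strings, one option is to fill with an empty string instead. The sketch below uses a small made-up DataFrame in which `"x"` and `"#"` stand in for garbage values; the data and the names `marks`, `clean` and `sentences` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: "x" and "#" play the role of garbage in col2.
df = pd.DataFrame({
    "col1": ["Hello", "world", "Bye", "for", "now"],
    "col2": ["x", "!", "", "#", "?"],
})

marks = [".", "!", "?", "..."]

# True only where col2 holds a real sentence terminator.
m = df["col2"].isin(marks)
grp = m.shift(fill_value=False).cumsum().add(1).rename("sentence")

# Keep terminators, replace everything else with "" (not NaN),
# so that string concatenation still works.
clean = df["col2"].where(m, "")

sentences = (df["col1"] + clean).groupby(grp).agg(" ".join)
print(sentences.tolist())
```

`isin` performs a membership test per cell, whereas `ne` with a list compares element by element against the list, which is why the answer switches methods.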
