June 8, 2023 01:50:42

Pandas dataframe - Separate sentences by NaN values

Question

I have a pandas DataFrame with one word per row in a column called "Word".
Sentences are separated by a blank line in the source file, so I read it with skip_blank_lines=False to keep each separator as a row of NaN values.

import pandas as pd
df = pd.read_csv("Data-June-2023.txt", sep=" ", skip_blank_lines=False)
df.tail(20)
Index	Word	_	_	Tag
0	I	_	_	O
1	am	_	_	O
2	from	_	_	O
3	Madrid	_	_	B-City
4	NaN	NaN	NaN	NaN
5	Alice	_	_	B-Person
6	likes	_	_	O
7	Bob	_	_	B-Person
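To reproduce this frame without the original file, here is a minimal sketch that reads the same rows from an in-memory string. The file contents and the header line are assumptions based on the sample above (the real file may differ). Because the header repeats "_", pandas renames the second such column.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of "Data-June-2023.txt".
raw = (
    "Word _ _ Tag\n"
    "I _ _ O\n"
    "am _ _ O\n"
    "from _ _ O\n"
    "Madrid _ _ B-City\n"
    "\n"
    "Alice _ _ B-Person\n"
    "likes _ _ O\n"
    "Bob _ _ B-Person\n"
)

# skip_blank_lines=False keeps the empty line as a row of NaN values
# instead of silently dropping it.
df = pd.read_csv(io.StringIO(raw), sep=" ", skip_blank_lines=False)

# Index positions of the separator rows (rows where "Word" is NaN).
separator_rows = df.index[df["Word"].isna()].tolist()
print(separator_rows)
```

With the default skip_blank_lines=True the blank line would be discarded, and the sentence boundaries could no longer be recovered from the frame.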

I would like to create a new column called "Sentence #" based on the blank lines (NaN values).
Each NaN in "Word" marks the end of a sentence, so the first word after it should start a new sentence count: Sentence: 1, Sentence: 2, Sentence: 3, and so on.

Index	Sentence #	Word	_	_	Tag
0	Sentence: 1	I	_	_	O
1		        am	_	_	O
2		        from	_	_	O
3		        Madrid	_	_	B-City
4		        NaN	NaN	NaN	NaN
5	Sentence: 2	Alice	_	_	B-Person
6		        likes	_	_	O
7		        Bob	_	_	B-Person
8		        NaN	NaN	NaN	NaN
9	Sentence: 3	Alice	_	_	B-Person

Thank you in advance!

Answer 1

Score: 0

I would use boolean indexing:

# True on the first row and on each row that follows a NaN separator
m = df['Word'].isna().shift(fill_value=True)
# Label only those rows with a running sentence count
df.loc[m, 'Sentence'] = m.cumsum().astype(str).radd('Sentence: ')

Output:

   Index    Word    _    _       Tag     Sentence
0      0       I    _    _         O  Sentence: 1
1      1      am    _    _         O          NaN
2      2    from    _    _         O          NaN
3      3  Madrid    _    _    B-City          NaN
4      4     NaN  NaN  NaN       NaN          NaN
5      5   Alice    _    _  B-Person  Sentence: 2
6      6   likes    _    _         O          NaN
7      7     Bob    _    _  B-Person          NaN
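The same idea can be written as a self-contained sketch that also names the column "Sentence #" and places it first, as the question asks. This is a variant, not the answer's exact code. The sample frame is built by hand, the extra notna() check (so that consecutive blank rows do not start empty sentences) is an added assumption, and the names starts, sentence_id, labels and sentences are illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sample data; NaN marks a sentence separator.
df = pd.DataFrame({
    "Word": ["I", "am", "from", "Madrid", np.nan, "Alice", "likes", "Bob"],
    "Tag": ["O", "O", "O", "B-City", np.nan, "B-Person", "O", "B-Person"],
})

# A sentence starts on a real word whose previous row is a separator
# (the first row counts as following a separator).
starts = df["Word"].notna() & df["Word"].isna().shift(fill_value=True)

# Running count of sentence starts, i.e. the sentence number of every row.
sentence_id = starts.cumsum()

# Keep the label only on the first word of each sentence; other rows are NaN.
labels = ("Sentence: " + sentence_id.astype(str)).where(starts)

# Insert the new column at position 0 under the name used in the question.
df.insert(0, "Sentence #", labels)

# Usage example: rebuild each sentence from its words, skipping separators.
sentences = df["Word"].groupby(sentence_id).agg(lambda w: " ".join(w.dropna()))
print(df)
print(sentences)
```

Keeping sentence_id as a per-row number, rather than only the sparse labels, makes it easy to group or filter by sentence later.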
