Pandas dataframe - Separate sentences by NaN values

Question

I have a Pandas DataFrame with one word per row in a column called "Word".
Sentences are separated by blank lines in the source file, so I read it with skip_blank_lines=False to keep each blank line as a row of NaN values.

import pandas as pd

df = pd.read_csv("Data-June-2023.txt", sep=" ", skip_blank_lines=False)
df.tail(20)

Index	Word	_	_	Tag

0	I	_	_	O
1	am	_	_	O
2	from	_	_	O
3	Madrid	_	_	B-City
4	NaN	NaN	NaN	NaN
5	Alice	_	_	B-Person
6	likes	_	_	O
7	Bob	_	_	B-Person
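
For anyone who wants to reproduce this without the original file, here is a minimal sketch that reads the same rows from an in-memory string. The sample text is a hypothetical stand-in for "Data-June-2023.txt" (it omits the "Index" column), and the variable names `lines`, `sample`, and `blank_rows` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical stand-in for "Data-June-2023.txt": one token per line,
# with an empty line between sentences.
lines = [
    "Word _ _ Tag",
    "I _ _ O",
    "am _ _ O",
    "from _ _ O",
    "Madrid _ _ B-City",
    "",
    "Alice _ _ B-Person",
    "likes _ _ O",
    "Bob _ _ B-Person",
]
sample = "\n".join(lines) + "\n"

# skip_blank_lines=False keeps the empty line as a row of NaN values
# instead of silently dropping it.
df = pd.read_csv(io.StringIO(sample), sep=" ", skip_blank_lines=False)

# Positions of the separator rows (rows whose "Word" is NaN).
blank_rows = df.index[df["Word"].isna()].tolist()
print(blank_rows)
```

With the default `skip_blank_lines=True`, the separator row would be dropped and the sentence boundary would no longer be visible in the frame.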

I would like to create a new column called "Sentence #" based on the blank lines (NaN values).
Each NaN in "Word" marks the end of a sentence, so the first row after it should start a new label: Sentence: 1, Sentence: 2, Sentence: 3, and so on.

Index	Sentence #	Word	_	_	Tag

0	Sentence: 1	I	_	_	O
1		        am	_	_	O
2		        from	_	_	O
3		        Madrid	_	_	B-City
4		        NaN	NaN	NaN	NaN
5	Sentence: 2	Alice	_	_	B-Person
6		        likes	_	_	O
7		        Bob	_	_	B-Person
8		        NaN	NaN	NaN	NaN
9	Sentence: 3	Alice	_	_	B-Person

Thank you in advance!

Answer 1

Score: 0

I would use boolean indexing:

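# True on the first row and on every row that directly follows a NaN row
m = df['Word'].isna().shift(fill_value=True)
# Label only those sentence-start rows; all other rows stay NaN
df.loc[m, 'Sentence'] = m.cumsum().astype(str).radd('Sentence: ')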

Output:

   Index    Word    _    _       Tag     Sentence
0      0       I    _    _         O  Sentence: 1
1      1      am    _    _         O          NaN
2      2    from    _    _         O          NaN
3      3  Madrid    _    _    B-City          NaN
4      4     NaN  NaN  NaN       NaN          NaN
5      5   Alice    _    _  B-Person  Sentence: 2
6      6   likes    _    _         O          NaN
7      7     Bob    _    _  B-Person          NaN
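
The same approach can be tried on a small frame built by hand. This is only a sketch: the frame below copies the sample rows from the question (without the "Index" and "_" columns), and the extra `sentence_id` column is a hypothetical addition, not part of the answer, for cases where every row should carry its sentence number.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; the NaN row is the sentence separator.
df = pd.DataFrame({
    "Word": ["I", "am", "from", "Madrid", np.nan, "Alice", "likes", "Bob"],
    "Tag": ["O", "O", "O", "B-City", np.nan, "B-Person", "O", "B-Person"],
})

# Shift the NaN mask down by one row so that it marks the row *after*
# each separator; fill_value=True also marks the very first row.
m = df["Word"].isna().shift(fill_value=True)

# The running count of sentence starts gives the sentence number.
df.loc[m, "Sentence"] = m.cumsum().astype(str).radd("Sentence: ")

# Hypothetical extra: a numeric sentence id on every row
# (the separator row keeps the id of the sentence it closes).
df["sentence_id"] = m.cumsum()

print(df)
```

One caveat with this mask: if the file contains two blank lines in a row, the second NaN row also follows a NaN row and would be counted as a sentence start, so such input may need extra handling.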

huangapple
  • Posted on June 8, 2023 at 01:50:42
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76425889.html