Pandas dataframe - Separate sentences by NaN values
Question
I have a Pandas data frame with one word per row in a column called "Word".
Sentences are separated by blank lines in the file, so I read it with skip_blank_lines=False, which keeps each blank line as a row of NaN values.
import pandas as pd
df = pd.read_csv("Data-June-2023.txt", sep=" ", skip_blank_lines=False)
df.tail(20)
Index  Word    _    _    Tag
0      I       _    _    O
1      am      _    _    O
2      from    _    _    O
3      Madrid  _    _    B-City
4      NaN     NaN  NaN  NaN
5      Alice   _    _    B-Person
6      likes   _    _    O
7      Bob     _    _    B-Person
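For reference, here is a minimal sketch of that read behaviour. It assumes the file has a header line `Word _ _ Tag` and uses an in-memory string (`raw`, a hypothetical stand-in for Data-June-2023.txt) instead of the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of Data-June-2023.txt.
raw = (
    "Word _ _ Tag\n"
    "I _ _ O\n"
    "am _ _ O\n"
    "from _ _ O\n"
    "Madrid _ _ B-City\n"
    "\n"
    "Alice _ _ B-Person\n"
    "likes _ _ O\n"
    "Bob _ _ B-Person\n"
)

# skip_blank_lines=False keeps the empty line as a row filled with NaN.
sample = pd.read_csv(io.StringIO(raw), sep=" ", skip_blank_lines=False)

# Positions of the separator rows (rows where "Word" is missing).
blank_rows = sample.index[sample["Word"].isna()].tolist()
print(blank_rows)
```

With the default skip_blank_lines=True the empty line would be dropped and the sentence boundary would be lost.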
I would like to create a new column called "Sentence #" based on the blank lines (NaN values).
Each NaN value in "Word" should start a new sentence, so the first word of each sentence is labeled Sentence: 1, Sentence: 2, Sentence: 3, and so on.
Index  Sentence #   Word    _    _    Tag
0      Sentence: 1  I       _    _    O
1                   am      _    _    O
2                   from    _    _    O
3                   Madrid  _    _    B-City
4                   NaN     NaN  NaN  NaN
5      Sentence: 2  Alice   _    _    B-Person
6                   likes   _    _    O
7                   Bob     _    _    B-Person
8                   NaN     NaN  NaN  NaN
9      Sentence: 3  Alice   _    _    B-Person
Thank you in advance!
Answer 1
Score: 0
I would use boolean indexing:
m = df['Word'].isna().shift(fill_value=True)
df.loc[m, 'Sentence'] = m.cumsum().astype(str).radd('Sentence: ')
Output:
   Index  Word    _    _    Tag       Sentence
0  0      I       _    _    O         Sentence: 1
1  1      am      _    _    O         NaN
2  2      from    _    _    O         NaN
3  3      Madrid  _    _    B-City    NaN
4  4      NaN     NaN  NaN  NaN       NaN
5  5      Alice   _    _    B-Person  Sentence: 2
6  6      likes   _    _    O         NaN
7  7      Bob     _    _    B-Person  NaN
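A self-contained sketch of the same idea, assuming a hand-built frame in place of the file. The names `starts` and `sentence_id` are made up for illustration. Unlike the two-liner above, it masks out the separator rows themselves, so a run of consecutive blank lines should not start extra sentences, and it writes to the "Sentence #" column named in the question.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question; the all-NaN row marks a sentence boundary.
df = pd.DataFrame({
    "Word": ["I", "am", "from", "Madrid", np.nan, "Alice", "likes", "Bob"],
    "Tag": ["O", "O", "O", "B-City", np.nan, "B-Person", "O", "B-Person"],
})

is_word = df["Word"].notna()

# True on the first row and on every row that directly follows a NaN separator,
# but never on a separator row itself.
starts = df["Word"].isna().shift(fill_value=True) & is_word

# Running sentence number: increases by one at each sentence start.
sentence_id = starts.cumsum()

# Label only the first word of each sentence, as in the desired output.
df.loc[starts, "Sentence #"] = "Sentence: " + sentence_id[starts].astype(str)

# Optional: a numeric id on every word row (NaN on separators), handy for groupby.
df["sentence_id"] = sentence_id.where(is_word)

print(df)
```

If every word row should carry the label rather than only the first one, grouping on `sentence_id` or forward-filling "Sentence #" within word rows are two possible follow-ups.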