Pandas dataframe - Separate sentences by NaN values

Question

I have a Pandas DataFrame with one word per row in a column called "Word".
Each sentence is separated by an empty line (an empty string ""), so I pass skip_blank_lines=False to keep the separators, which show up as rows of NaN.

  import pandas as pd

  # Keep blank lines so that each sentence break becomes a row of NaN
  df = pd.read_csv("Data-June-2023.txt", sep=" ", skip_blank_lines=False)
  df.tail(20)

  Index  Word    _    _    Tag
  0      I       _    _    O
  1      am      _    _    O
  2      from    _    _    O
  3      Madrid  _    _    B-City
  4      NaN     NaN  NaN  NaN
  5      Alice   _    _    B-Person
  6      likes   _    _    O
  7      Bob     _    _    B-Person

I would like to create a new column called "Sentence #" by iterating over the blank lines (NaN values).
Each NaN value in "Word" should start a new sentence count: Sentence: 1, Sentence: 2, Sentence: 3, and so on.

  Index  Sentence #   Word    _    _    Tag
  0      Sentence: 1  I       _    _    O
  1                   am      _    _    O
  2                   from    _    _    O
  3                   Madrid  _    _    B-City
  4                   NaN     NaN  NaN  NaN
  5      Sentence: 2  Alice   _    _    B-Person
  6                   likes   _    _    O
  7                   Bob     _    _    B-Person
  8                   NaN     NaN  NaN  NaN
  9      Sentence: 3  Alice   _    _    B-Person

Thank you in advance!

Answer 1

Score: 0

I would use boolean indexing:

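  # True on the first row and on every row right after a NaN in "Word" (sentence starts)
  m = df['Word'].isna().shift(fill_value=True)
  # Write "Sentence: <n>" on those rows only; all other rows stay NaN
  df.loc[m, 'Sentence'] = m.cumsum().astype(str).radd('Sentence: ')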

Output:

     Index  Word    _    _    Tag       Sentence
  0  0      I       _    _    O         Sentence: 1
  1  1      am      _    _    O         NaN
  2  2      from    _    _    O         NaN
  3  3      Madrid  _    _    B-City    NaN
  4  4      NaN     NaN  NaN  NaN       NaN
  5  5      Alice   _    _    B-Person  Sentence: 2
  6  6      likes   _    _    O         NaN
  7  7      Bob     _    _    B-Person  NaN
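
For anyone who wants to try this without the original file, here is a minimal, self-contained sketch of the same masking idea. The in-memory DataFrame is a hypothetical stand-in for the question's data (the two "_" columns are left out), the target column is named "Sentence #" as the question asked, and the extra `sentence_id` column is my own addition, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's file; the NaN row marks a sentence break.
df = pd.DataFrame(
    {
        "Word": ["I", "am", "from", "Madrid", np.nan, "Alice", "likes", "Bob"],
        "Tag": ["O", "O", "O", "B-City", np.nan, "B-Person", "O", "B-Person"],
    }
)

# True on the first row and on every row that directly follows a NaN in "Word".
starts = df["Word"].isna().shift(fill_value=True)

# Running count of sentence starts, so each row carries the number of its sentence.
sentence_id = starts.cumsum()

# Label only the first word of each sentence; the remaining rows stay NaN.
df.loc[starts, "Sentence #"] = "Sentence: " + sentence_id[starts].astype(str)

# Assumed extra (not in the answer): a numeric id on every non-blank row,
# which can be convenient for a later groupby.
df["sentence_id"] = sentence_id.where(df["Word"].notna())

print(df)
```

One caveat with this approach: if the file can contain two blank lines in a row, the second blank row also follows a NaN, so it would receive a label and the count would advance by one extra step. Dropping repeated blank rows first avoids that.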

huangapple
  • Posted on June 8, 2023 at 01:50:42
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76425889.html