R regular expression to split apart name strings
Question
I have a data frame with names listed in several different formats, and I'd like to split each name into first, middle, and last name. Here's an example data frame covering the formats:
df1 <- data.frame(name = c("Matt Smith", "Matt L. Smith", "Matt Louis Smith",
                           "M. Smith", "M.L. Smith", "M.L. Smith, Jr.", "M. Smith, Jr.", "Unknown"))
I have code from a [previous question](https://stackoverflow.com/questions/76316213/r-if-else-statement-that-depends-on-number-of-elements-in-a-character-string) that splits full names and single initials correctly, but it fails when a first and a middle initial are both present with no space between them (for example "M.L. Smith"). Here's that code and its output:
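library(dplyr)
library(tidyr)
names <- df1 %>%
  # \\1, \\2 and \\3 are the three capture groups; "=" is a temporary delimiter for separate()
  mutate(name = gsub('([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?', '\\1=\\2=\\3', name)) %>%
  separate(name, c("Collector.First.Name1", "Collector.Middle1", "Collector.Last.Name1"), "=")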
Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
1                  Matt                                  Smith
2                  Matt               L.                 Smith
3                  Matt            Louis                 Smith
4                    M.                                  Smith
5                  M.L.                                  Smith
6                  M.L.                             Smith, Jr.
7                    M.                             Smith, Jr.
8               Unknown                              
My ideal output would be this, with the initials correctly split:
Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
1                  Matt                                  Smith
2                  Matt               L.                 Smith
3                  Matt            Louis                 Smith
4                    M.                                  Smith
5                    M.               L.                 Smith
6                    M.               L.            Smith, Jr.
7                    M.                             Smith, Jr.
8               Unknown                               
Using R 4.2.2
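To see where the pattern goes wrong, it may help to run the substitution on its own, outside the pipeline. The following is a minimal base-R sketch; the variable names (pattern, x, marked, parts) are illustrative and not part of the original code.

```r
# Same pattern as in the question, applied to one working and one failing input.
pattern <- "([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?"
x <- c("Matt L. Smith", "M.L. Smith")

# Mark the three capture groups with "=" as a temporary delimiter.
marked <- gsub(pattern, "\\1=\\2=\\3", x)
print(marked)
# "Matt=L. =Smith" "M.L.==Smith"

# Split on the delimiter to see what ends up in each column.
parts <- strsplit(marked, "=", fixed = TRUE)
print(parts[[2]])
# "M.L." ""     "Smith"
```

Because the first character class includes the period, "M.L." is consumed as a single first-name token, and the optional middle group matches nothing. The pattern only separates tokens at a space, so initials written without one stay glued together.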
Answer 1
Score: 1
Another solution, using extract() from tidyr; it is basically a one-liner:
library(tidyr)
df1 %>%
  extract(name,
          into = c("first", "middle", "last"),
          regex = "(\\w+\\.?)\\s*(?:([^\\s,]+)\\s)?(?:(\\w+(?:,\\sJr\\.)?))?")
    first middle       last
1    Matt             Smith
2    Matt     L.      Smith
3    Matt  Louis      Smith
4      M.             Smith
5      M.     L.      Smith
6      M.     L. Smith, Jr.
7      M.        Smith, Jr.
8 Unknown                  
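For anyone who wants to inspect what each capture group picks up, the same regex can be tried in base R. This is only a rough sketch: it assumes that regexec() with perl = TRUE handles the pattern the same way extract() does for these inputs, and the names regex, x, m and groups are illustrative.

```r
# The regex from the answer: (first)(optional middle)(optional last, with optional ", Jr.")
regex <- "(\\w+\\.?)\\s*(?:([^\\s,]+)\\s)?(?:(\\w+(?:,\\sJr\\.)?))?"
x <- c("Matt Smith", "M.L. Smith, Jr.", "Unknown")

# regexec() returns the full match plus one entry per capture group.
m <- regmatches(x, regexec(regex, x, perl = TRUE))

# Drop the full match (first element) and stack the groups into a matrix.
groups <- do.call(rbind, lapply(m, function(g) g[-1]))
colnames(groups) <- c("first", "middle", "last")
rownames(groups) <- x
print(groups)
```

The first group takes a word plus an optional trailing period, so "M.L." is cut after "M.". The middle group only matches when it is followed by whitespace, which is why "Smith" in "Matt Smith" falls through to the last-name group.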
Answer 2
Score: 0
Solved using the comment by Gregor: add a pre-processing step that inserts a space after any period that is immediately followed by a non-space character, then run the original code on the result.
library(dplyr)
library(tidyr)
# Insert a space after any period that is directly followed by a non-space character
df2 <- df1 %>%
  mutate(name = gsub(pattern = "(\\.)([^ ])", "\\1 \\2", name))
names <- df2 %>%
  mutate(name = gsub('([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?', '\\1=\\2=\\3', name)) %>%
  separate(name, c("Collector.First.Name1", "Collector.Middle1", "Collector.Last.Name1"), "=")
Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
1                  Matt                                  Smith
2                  Matt               L.                 Smith
3                  Matt            Louis                 Smith
4                    M.                                  Smith
5                    M.               L.                 Smith
6                    M.               L.            Smith, Jr.
7                    M.                             Smith, Jr.
8               Unknown                               
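The pre-processing step can be checked in isolation before it is wired into the pipeline. Here is a small base-R sketch using a few of the sample names; the variable names x and spaced are illustrative.

```r
# A few of the sample names, including ones that should be left untouched.
x <- c("M.L. Smith", "M.L. Smith, Jr.", "M. Smith, Jr.", "Matt L. Smith")

# Group 1 is the period, group 2 is the non-space character right after it;
# the replacement puts a single space between them.
spaced <- gsub("(\\.)([^ ])", "\\1 \\2", x)
print(spaced)
# "M. L. Smith" "M. L. Smith, Jr." "M. Smith, Jr." "Matt L. Smith"
```

Periods that are already followed by a space, or that end the string (as in "Jr."), are left alone, so names that were already in a usable format pass through unchanged.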