R regular expression to split apart name strings
Question
I have a data frame with names listed in several different formats, and I'd like to split each name into first, middle, and last name. Here's an example data frame with the different formats:
df1 <- data.frame(name = c("Matt Smith", "Matt L. Smith", "Matt Louis Smith",
                           "M. Smith", "M.L. Smith", "M.L. Smith, Jr.", "M. Smith, Jr.", "Unknown"))
I have code from a [previous question](https://stackoverflow.com/questions/76316213/r-if-else-statement-that-depends-on-number-of-elements-in-a-character-string) that splits full (non-initial) names correctly, but it fails when a first and a middle initial are both present with no space between them, as in "M.L. Smith". Here's that code and its output:
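library(dplyr)
library(tidyr)
names <- df1 %>%
  # mark the two split points with "=", then split on that marker
  mutate(name = gsub('([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?', '\\1=\\2=\\3', name)) %>%
  separate(name, c("Collector.First.Name1", "Collector.Middle1", "Collector.Last.Name1"), "=")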
  Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
1                  Matt                                  Smith
2                  Matt                L.                Smith
3                  Matt             Louis                Smith
4                    M.                                  Smith
5                  M.L.                                  Smith
6                  M.L.                             Smith, Jr.
7                    M.                             Smith, Jr.
8               Unknown
My ideal output would be this, with the initials correctly split:
  Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
1                  Matt                                  Smith
2                  Matt                L.                Smith
3                  Matt             Louis                Smith
4                    M.                                  Smith
5                    M.                L.                Smith
6                    M.                L.           Smith, Jr.
7                    M.                             Smith, Jr.
8               Unknown
Using R 4.2.2
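To see where the pattern goes wrong, the substitution step can be run on its own in base R. This is only a diagnostic sketch: `sample_names`, `marked`, and `pieces` are made-up names, and `strsplit()` stands in for `separate()` here. The first group's character class includes the dot, so "M.L." is consumed as a single token and nothing is left for the middle-name group.

```r
# Original pattern from the question, unchanged
pattern <- "([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?"

# Hypothetical two-name subset of df1$name: one that works, one that fails
sample_names <- c("Matt L. Smith", "M.L. Smith")

# Same substitution as in the pipeline: "=" marks the two split points
marked <- gsub(pattern, "\\1=\\2=\\3", sample_names)
print(marked)

# Splitting on the marker shows an empty middle piece for "M.L. Smith"
pieces <- strsplit(marked, "=", fixed = TRUE)
print(pieces)
```

The middle piece for "Matt L. Smith" also keeps a trailing space, because the space sits inside the second capture group.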
Answer 1
Score: 1
Another solution, using extract() from tidyr; it is essentially a one-liner:
library(tidyr)
df1 %>%
  extract(name,
          into = c("first", "middle", "last"),
          regex = "(\\w+\\.?)\\s*(?:([^\\s,]+)\\s)?(?:(\\w+(?:,\\sJr\\.)?))?")
    first middle       last
1    Matt             Smith
2    Matt     L.      Smith
3    Matt  Louis      Smith
4      M.             Smith
5      M.     L.      Smith
6      M.     L. Smith, Jr.
7      M.        Smith, Jr.
8 Unknown
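To check what each capture group picks up without loading tidyr, the same pattern can be tried with base R's `regexec()` and `regmatches()`. This is a rough sketch, not how `extract()` works internally: it assumes `perl = TRUE` treats the pattern the same way, and `hits` and `parts` are illustrative names.

```r
name <- c("Matt Smith", "Matt L. Smith", "Matt Louis Smith", "M. Smith",
          "M.L. Smith", "M.L. Smith, Jr.", "M. Smith, Jr.", "Unknown")

# Same pattern as in the answer: three capture groups, the last two optional
pattern <- "(\\w+\\.?)\\s*(?:([^\\s,]+)\\s)?(?:(\\w+(?:,\\sJr\\.)?))?"

# One character vector per name: the full match followed by the three groups
hits <- regmatches(name, regexec(pattern, name, perl = TRUE))

# Keep only the capture groups (elements 2 to 4) and stack them row-wise
parts <- t(vapply(hits, function(h) h[2:4], character(3)))
colnames(parts) <- c("first", "middle", "last")
print(parts)
```

The first group takes a word plus an optional dot, so "M.L." is split after "M."; the second group only matches when it is followed by whitespace, which is why "Smith" in "M. Smith, Jr." falls through to the last group.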
Answer 2
Score: 0
Solved using the comment by Gregor: add a pre-processing step that inserts a space after any dot that is followed by a non-space character.
# pre-processing: insert a space after a dot followed by a non-space, so "M.L." becomes "M. L."
df2 <- df1 %>%
  mutate(name = gsub(pattern = "(\\.)([^ ])", "\\1 \\2", name))
names <- df2 %>%
  mutate(name = gsub('([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?', '\\1=\\2=\\3', name)) %>%
  separate(name, c("Collector.First.Name1", "Collector.Middle1", "Collector.Last.Name1"), "=")
  Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
1                  Matt                                  Smith
2                  Matt                L.                Smith
3                  Matt             Louis                Smith
4                    M.                                  Smith
5                    M.                L.                Smith
6                    M.                L.           Smith, Jr.
7                    M.                             Smith, Jr.
8               Unknown
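The pre-processing step can be tried in isolation to see which names it changes. The sketch below uses base R only; `sample_names`, `spaced`, and `spaced_strict` are illustrative names, and the lookahead variant is my own assumption about a stricter alternative, not something from the answer or from Gregor's comment.

```r
# Hypothetical subset of df1$name
sample_names <- c("M.L. Smith", "M.L. Smith, Jr.", "M. Smith, Jr.", "Matt L. Smith")

# Same substitution as in the answer: a dot followed by any non-space character
# gets a space inserted between the two
spaced <- gsub("(\\.)([^ ])", "\\1 \\2", sample_names)
print(spaced)

# Assumed stricter variant: only add the space when a letter follows the dot,
# so a sequence such as "Jr.," would be left alone
spaced_strict <- gsub("\\.(?=[A-Za-z])", ". ", sample_names, perl = TRUE)
print(identical(spaced, spaced_strict))
```

On this sample both versions give the same result; they would differ only on input where a dot is followed by punctuation or a digit.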