R regular expression to split apart name strings

Question

I have a data frame with names listed in several different formats, and I'd like to split each name into first, middle, and last name. Here's an example data frame covering those formats:

df1 <- data.frame(name = c("Matt Smith", "Matt L. Smith", "Matt Louis Smith",
                           "M. Smith", "M.L. Smith", "M.L. Smith, Jr.", "M. Smith, Jr.", "Unknown"))

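I have code from a [previous question](https://stackoverflow.com/questions/76316213/r-if-else-statement-that-depends-on-number-of-elements-in-a-character-string) that splits non-initials correctly, but it fails when both a first and a middle initial are present. Here's that code and its output: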

library(dplyr)  # mutate(), %>%
library(tidyr)  # separate()

names <- df1 %>%
  mutate(name = gsub('([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?', '\\1=\\2=\\3', name)) %>%
  separate(name, c("Collector.First.Name1", "Collector.Middle1", "Collector.Last.Name1"), "=")

Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
1                  Matt                                  Smith
2                  Matt               L.                 Smith
3                  Matt            Louis                 Smith
4                    M.                                  Smith
5                  M.L.                                  Smith
6                  M.L.                             Smith, Jr.
7                    M.                             Smith, Jr.
8               Unknown                              

My ideal output would be this, with the initials correctly split:

Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
1                  Matt                                  Smith
2                  Matt               L.                 Smith
3                  Matt            Louis                 Smith
4                    M.                                  Smith
5                    M.               L.                 Smith
6                    M.               L.            Smith, Jr.
7                    M.                             Smith, Jr.
8               Unknown                               

Using R 4.2.2
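
A likely reason rows 5 and 6 fail: the first capture group's character class includes the period, so it consumes M.L. as a single token before the pattern ever reaches the space. A minimal base R sketch shows what each group picks up (the sample vector x and the names pattern and marked are only for illustration):

```r
# The regex from the question, unchanged.
pattern <- "([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?"

# Illustrative inputs: one initial, two fused initials, a full middle initial.
x <- c("M. Smith", "M.L. Smith", "Matt L. Smith")

# Mark the three capture groups with "=" so their contents are visible.
marked <- sub(pattern, "\\1=\\2=\\3", x)
print(marked)
# "M.==Smith"  "M.L.==Smith"  "Matt=L. =Smith"
```

In the second string the middle group is empty because M.L. was already taken by group 1. Either the initials need to be separated before matching, or group 1 has to stop after the first period.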

Answer 1

Score: 1

Another solution, using extract() from tidyr, is basically a one-liner:

library(tidyr)
df1 %>%
  extract(name,
          into = c("first", "middle", "last"),
          regex = "(\\w+\\.?)\\s*(?:([^\\s,]+)\\s)?(?:(\\w+(?:,\\sJr\\.)?))?")

    first middle       last
1    Matt             Smith
2    Matt     L.      Smith
3    Matt  Louis      Smith
4      M.             Smith
5      M.     L.      Smith
6      M.     L. Smith, Jr.
7      M.        Smith, Jr.
8 Unknown                  
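
To see how that regex is put together, here is a sketch that rebuilds it piece by piece and applies it with base R's regexec() and regmatches() rather than tidyr. The variable names and the three sample strings are my own, and perl = TRUE is passed so the pattern is read as a Perl-style regex; treat it as an illustration of the capture groups, not as what extract() does internally.

```r
# Same pattern as in the answer, split into commented pieces.
regex <- paste0(
  "(\\w+\\.?)",                # group 1: first name, or one initial plus optional period
  "\\s*",                      # optional whitespace (absent in "M.L.")
  "(?:([^\\s,]+)\\s)?",        # group 2 (optional): middle part, must be followed by whitespace
  "(?:(\\w+(?:,\\sJr\\.)?))?"  # group 3 (optional): last name, optionally with ", Jr."
)

x <- c("Matt Smith", "M.L. Smith, Jr.", "Unknown")

# regexec() returns the full match plus one entry per capture group.
m <- regmatches(x, regexec(regex, x, perl = TRUE))

# Drop the full match (element 1) and keep the three groups.
parts <- do.call(rbind, lapply(m, function(g) g[2:4]))
colnames(parts) <- c("first", "middle", "last")
print(parts)
```

Because group 1 allows at most one trailing period, it stops after M. and leaves L. for the middle group. In this sketch, optional groups that do not take part in the match come back as empty strings.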

Answer 2

Score: 0

Solved using Gregor's comment: add a pre-processing step that inserts a space after any period that is immediately followed by a non-space character.

library(dplyr)  # mutate(), %>%
library(tidyr)  # separate()

# Pre-processing: turn "M.L." into "M. L." so each initial is its own token
df2 <- df1 %>%
  mutate(name = gsub(pattern = "(\\.)([^ ])", "\\1 \\2", name))

names <- df2 %>%
  mutate(name = gsub('([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?', '\\1=\\2=\\3', name)) %>%
  separate(name, c("Collector.First.Name1", "Collector.Middle1", "Collector.Last.Name1"), "=")

Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
1                  Matt                                  Smith
2                  Matt               L.                 Smith
3                  Matt            Louis                 Smith
4                    M.                                  Smith
5                    M.               L.                 Smith
6                    M.               L.            Smith, Jr.
7                    M.                             Smith, Jr.
8               Unknown                               
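
A small base R sketch of what the pre-processing step does to the strings, using a few of the sample names (the variable names are only for illustration). It also shows one detail that is easy to miss: the middle capture group includes its trailing space, so trimming it with trimws() may be worthwhile.

```r
x <- c("M.L. Smith", "M.L. Smith, Jr.", "M. Smith, Jr.", "Matt L. Smith")

# Insert a space after any period that is directly followed by a non-space.
spaced <- gsub("(\\.)([^ ])", "\\1 \\2", x)
print(spaced)
# "M. L. Smith"  "M. L. Smith, Jr."  "M. Smith, Jr."  "Matt L. Smith"

# Apply the question's regex and split on the "=" markers.
pattern <- "([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?"
marked  <- sub(pattern, "\\1=\\2=\\3", spaced)
pieces  <- strsplit(marked, "=", fixed = TRUE)

# The second piece is the middle name; it carries a trailing space when present.
middle_raw <- vapply(pieces, function(p) p[2], character(1))
middle     <- trimws(middle_raw)
print(middle)
# "L." "L." ""   "L."
```

A period at the end of the string, as in Jr., is left alone because nothing follows it.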

Posted by huangapple on June 14, 2023 at 23:24:08
When reposting, please keep the link to this article: https://go.coder-hub.com/76475166.html