R regular expression to split apart name strings

Question

I have a data frame with names listed in various formats, and I'd like to split these names into first, middle, and last name. Here's an example data frame with different name formats:

    df1 <- data.frame(name = c("Matt Smith", "Matt L. Smith", "Matt Louis Smith",
                               "M. Smith", "M.L. Smith", "M.L. Smith, Jr.", "M. Smith, Jr.", "Unknown"))

I have code from [a previous question](https://stackoverflow.com/questions/76316213/r-if-else-statement-that-depends-on-number-of-elements-in-a-character-string) that splits apart non-initials correctly, but it fails when both a first and a middle initial are present. Here's that code and its output:

    library(dplyr)  # mutate(), %>%
    library(tidyr)  # separate()

    names <- df1 %>%
      # Join the three capture groups with "=" so separate() can split on it
      mutate(name = gsub('([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?', '\\1=\\2=\\3', name)) %>%
      separate(name, c("Collector.First.Name1", "Collector.Middle1", "Collector.Last.Name1"), "=")

      Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
    1 Matt                                    Smith
    2 Matt                  L.                Smith
    3 Matt                  Louis             Smith
    4 M.                                      Smith
    5 M.L.                                    Smith
    6 M.L.                                    Smith, Jr.
    7 M.                                      Smith, Jr.
    8 Unknown

My ideal output would be this, with the initials correctly split:

      Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
    1 Matt                                    Smith
    2 Matt                  L.                Smith
    3 Matt                  Louis             Smith
    4 M.                                      Smith
    5 M.                    L.                Smith
    6 M.                    L.                Smith, Jr.
    7 M.                                      Smith, Jr.
    8 Unknown

Using R 4.2.2
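
To see where the pattern goes wrong, it can help to look at the intermediate string that `gsub` produces before `separate` splits it. The following is a minimal base R sketch, not part of the original post; the variable names `pattern`, `x`, and `joined` are illustrative only.

```r
# The pattern from the question: first token, optional middle token, last token.
pattern <- "([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?"

# One name that splits as intended and one that does not.
x <- c("Matt L. Smith", "M.L. Smith")

# Join the three capture groups with "=" so the split points are visible.
joined <- gsub(pattern, "\\1=\\2=\\3", x)
print(joined)
```

Because `.` is allowed inside the first character class, "M.L." is consumed whole as the first token, which leaves the middle group empty for "M.L. Smith".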

Answer 1

Score: 1

Another solution using extract from tidyr; basically a one-liner:

    library(tidyr)

    df1 %>%
      extract(name,
              into = c("first", "middle", "last"),
              regex = "(\\w+\\.?)\\s*(?:([^\\s,]+)\\s)?(?:(\\w+(?:,\\sJr\\.)?))?")

      first   middle last
    1 Matt           Smith
    2 Matt    L.     Smith
    3 Matt    Louis  Smith
    4 M.             Smith
    5 M.      L.     Smith
    6 M.      L.     Smith, Jr.
    7 M.             Smith, Jr.
    8 Unknown
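
For readers who want to inspect the capture groups without tidyr, the same regular expression can also be tried in base R. This is only a rough sketch, assuming `perl = TRUE` is acceptable for this pattern; the names `groups` and `parts` are hypothetical and do not come from the answer.

```r
name <- c("Matt Smith", "M.L. Smith, Jr.", "Unknown")

# Same pattern as in the answer: first token (with optional dot),
# optional middle token, optional last name with an optional ", Jr." suffix.
pattern <- "(\\w+\\.?)\\s*(?:([^\\s,]+)\\s)?(?:(\\w+(?:,\\sJr\\.)?))?"

# regexec() returns the full match plus one entry per capture group.
groups <- regmatches(name, regexec(pattern, name, perl = TRUE))

# Drop the full match (element 1) and keep the three capture groups.
parts <- as.data.frame(
  do.call(rbind, lapply(groups, function(g) g[2:4])),
  stringsAsFactors = FALSE
)
names(parts) <- c("first", "middle", "last")
print(parts)
```

In this sketch, groups that do not take part in the match come back as empty strings, so "Unknown" ends up with an empty middle and last name.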

Answer 2

Score: 0

Solved using the comment by Gregor: add a pre-processing step that inserts a space after any dot that is followed by a non-space character.

    # Pre-processing: "M.L. Smith" becomes "M. L. Smith"
    df2 <- df1 %>%
      mutate(name = gsub(pattern = "(\\.)([^ ])", "\\1 \\2", name))

    # Same split as in the question, applied to the pre-processed names
    names <- df2 %>%
      mutate(name = gsub('([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?', '\\1=\\2=\\3', name)) %>%
      separate(name, c("Collector.First.Name1", "Collector.Middle1", "Collector.Last.Name1"), "=")

      Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
    1 Matt                                    Smith
    2 Matt                  L.                Smith
    3 Matt                  Louis             Smith
    4 M.                                      Smith
    5 M.                    L.                Smith
    6 M.                    L.                Smith, Jr.
    7 M.                                      Smith, Jr.
    8 Unknown
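
The two steps can also be followed on a plain character vector. The sketch below uses base R only and is an illustration rather than the answer's own code; `spaced`, `joined`, and `pieces` are hypothetical names, and the `trimws` call is an added assumption to drop the trailing space that the optional middle group captures.

```r
x <- c("M.L. Smith", "M.L. Smith, Jr.", "Matt L. Smith")

# Step 1: insert a space after any dot immediately followed by a non-space.
spaced <- gsub("(\\.)([^ ])", "\\1 \\2", x)
print(spaced)

# Step 2: apply the pattern from the question and mark the split points with "=".
pattern <- "([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?"
joined <- gsub(pattern, "\\1=\\2=\\3", spaced)

# Split on "=" and trim the space left at the end of the middle piece.
pieces <- lapply(strsplit(joined, "=", fixed = TRUE), trimws)
print(pieces[[2]])
```

Names that already have a space after each dot, such as "Matt L. Smith", pass through the first step unchanged.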

huangapple
  • Posted on June 14, 2023, 23:24:08
  • Please keep this link when reposting: https://go.coder-hub.com/76475166.html