June 14, 2023 23:24:08

R regular expression to split apart name strings

Question

I have a data frame with names listed in various formats, and I'd like to split these names into first, middle, and last name. Here's an example data frame with different name formats:

df1 <- data.frame(name = c("Matt Smith", "Matt L. Smith", "Matt Louis Smith",
                           "M. Smith", "M.L. Smith", "M.L. Smith, Jr.", "M. Smith, Jr.", "Unknown"))

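I have code from a [previous question](https://stackoverflow.com/questions/76316213/r-if-else-statement-that-depends-on-number-of-elements-in-a-character-string) that splits apart non-initials correctly, but it fails when the first and middle initials are both present with no space between them (e.g. "M.L. Smith"). Here's that code and its output: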

library(dplyr)  # mutate(), %>%
library(tidyr)  # separate()
names <- df1 %>%
  mutate(name = gsub('([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?', '\\1=\\2=\\3', name)) %>%
  separate(name, c("Collector.First.Name1", "Collector.Middle1", "Collector.Last.Name1"), "=")
Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
1                  Matt                                  Smith
2                  Matt               L.                 Smith
3                  Matt            Louis                 Smith
4                    M.                                  Smith
5                  M.L.                                  Smith
6                  M.L.                             Smith, Jr.
7                    M.                             Smith, Jr.
8               Unknown

My ideal output would be this, with the initials correctly split:

Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
1                  Matt                                  Smith
2                  Matt               L.                 Smith
3                  Matt            Louis                 Smith
4                    M.                                  Smith
5                    M.               L.                 Smith
6                    M.               L.            Smith, Jr.
7                    M.                             Smith, Jr.
8               Unknown

Using R 4.2.2

Answer 1

Score: 1

Another solution using extract from tidyr; basically a one-liner:

library(tidyr)
df1 %>%
  extract(name,
          into = c("first", "middle", "last"),
          regex = "(\\w+\\.?)\\s*(?:([^\\s,]+)\\s)?(?:(\\w+(?:,\\sJr\\.)?))?")
    first middle       last
1    Matt             Smith
2    Matt     L.      Smith
3    Matt  Louis      Smith
4      M.             Smith
5      M.     L.      Smith
6      M.     L. Smith, Jr.
7      M.        Smith, Jr.
8 Unknown
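
For anyone who wants to see what each capture group picks up without loading tidyr, the same pattern can be applied in base R. This is only a sketch: `split_name` is a made-up helper name (not from any package), and it assumes the names look like the examples in the question, so the only suffix it handles is `, Jr.`.

```r
# Hypothetical helper: applies the pattern from the answer above with base R only.
split_name <- function(x) {
  # Group 1: first name or initial, with an optional trailing period.
  # Group 2: optional middle name/initial, which must be followed by whitespace.
  # Group 3: optional last name, optionally followed by ", Jr."
  pattern <- "(\\w+\\.?)\\s*(?:([^\\s,]+)\\s)?(?:(\\w+(?:,\\sJr\\.)?))?"
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each match is c(full match, group 1, group 2, group 3); keep the groups.
  parts <- t(vapply(m, function(g) {
    if (length(g) == 4L) g[2:4] else rep(NA_character_, 3)
  }, character(3)))
  out <- as.data.frame(parts, stringsAsFactors = FALSE)
  names(out) <- c("first", "middle", "last")
  out
}

split_name(c("Matt Smith", "M.L. Smith, Jr.", "Unknown"))
```

Groups that do not take part in a match come back as empty strings here, which lines up with the blank cells in the output above.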

Answer 2

Score: 0

Solved using Gregor's comment: add a pre-processing step that inserts a space after any period immediately followed by a non-space character, then run the original code on the result.

df2 <- df1 %>%
  mutate(name = gsub(pattern = "(\\.)([^ ])", "\\1 \\2", name))
names <- df2 %>%
  mutate(name = gsub('([A-Za-z\\.?]+) ([A-Za-z\\.]+ )?([A-Za-z]+)?', '\\1=\\2=\\3', name)) %>%
  separate(name, c("Collector.First.Name1", "Collector.Middle1", "Collector.Last.Name1"), "=")
Collector.First.Name1 Collector.Middle1 Collector.Last.Name1
1                  Matt                                  Smith
2                  Matt               L.                 Smith
3                  Matt            Louis                 Smith
4                    M.                                  Smith
5                    M.               L.                 Smith
6                    M.               L.            Smith, Jr.
7                    M.                             Smith, Jr.
8               Unknown
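
To see what the pre-processing step does on its own, here is a small sketch run on a plain character vector (the vector `x` and the name `spaced` are just for illustration). Only periods that are directly followed by a non-space character get a space inserted, so "M.L." becomes "M. L." while "L. " and the final period in "Jr." are left alone.

```r
# Illustrative input taken from the question's example names.
x <- c("M.L. Smith", "M.L. Smith, Jr.", "Matt L. Smith", "Unknown")

# \\1 is the period, \\2 is the non-space character that followed it.
spaced <- gsub("(\\.)([^ ])", "\\1 \\2", x)

spaced
# Elements, in order:
#   "M. L. Smith"
#   "M. L. Smith, Jr."
#   "Matt L. Smith"
#   "Unknown"
```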
