March 1, 2023, 14:20:30


Extracting multiple chunks of string between patterns

Question

This post asks how to extract a string between two other strings in R: https://stackoverflow.com/questions/39086400/extracting-a-string-between-other-two-strings-in-r

I'm seeking a similar answer, but now covering multiple occurrences between patterns.

Example string:

Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Endereço:  SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65  Endereço:  HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante:  NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ:  - 12.424.020/0001-79  Endereço:  MANAUS - AM - BRASIL Etapa de Fabricaçao:

Between each occurrence of the words "Fabricante" and "CNPJ", there is a company name, which I would like to extract. In this string, there are three such companies: "EMS S/A", "EMS S/A", and "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS".

Based on the post above, this code

gsub(".*Fabricante: *(.+) CNPJ:.*", "\\1", df$manufacturing_location[92])

returns the last occurrence, "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS".

When I change to

gsub(".*Fabricante: *(.*?) CNPJ:.*", "\\1", df$manufacturing_location[92])

it returns the first. I tried changing to \\2 because I thought this would index the occurrences, but then I get an empty string. I also tried stringr's str_match_all, but that did not work either.

Does anyone know how to adjust the syntax so I can tailor the code to return each of the three as needed?

I would like to put this into a mutate call that I can apply to a dataset with many such strings, returning the first, second, and third entries as variables. So far I have not been able to make str_match_all work for this.

Answer 1

Score: 2

We can use str_match_all as follows:

library(stringr)

x <- "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Endereço:  SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65  Endereço:  HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante:  NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ:  - 12.424.020/0001-79  Endereço:  MANAUS - AM - BRASIL Etapa de Fabricaçao:"
matches <- str_match_all(x, "(?<=\\bFabricante:  ).*?(?= CNPJ:)")[[1]]
matches

     [,1]
[1,] "EMS S/A"
[2,] "EMS S/A"
[3,] "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA"

Here is an explanation of the regex pattern being used:

(?<=\\bFabricante:  ) is a lookbehind asserting that "Fabricante:" followed by two spaces immediately precedes the match
.*? then lazily matches as little as possible, up to the nearest point where the next assertion holds
(?= CNPJ:) is a lookahead asserting that " CNPJ:" follows
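
To return the first, second, and third names as separate variables, as the question asks, the same pattern can be applied to every row. The following is only a minimal base R sketch under stated assumptions: it swaps str_match_all for gregexpr and regmatches with perl = TRUE, the data frame contents are made up and shortened, and the helper nth_match and the manufacturer_* column names are hypothetical.

```r
# Hypothetical, shortened data standing in for the asker's df
df <- data.frame(
  manufacturing_location = c(
    "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01 Fabricante:  NOVAMED LTDA CNPJ:  - 12.424.020/0001-79",
    "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer; lookarounds need perl = TRUE in base R
pattern <- "(?<=\\bFabricante:  ).*?(?= CNPJ:)"

# One character vector of matches per row
found <- regmatches(
  df$manufacturing_location,
  gregexpr(pattern, df$manufacturing_location, perl = TRUE)
)

# Illustrative helper: the k-th match, or NA when a row has fewer matches
nth_match <- function(m, k) if (length(m) >= k) m[k] else NA_character_

df$manufacturer_1 <- vapply(found, nth_match, character(1), k = 1)
df$manufacturer_2 <- vapply(found, nth_match, character(1), k = 2)
df$manufacturer_3 <- vapply(found, nth_match, character(1), k = 3)

df[, c("manufacturer_1", "manufacturer_2", "manufacturer_3")]
```

Rows with fewer than three matches get NA in the remaining columns. The same vapply calls could also be placed inside mutate().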

Answer 2

Score: 0

You could strsplit at the key words and subset to the desired elements.

el(strsplit(x, '\\s?\\w*:\\s+'))[c(2, 6, 10)]
# [1] "EMS S/A"                                           "EMS S/A"
# [3] "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA"
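
The fixed indices c(2, 6, 10) fit this particular string with its three records. As a hedged sketch, assuming every record contains the same four fields in the same order, the company names sit at every fourth piece starting from the second, so the indices can be computed from the number of pieces. The shortened ASCII string x2 below is made up for illustration, and [[1]] is used in place of el().

```r
# Made-up sample with two records (ASCII only, shortened)
x2 <- paste(
  "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Endereco:  HORTOLANDIA - SP - BRASIL Etapa de Fabricacao:",
  "Fabricante:  NOVAMED LTDA CNPJ:  - 12.424.020/0001-79  Endereco:  MANAUS - AM - BRASIL Etapa de Fabricacao:"
)

# Split at every "keyword: " occurrence, as in the answer
parts <- strsplit(x2, "\\s?\\w*:\\s+")[[1]]

# Assumption: four pieces per record, company name is the second of each group
companies <- parts[seq(2, length(parts), by = 4)]
companies
```

If some records lack a field, the fixed stride no longer lines up, so this only works for uniformly structured strings.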

Answer 3

Score: 0

It seems that your data is in the Debian control file (DCF) format. You could use read.dcf from base R after adding line breaks to it. Then you can access any column of the data that you want.

read.dcf(textConnection(gsub("(Fabricante)", "\n\\1", gsub(" (\\S+:)", "\n\\1", x))), all = TRUE)
                                         Fabricante                 CNPJ                                     Endereço Fabricaçao
1                                           EMS S/A - 57.507.378/0001-01 SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de           
2                                           EMS S/A - 57.507.378/0003-65           HORTOLANDIA - SP - BRASIL Etapa de           
3 NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA - 12.424.020/0001-79                MANAUS - AM - BRASIL Etapa de

Breakdown:

gsub(" *(\\S+:)", "\n\\1", x) |> # Every keyword needs to start a new line
  gsub("(Fabricante)", "\n\\1", x = _) |> # Separate each record from the previous one with a blank line
  textConnection() |> # Convert to a file-like object that can be read
  read.dcf(all = TRUE) # Read into R as a data frame
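
As a hedged illustration of accessing a column afterwards, the sketch below applies the same two substitutions to a shortened, made-up ASCII string and pulls out the Fabricante field. The names x3, txt, con, and res are illustrative, and intermediate variables replace the pipe so that the native pipe placeholder is not required.

```r
# Made-up sample with two records (ASCII only, shortened)
x3 <- paste(
  "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Endereco:  HORTOLANDIA - SP - BRASIL",
  "Fabricante:  NOVAMED LTDA CNPJ:  - 12.424.020/0001-79  Endereco:  MANAUS - AM - BRASIL"
)

# Put every "keyword:" on its own line
txt <- gsub(" *(\\S+:)", "\n\\1", x3)
# Insert a blank line before each record so read.dcf sees separate records
txt <- gsub("(Fabricante)", "\n\\1", txt)

con <- textConnection(txt)
res <- read.dcf(con, all = TRUE)  # all = TRUE returns a data frame
close(con)

# One company name per record
res$Fabricante
```

Other fields, such as res$CNPJ, can be read from the same data frame.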
