Extracting multiple chunks of string between patterns

Question

This post asks how to extract a string between other two strings in R: https://stackoverflow.com/questions/39086400/extracting-a-string-between-other-two-strings-in-r

I'm seeking a similar answer, but now covering multiple occurrences between patterns.

Example string:

Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Endereço:  SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65  Endereço:  HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante:  NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ:  - 12.424.020/0001-79  Endereço:  MANAUS - AM - BRASIL Etapa de Fabricaçao:

Between each occurrence of the words "Fabricante" and "CNPJ", there is a company name, which I would like to extract. In this string, there are three such companies: "EMS S/A", "EMS S/A", and "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS".

Based on the post above, this code

gsub(".*Fabricante: *(.+) CNPJ:.*", "\\1", df$manufacturing_location[92])

returns the last occurrence, "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS".

When I change to

gsub(".*Fabricante: *(.*?) CNPJ:.*", "\\1", df$manufacturing_location[92])

it returns the first. I tried changing to \\2, as I thought this would number the occurrences, but then I get an empty string. I also tried using stringr's str_match_all, but that did not work either.

Does anyone know how to adjust the syntax so I can tailor the code to return each of the three as needed?

I would like to put this into a mutate syntax where I can pass this onto a dataset with many such strings, and return the first, second, and third entries as variables. For this, I have found I cannot make str_match_all work.

Answer 1

Score: 2

We can use str_match_all as follows:

library(stringr)

x <- "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Endereço:  SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65  Endereço:  HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante:  NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ:  - 12.424.020/0001-79  Endereço:  MANAUS - AM - BRASIL Etapa de Fabricaçao:"
matches <- str_match_all(x, "(?<=\\bFabricante:  ).*?(?= CNPJ:)")[[1]]
matches

     [,1]
[1,] "EMS S/A"
[2,] "EMS S/A"
[3,] "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA"

Here is an explanation of the regex pattern being used:

  • (?<=\\bFabricante:  ) lookbehind: assert that "Fabricante:" followed by two spaces immediately precedes the match
  • .*? then lazily match as few characters as possible, up to the nearest
  • (?= CNPJ:) lookahead: assert that " CNPJ:" follows
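
The question also asks for the first, second, and third names as separate variables across many rows. Below is a minimal sketch of one way to do that. It assumes a hypothetical data frame `df` with the question's `manufacturing_location` column; the helper `nth_company` and the `company_*` column names are made up for illustration. It also swaps the lookarounds for a plain capture group, so the names sit in column 2 of each match matrix.

```r
library(stringr)

# Hypothetical two-row data frame; only the column name comes from the question.
df <- data.frame(
  manufacturing_location = c(
    "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01 Fabricante:  NOVAMED LTDA CNPJ:  - 12.424.020/0001-79",
    "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65"
  ),
  stringsAsFactors = FALSE
)

# str_match_all() returns one matrix per row of df:
# column 1 is the full match, column 2 is the capture group.
match_list <- str_match_all(
  df$manufacturing_location,
  "Fabricante:\\s*(.*?)\\s*CNPJ:"
)
companies <- lapply(match_list, function(m) m[, 2])

# Hypothetical helper: the i-th company of every row, NA where there is none.
nth_company <- function(i) {
  vapply(companies, function(v) v[i], character(1))
}

df$company_1 <- nth_company(1)
df$company_2 <- nth_company(2)
df$company_3 <- nth_company(3)

df[, c("company_1", "company_2", "company_3")]
```

Rows with fewer than three matches get NA in the remaining columns. For an ungrouped data frame, the same `nth_company(i)` calls should also work inside dplyr's mutate(), since each call returns one value per row.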

Answer 2

Score: 0

You could strsplit at the keywords and subset to the desired elements.

el(strsplit(x, '\\s?\\w*:\\s+'))[c(2, 6, 10)]
# [1] "EMS S/A"                                           "EMS S/A"
# [3] "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA"

Answer 3

Score: 0

It seems that your data is in the Debian control file (DCF) format. You could use read.dcf from base R after adding line breaks to it. Then you can access any column of the data that you want.

read.dcf(textConnection(gsub("(Fabricante)", "\n\\1", gsub(" (\\S+:)", "\n\\1", x))), all = TRUE)
                                         Fabricante                 CNPJ                                     Endereço Fabricaçao
1                                           EMS S/A - 57.507.378/0001-01 SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de
2                                           EMS S/A - 57.507.378/0003-65           HORTOLANDIA - SP - BRASIL Etapa de
3 NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA - 12.424.020/0001-79                MANAUS - AM - BRASIL Etapa de

Breakdown:

gsub(" *(\\S+:)", "\n\\1", x) |> # every keyword starts a new line
  gsub("(Fabricante)", "\n\\1", x = _) |> # a blank line separates each record from the previous one
  textConnection() |> # convert to a file-like connection
  read.dcf(all = TRUE) # read into R
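
To pull out just the company names, the result can be indexed by column. The sketch below uses a hypothetical, shortened ASCII input in the same layout as the question's string (the accents are dropped only to keep the example encoding-independent). The names `dcf_text`, `con`, and `records` are illustrative.

```r
# Hypothetical shortened input in the same layout as the question's string.
x <- paste(
  "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Endereco:  HORTOLANDIA - SP - BRASIL Etapa de Fabricacao:",
  "Fabricante:  NOVAMED LTDA CNPJ:  - 12.424.020/0001-79  Endereco:  MANAUS - AM - BRASIL Etapa de Fabricacao:"
)

# Put every keyword on its own line, then add a blank line before each
# "Fabricante" so that read.dcf() sees one record per manufacturer.
dcf_text <- gsub("(Fabricante)", "\n\\1", gsub(" (\\S+:)", "\n\\1", x))

con <- textConnection(dcf_text)
records <- read.dcf(con, all = TRUE)  # all = TRUE returns a data frame
close(con)

# One company name per record.
records$Fabricante
```

Because each record becomes a row, the other fields (for example `records$CNPJ`) are available the same way.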

huangapple
  • Posted on 2023-03-01 14:20:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75600178.html