Extracting multiple chunks of string between patterns
Question
This post asks how to extract a string between other two strings in R: https://stackoverflow.com/questions/39086400/extracting-a-string-between-other-two-strings-in-r
I'm seeking a similar answer, but now covering multiple occurrences between patterns.
Example string:
Fabricante: EMS S/A CNPJ: - 57.507.378/0001-01 Endereço: SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante: EMS S/A CNPJ: - 57.507.378/0003-65 Endereço: HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante: NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ: - 12.424.020/0001-79 Endereço: MANAUS - AM - BRASIL Etapa de Fabricaçao:
Between each occurrence of the words "Fabricante" and "CNPJ", there is a company name, which I would like to extract. In this string, there are three such companies: "EMS S/A", "EMS S/A", and "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS".
Based on the post above, this code
gsub(".*Fabricante: *(.+) CNPJ:.*", "\\1", df$manufacturing_location[92])
returns the last occurrence, "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS".
When I change to
gsub(".*Fabricante: *(.*?) CNPJ:.*", "\\1", df$manufacturing_location[92])
it returns the first. I tried changing to \\2, as I thought this would number occurrences, but then I get an empty string. I also tried using stringr's str_match_all, but it did not work either.
Does anyone know how to adjust the syntax so I can tailor the code to return each of the three as needed?
I would like to put this into a mutate call that I can apply to a dataset with many such strings, returning the first, second, and third entries as variables. So far, I have not been able to make str_match_all work for this.
Answer 1
Score: 2
We can use stringr's str_match_all as follows:
library(stringr)
x <- "Fabricante: EMS S/A CNPJ: - 57.507.378/0001-01 Endereço: SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante: EMS S/A CNPJ: - 57.507.378/0003-65 Endereço: HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante: NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ: - 12.424.020/0001-79 Endereço: MANAUS - AM - BRASIL Etapa de Fabricaçao:"
matches <- str_match_all(x, "(?<=\\bFabricante: ).*?(?= CNPJ:)")[[1]]
matches
     [,1]
[1,] "EMS S/A"
[2,] "EMS S/A"
[3,] "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA"
Here is an explanation of the regex pattern being used:
(?<=\\bFabricante: ) is a lookbehind asserting that "Fabricante: " precedes the match
.*? then matches all content up to the nearest occurrence of what comes next
(?= CNPJ:) is a lookahead asserting that " CNPJ:" follows the match
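To get the first, second, and third names as separate variables inside mutate, as the question asks, one possible sketch is shown here. It assumes dplyr is installed alongside stringr. The helper nth_match and the column names maker_1 to maker_3 are hypothetical names chosen for illustration, and str_extract_all is swapped in for str_match_all because it returns a plain list of character vectors, one per row.

```r
library(dplyr)
library(stringr)

# Example data: the string from the question, plus a shorter record that
# contains only its first manufacturer entry.
x <- "Fabricante: EMS S/A CNPJ: - 57.507.378/0001-01 Endereço: SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante: EMS S/A CNPJ: - 57.507.378/0003-65 Endereço: HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante: NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ: - 12.424.020/0001-79 Endereço: MANAUS - AM - BRASIL Etapa de Fabricaçao:"
short <- "Fabricante: EMS S/A CNPJ: - 57.507.378/0001-01 Endereço: SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao:"
df <- tibble(manufacturing_location = c(x, short))

pattern <- "(?<=\\bFabricante: ).*?(?= CNPJ:)"

# Hypothetical helper: take the n-th match of each row; rows with fewer
# than n matches get NA.
nth_match <- function(matches, n) {
  vapply(matches, function(m) m[n], character(1))
}

df <- df %>%
  mutate(
    makers  = str_extract_all(manufacturing_location, pattern),  # list column
    maker_1 = nth_match(makers, 1),
    maker_2 = nth_match(makers, 2),
    maker_3 = nth_match(makers, 3)
  ) %>%
  select(-makers)
```

Rows with fewer than three manufacturers simply get NA in the unused columns, so the same call can run over the whole dataset.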
Answer 2
Score: 0
You could strsplit at the key words and subset to the desired elements.
el(strsplit(x, '\\s?\\w*:\\s+'))[c(2, 6, 10)]
# [1] "EMS S/A" "EMS S/A"
# [3] "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA"
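The fixed indices c(2, 6, 10) rely on every record having the same four fields. If that cannot be guaranteed, a base R sketch that matches by pattern instead of by position might look like this. It is an alternative technique, not part of the answer above, and it reuses the lookaround pattern from Answer 1 with regmatches and gregexpr.

```r
x <- "Fabricante: EMS S/A CNPJ: - 57.507.378/0001-01 Endereço: SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante: EMS S/A CNPJ: - 57.507.378/0003-65 Endereço: HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante: NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ: - 12.424.020/0001-79 Endereço: MANAUS - AM - BRASIL Etapa de Fabricaçao:"

# perl = TRUE is needed for lookbehind/lookahead in base R regex functions.
m <- gregexpr("(?<=Fabricante: ).*?(?= CNPJ:)", x, perl = TRUE)

# regmatches returns one character vector of matches per input string.
companies <- regmatches(x, m)[[1]]
companies
```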
Answer 3
Score: 0
It seems that your data is in the Debian control file (DCF) format. You could use read.dcf in base R after adding line breaks to it. Then you can access any column of the data that you want.
read.dcf(textConnection(gsub("(Fabricante)", "\n\\1", gsub(" (\\S+:)", "\n\\1", x))), all = TRUE)
                                         Fabricante                 CNPJ                                     Endereço Fabricaçao
1                                           EMS S/A - 57.507.378/0001-01 SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de
2                                           EMS S/A - 57.507.378/0003-65           HORTOLANDIA - SP - BRASIL Etapa de
3 NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA - 12.424.020/0001-79                MANAUS - AM - BRASIL Etapa de
Breakdown (the _ placeholder requires R 4.2 or later):
gsub(" *(\\S+:)", "\n\\1", x) |>          # every keyword needs to start a new line
  gsub("(Fabricante)", "\n\\1", x = _) |>  # separate each record from the previous one
  textConnection() |>                      # convert to a file-like object
  read.dcf(all = TRUE)                     # read into R
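As a small sketch of the "access any column" step, the result of read.dcf(all = TRUE) is a data frame, so the company names should be available as the Fabricante column. The intermediate names with_breaks, records, con, and res are hypothetical and used only for illustration.

```r
x <- "Fabricante: EMS S/A CNPJ: - 57.507.378/0001-01 Endereço: SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante: EMS S/A CNPJ: - 57.507.378/0003-65 Endereço: HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante: NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ: - 12.424.020/0001-79 Endereço: MANAUS - AM - BRASIL Etapa de Fabricaçao:"

# Put each "key:" on its own line, then add a blank line before each record.
with_breaks <- gsub(" (\\S+:)", "\n\\1", x)
records     <- gsub("(Fabricante)", "\n\\1", with_breaks)

# Read the text as DCF through a connection and close it afterwards.
con <- textConnection(records)
res <- read.dcf(con, all = TRUE)
close(con)

# One row per record; the company names are in the Fabricante column.
res$Fabricante
```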