Get rid of all the numbers and commas that are at least 2 spaces to the right of the word

Question


I'm trying to scrape this [table of regions](https://learn.microsoft.com/en-us/azure/cognitive-services/Speech-Service/regions#speech-service) that support Microsoft's Speech service. I've managed to get the following character vector:

```R
region <- c("southafricanorth 6", "eastasia 5", "southeastasia 1,2,3,4,5", 
"australiaeast 1,2,3,4", "centralindia 1,2,3,4,5", "japaneast 2,5", 
"japanwest", "koreacentral 2", "canadacentral 1", "northeurope 1,2,4,5", 
"westeurope 1,2,3,4,5", "francecentral", "germanywestcentral", 
"norwayeast", "switzerlandnorth 6", "switzerlandwest", "uksouth 1,2,3,4", 
"uaenorth 6", "brazilsouth 6", "centralus", "eastus 1,2,3,4,5", 
"eastus2 1,2,4,5", "northcentralus 4,6", "southcentralus 1,2,3,4,5,6", 
"westcentralus 5", "westus 2,5", "westus2 1,2,4,5", "westus3"
)
```

What is the regex that gets rid of all the numbers and commas that are at least 2 spaces to the right of the words? For example, I just want `westus2` instead of `westus2 1,2,4,5`.

I've tried this to no avail: `gsub("\\s{2,}\\d+.*", "", region)`




# Answer 1
**Score**: 4

The region names without the superscripts are contained inside `<code>` tags in the HTML, so you could avoid the need for regexes by modifying your scraping code to something like:

```R
library(rvest)

url <- "https://learn.microsoft.com/en-us/azure/cognitive-services/Speech-Service/regions"

regions <- read_html(url) %>%
  # first table only
  html_element("table") %>%
  html_elements("code") %>%
  html_text()

regions
#  [1] "southafricanorth"   "eastasia"           "southeastasia"      "australiaeast"
#  [5] "centralindia"       "japaneast"          "japanwest"          "koreacentral"
#  [9] "canadacentral"      "northeurope"        "westeurope"         "francecentral"
# [13] "germanywestcentral" "norwayeast"         "switzerlandnorth"   "switzerlandwest"
# [17] "uksouth"            "uaenorth"           "brazilsouth"        "centralus"
# [21] "eastus"             "eastus2"            "northcentralus"     "southcentralus"
# [25] "westcentralus"      "westus"             "westus2"            "westus3"
```

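To see why selecting the `<code>` elements sidesteps the superscripts, here is a small offline sketch. The HTML fragment is a made-up stand-in for the real table (the actual page markup may differ), and `minimal_html()` is used only so the example runs without a network connection.

```r
library(rvest)

# Hypothetical fragment imitating two cells of the regions table;
# the structure of the real page is an assumption here.
page <- minimal_html('
  <table>
    <tr><td><code>westus2</code> <sup>1,2,4,5</sup></td></tr>
    <tr><td><code>westus3</code></td></tr>
  </table>
')

first_table <- html_element(page, "table")

# Text of the whole cell: includes the footnote numbers from <sup>
cell_text <- html_text(html_elements(first_table, "td"))

# Text of the <code> elements only: just the region identifiers
code_text <- html_text(html_elements(first_table, "code"))

print(cell_text)
print(code_text)
```

Because the footnote numbers live in a sibling element rather than inside `<code>`, they never reach `code_text`, so no string cleanup is needed afterwards.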

# Answer 2
**Score**: 2

Another elegant solution is the `word()` function from the stringr package. By default it returns the first word, as its signature shows:

```R
word(string, start = 1L, end = start, sep = fixed(" "))
```

```R
library(stringr)

word(region)
#  [1] "southafricanorth"   "eastasia"           "southeastasia"      "australiaeast"     
#  [5] "centralindia"       "japaneast"          "japanwest"          "koreacentral"      
#  [9] "canadacentral"      "northeurope"        "westeurope"         "francecentral"     
# [13] "germanywestcentral" "norwayeast"         "switzerlandnorth"   "switzerlandwest"   
# [17] "uksouth"            "uaenorth"           "brazilsouth"        "centralus"         
# [21] "eastus"             "eastus2"            "northcentralus"     "southcentralus"    
# [25] "westcentralus"      "westus"             "westus2"            "westus3"
```
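If adding a package just for this feels heavy, a rough base R equivalent of "take the first space-delimited token" might look like the sketch below. The `sample_regions` vector and the `first_token()` helper are illustrative names, not part of the original answer.

```r
# A small subset of the question's data, used here only for illustration
sample_regions <- c("westus2 1,2,4,5", "japanwest", "eastus 1,2,3,4,5")

# Hypothetical helper: split each string on a literal space and keep the first piece
first_token <- function(x) {
  pieces <- strsplit(x, " ", fixed = TRUE)
  vapply(pieces, function(p) p[1], character(1))
}

first_regions <- first_token(sample_regions)
print(first_regions)
```

Strings without a space, such as `"japanwest"`, come back unchanged because `strsplit()` returns them as a single piece.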

# Answer 3
**Score**: 2


Your regex does not match because your strings do not contain two consecutive spaces: each name is separated from its numbers by a single space. If you change `\\s{2,}` to `\\s`, it should give the expected result.

```R
sub("\\s\\d+.*", "", region)
# [1] "southafricanorth"   "eastasia"           "southeastasia"     
# [4] "australiaeast"      "centralindia"       "japaneast"         
# [7] "japanwest"          "koreacentral"       "canadacentral"     
#[10] "northeurope"        "westeurope"         "francecentral"     
#[13] "germanywestcentral" "norwayeast"         "switzerlandnorth"  
#[16] "switzerlandwest"    "uksouth"            "uaenorth"          
#[19] "brazilsouth"        "centralus"          "eastus"            
#[22] "eastus2"            "northcentralus"     "southcentralus"    
#[25] "westcentralus"      "westus"             "westus2"           
#[28] "westus3"           
```

In this case it looks like it could be simplified to

```R
sub(" .*", "", region)
```

or

```R
sub(" .+", "", region)
```
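As a quick check of the diagnosis above, the following sketch tests the original pattern against a few of the strings and then applies a variant that tolerates one or more whitespace characters. The `sample_regions` vector and the `\\s+\\d+(,\\d+)*$` pattern are illustrative choices, not taken from the answer.

```r
# A small subset of the question's data, used here only for illustration
sample_regions <- c("westus2 1,2,4,5", "japanwest", "southcentralus 1,2,3,4,5,6")

# The original pattern demands at least two whitespace characters in a row
matches_original <- grepl("\\s{2,}\\d+.*", sample_regions)

# Variant: one or more whitespace characters, then a comma-separated
# list of numbers running to the end of the string
cleaned <- sub("\\s+\\d+(,\\d+)*$", "", sample_regions)

print(matches_original)
print(cleaned)
```

Anchoring on the whitespace is what keeps the digit in names like `westus2` intact: that `2` is not preceded by a space, so it is never part of the match.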

huangapple
  • Posted on May 17, 2023, 11:59:21
  • Please keep this link when reposting: https://go.coder-hub.com/76268464.html