Extract values after a word/letters from one column to a new column

Question


Based on the data below, how can I create another column containing the values that follow the letters "US" in every row of column GEOID?

Data:

id = c(1, 2, 3, 4, 5)
GEOID = c("1400000US42101000101", "1400000US42101000102", "1400000US42101000103",
"1400000US42101000104", "1400000US42101000105")

df = data.frame(id, GEOID)

Desired output:

id = c(1, 2, 3, 4, 5)
GEOID = c("1400000US42101000101", "1400000US42101000102", "1400000US42101000103",
"1400000US42101000104", "1400000US42101000105")
GEOID_Final = c("42101000101", "42101000102", "42101000103", "42101000104", "42101000105")
Desired_df = data.frame(id, GEOID, GEOID_Final)

Code:

library(stringr)
library(dplyr)

desired_df = df %>% word(?, sep = "US") # Stuck

Answer 1

Score: 3


I would use a regular expression to match the digits before "US", together with "US" itself, and remove them from the string, keeping only the digits after "US".

library(stringr); library(dplyr)

id = c(1, 2, 3, 4, 5)
GEOID = c("1400000US42101000101", "1400000US42101000102", "1400000US42101000103",
          "1400000US42101000104", "1400000US42101000105")

df = data.frame(id, GEOID)


df %>%
  mutate(
    # remove the leading digits and the "US" that follows them
    GEOID_final = str_replace(GEOID, pattern="\\d+US" , replacement ="")
  )
#   id                GEOID GEOID_final
# 1  1 1400000US42101000101 42101000101
# 2  2 1400000US42101000102 42101000102
# 3  3 1400000US42101000103 42101000103
# 4  4 1400000US42101000104 42101000104
# 5  5 1400000US42101000105 42101000105

`str_replace` takes a string (or in this case a character vector, GEOID) and replaces whatever `pattern` matches with `replacement`. The pattern I use is `\\d+US`, which means "match one or more digits, followed by US". The match is replaced with an empty string, which leaves only the digits after US. You could also use `.+US`, which means "match anything before US, and US itself".
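
The difference between the two patterns only shows up when "US" occurs more than once in a string. The following is a minimal sketch in base R (using `sub()` rather than `str_replace`, and `[0-9]` in place of `\\d`); the string `"14US42US7"` is a made-up example, not part of the question's data.

```r
# Illustrative sketch (base R only). The "14US42US7" string is hypothetical
# and exists only to show how the two patterns differ.
GEOID <- c("1400000US42101000101", "1400000US42101000102")

# Anchored pattern: strip only the leading digits and the first "US".
anchored <- sub("^[0-9]+US", "", GEOID)

# Greedy pattern: strip everything up to and including the LAST "US".
greedy <- sub(".+US", "", GEOID)

# On the question's data both patterns give the same result.
identical(anchored, greedy)

# They diverge on a string that contains "US" twice.
tricky <- "14US42US7"
anchored_tricky <- sub("^[0-9]+US", "", tricky)  # leaves "42US7"
greedy_tricky   <- sub(".+US", "", tricky)       # leaves "7"
```

For GEOID values shaped like the ones in the question, either pattern should work; the anchored form is simply the more conservative choice if the data might contain a second "US".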

Hope this helps!

Answer 2

Score: 1


If you want to use word(), as in your attempt:

desired_df = df %>% word(?, sep = "US") # Stuck

try this with stringr::word. Replacing "US" with a space first lets word() split on its default separator (a single space) and return the second piece:

word(string = str_replace(GEOID, pattern = 'US', replacement = " "), start = 2, end = 2)

Result:

> word(string = str_replace(GEOID, pattern = 'US', replacement = " "), start = 2, end = 2)
[1] "42101000101" "42101000102" "42101000103" "42101000104" "42101000105"

Save the result in a third column, or use it inside mutate:

df %>% mutate(
  GEOID_Final = word(string = str_replace(GEOID, pattern = 'US', replacement = " "), start = 2, end = 2)
)

result

  id                GEOID GEOID_Final
1  1 1400000US42101000101 42101000101
2  2 1400000US42101000102 42101000102
3  3 1400000US42101000103 42101000103
4  4 1400000US42101000104 42101000104
5  5 1400000US42101000105 42101000105
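
As a possible shortcut, `word()` also accepts a `sep` argument, so the intermediate `str_replace` step may not be needed. The sketch below assumes each GEOID contains "US" exactly once and uses a shortened two-row version of the question's data; `fixed("US")` is used so the separator is treated literally.

```r
library(stringr)

# Two-row sample in the same shape as the question's data.
df <- data.frame(
  id = 1:2,
  GEOID = c("1400000US42101000101", "1400000US42101000102")
)

# Treat "US" as the separator and take the second piece of each string.
df$GEOID_Final <- word(df$GEOID, start = 2, sep = fixed("US"))
df$GEOID_Final
```

The same call should also work inside a dplyr pipeline, for example `df %>% mutate(GEOID_Final = word(GEOID, 2, sep = fixed("US")))`, which is close to what the original attempt was aiming for.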

Posted on March 7, 2023, 09:18:59 (https://go.coder-hub.com/75657246.html)