Create a data frame in R from a string that follows a specific keyword

Question

I have a string as such:

"05/05/2005 ANNIVERSARY $367.62 ANNUAL DIVIDEND DECLARED UNDER THE PAIO UP ADDITIONS 20,965 2,203 23,168 | PAID UP ADDITION OPTION. $367.62 PURCHASED PAID UP ADDITIONS OF 2,203 02/15/2006 WITHDRAWAL ($77.50) VALUE OF PAID UP ADDITIONS OF 464 PAID UP ADDITIONS 23,168 (464) 22,704 APPLIED TOWARDS CHECK-O-MATIC PREMIUM DUE 03/05/2006 04/11/2006 05/05/2006 ANNIVERSARY $415.70"

I would like to create a data frame in R holding, for every occurrence of the word ANNIVERSARY in the string, the date just before it and the dollar amount just after it. Desired output:

Date       Dividend
05/05/2005 $367.62
05/05/2006 $415.70

Thank you in advance.

I tried splitting the string with str_split but don't know where to go from there.

Answer 1

Score: 1

If we only need the dates and dollar amounts, we can use str_extract_all with regex lookarounds (or, in newer versions of stringr, a capture group):

library(stringr)
library(tibble)
# date immediately followed by the keyword (lookahead)
dates <- str_extract_all(str1, "\\d{2}/\\d{2}/\\d{4}(?=\\s+ANNIVERSARY)")[[1]]
# dollar amount immediately preceded by the keyword (lookbehind)
amounts <- str_extract_all(str1, "(?<=ANNIVERSARY )\\$[0-9.]+")[[1]]
tibble(dates, amounts)
# A tibble: 2 × 2
  dates      amounts
  <chr>      <chr>  
1 05/05/2005 $367.62
2 05/05/2006 $415.70
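
The capture-group route mentioned above could be sketched with str_match_all(), which takes each date and amount from the same match, so the two columns cannot drift out of step. This is only a minimal sketch and not the answerer's code: the shortened sample string, the slightly stricter amount pattern, and the DateParsed and DividendNum columns (which assume month/day/year dates) are assumptions added for illustration.

```r
library(stringr)

# Shortened sample of the string from the question (assumption: trimmed for brevity)
str1 <- "05/05/2005 ANNIVERSARY $367.62 ANNUAL DIVIDEND DECLARED 02/15/2006 WITHDRAWAL ($77.50) 03/05/2006 04/11/2006 05/05/2006 ANNIVERSARY $415.70"

# Group 1 = date before the keyword, group 2 = dollar amount after it
pattern <- "(\\d{2}/\\d{2}/\\d{4})\\s+ANNIVERSARY\\s+(\\$[0-9,]+(?:\\.\\d+)?)"

# str_match_all() returns one matrix per input string:
# column 1 is the full match, columns 2 and 3 are the capture groups
m <- str_match_all(str1, pattern)[[1]]

res <- data.frame(Date = m[, 2], Dividend = m[, 3], stringsAsFactors = FALSE)

# Optional typed columns (assumes the dates are month/day/year)
res$DateParsed <- as.Date(res$Date, format = "%m/%d/%Y")
res$DividendNum <- as.numeric(gsub("[$,]", "", res$Dividend))

print(res)
```

Keeping the original character columns next to the parsed ones makes it easy to check that nothing was lost in the conversion.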

Another option is to extract each substring around 'ANNIVERSARY' and read the pieces with read.table or fread:

library(data.table)
fread(text = str_extract_all(str1, "\\S+\\s+ANNIVERSARY\\s+\\S+")[[1]], 
   header = FALSE, col.names = c("dates", "amounts"), drop = 2)
       dates amounts
1: 05/05/2005 $367.62
2: 05/05/2006 $415.70
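
For the read.table variant, a base R sketch without extra packages might look like the following. It is an illustration under assumptions rather than the answerer's code: the sample string is shortened, and the names hits, res_tbl and the intermediate "keyword" column are hypothetical.

```r
# Shortened sample of the string from the question (assumption: trimmed for brevity)
str1 <- "05/05/2005 ANNIVERSARY $367.62 ANNUAL DIVIDEND DECLARED 02/15/2006 WITHDRAWAL ($77.50) 03/05/2006 04/11/2006 05/05/2006 ANNIVERSARY $415.70"

# Pull out every "<token> ANNIVERSARY <token>" chunk
hits <- regmatches(
  str1,
  gregexpr("\\S+\\s+ANNIVERSARY\\s+\\S+", str1, perl = TRUE)
)[[1]]

# Each chunk is read as one whitespace-separated row with three fields
res_tbl <- read.table(
  text = hits,
  header = FALSE,
  col.names = c("dates", "keyword", "amounts"),
  stringsAsFactors = FALSE
)

# Drop the constant keyword column
res_tbl <- res_tbl[, c("dates", "amounts")]

print(res_tbl)
```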

Data

str1 <- "05/05/2005 ANNIVERSARY $367.62 ANNUAL DIVIDEND DECLARED UNDER THE PAIO UP ADDITIONS 20,965 2,203 23,168 | PAID UP ADDITION OPTION. $367.62 PURCHASED PAID UP ADDITIONS OF 2,203 02/15/2006 WITHDRAWAL ($77.50) VALUE OF PAID UP ADDITIONS OF 464 PAID UP ADDITIONS 23,168 (464) 22,704 APPLIED TOWARDS CHECK-O-MATIC PREMIUM DUE 03/05/2006 04/11/2006 05/05/2006 ANNIVERSARY $415.70"

huangapple
  • Posted on 2023-02-06 06:29:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75355942.html