Extract dates in various formats from a string in R

Question

I need to quickly extract dates from character vectors.
I have two main issues:

  • Various date formats (European and American, alphanumeric and numeric...)
  • Multiple dates in each vector.

My vectors look something like this:

c("11/09/2016 Invoice Number . Date P.O. # Amount Discount Paid Amount 2017/015 10/28/2016 CC6/ $50,000.00 $0.00 $50,000-00 2017/016 10/28/2016 CC67 $50,000.00 $0.00 $50,000-00 2017-017 10/28/2016 CC67 $50,000.00 . $0.00 $50,000.00 TOTALS: $150,000.00 $0.00 $150,000.00     ")

I have tried parse_date and strptime without success.
I do not know regex syntax and do not really have time to dig into it.

Thank you very much for your help.

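Answer 1

Score: 2

We can use str_extract_all to extract every date that matches this pattern: two digits, a /, two digits, another /, and then four digits. str_extract_all returns a list with one element per input string, so [[1]] picks the matches for the first (here, the only) element of v1.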

library(stringr)
str_extract_all(v1, "\\d{2}/\\d{2}/\\d{4}")[[1]]

### data (define v1 before running the call above)

v1 <- c("11/09/2016 Invoice Number . Date P.O. # Amount Discount Paid Amount 2017/015 10/28/2016 CC6/ $50,000.00 $0.00 $50,000-00 2017/016 10/28/2016 CC67 $50,000.00 $0.00 $50,000-00 2017-017 10/28/2016 CC67 $50,000.00 . $0.00 $50,000.00 TOTALS: $150,000.00 $0.00 $150,000.00 ")
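
The pattern above only covers the dd/dd/dddd shape. Since the question also mentions other formats, here is a minimal base R sketch of how several shapes could be combined into one pattern. The helper name extract_date_strings and the three chosen shapes are assumptions for illustration, not part of the original answer; the word boundaries (\\b) are there so that fragments such as 2017/015 are not picked up.

```r
# Hypothetical helper (not from the original answer): pull date-like
# substrings in a few common shapes out of each element of a character vector.
extract_date_strings <- function(x) {
  pattern <- paste(
    "\\b\\d{1,2}[/-]\\d{1,2}[/-]\\d{4}\\b",   # 10/28/2016 or 28-10-2016
    "\\b\\d{4}-\\d{2}-\\d{2}\\b",             # 2016-10-28
    "\\b\\d{1,2} [A-Z][a-z]{2,8} \\d{4}\\b",  # 28 October 2016
    sep = "|"
  )
  # gregexpr finds every match; regmatches returns one character vector
  # per input string, collected in a list.
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

# Illustrative input, made up for this sketch.
x <- c(
  "Paid 2016-10-28, due 28 October 2016 or 11/09/2016",
  "2017/015 10/28/2016 CC67 $50,000.00"
)
found <- extract_date_strings(x)
print(found)
```

Because the result is a list with one entry per input string, several invoices can be processed in one call, and unlist(found) flattens everything into a single character vector. The extracted values are still strings; turning them into dates is a separate step.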

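Answer 2

Score: 2

If you need actual R dates, you have to decide which reading wins when a date is ambiguous: American (month first) or European (day first).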

library(tidyverse)
library(lubridate)

v1 <-  c("11/09/2016 Invoice Number . Date P.O. # Amount Discount Paid Amount 2017/015 10/28/2016 CC6/ $50,000.00 $0.00 $50,000-00 2017/016 10/28/2016 CC67 $50,000.00 $0.00 $50,000-00 2017-017 10/28/2016 CC67 $50,000.00 . $0.00 $50,000.00 TOTALS: $150,000.00 $0.00 $150,000.00")

str_extract_all(v1, "\\d{2}/\\d{2}/\\d{4}")[[1]] %>%
  tibble(value = .) %>%
  mutate(american_date = value %>% mdy,
         european_date = value %>% dmy,
         stronger_american = coalesce(american_date,european_date),
         stronger_european = coalesce(european_date,american_date))

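Warning: 3 failed to parse.

The warning comes from dmy, which cannot read the three 10/28/2016 values as day/month/year. The code produces the following output: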

# A tibble: 4 x 5
  value      american_date european_date stronger_american stronger_european
  <chr>      <date>        <date>        <date>            <date>           
1 11/09/2016 2016-11-09    2016-09-11    2016-11-09        2016-09-11       
2 10/28/2016 2016-10-28    NA            2016-10-28        2016-10-28       
3 10/28/2016 2016-10-28    NA            2016-10-28        2016-10-28       
4 10/28/2016 2016-10-28    NA            2016-10-28        2016-10-28

Created on 2020-01-06 by the reprex package (v0.3.0)
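
The same "preferred format first, fallback second" idea can be sketched in base R without lubridate. The helper names parse_with_priority and is_ambiguous are hypothetical, introduced only for this illustration; the sketch assumes the dates have already been extracted as dd/dd/dddd strings.

```r
# Hypothetical helper: try each format in turn and keep the first
# successful parse for every element.
parse_with_priority <- function(x, formats) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)                          # elements still unparsed
    out[todo] <- as.Date(x[todo], format = fmt) # NA when the format does not fit
  }
  out
}

# Hypothetical helper: a slash date is ambiguous when both leading
# fields could be a month (that is, both are at most 12).
is_ambiguous <- function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)
  vapply(parts, function(p) all(as.integer(p[1:2]) <= 12L), logical(1))
}

dates <- c("11/09/2016", "10/28/2016")
american_first <- parse_with_priority(dates, c("%m/%d/%Y", "%d/%m/%Y"))
european_first <- parse_with_priority(dates, c("%d/%m/%Y", "%m/%d/%Y"))
ambiguous <- is_ambiguous(dates)

print(data.frame(dates, american_first, european_first, ambiguous))
```

Only the ambiguous rows differ between the two priorities, so flagging them makes it easy to review those values by hand instead of trusting either convention blindly.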


huangapple
  • Posted on 2020-01-07 01:39:15
  • Please keep this link when reposting: https://go.coder-hub.com/59616576.html