Set the first day of the year when a date has only the year in a pandas DataFrame

Question

I have a column called "date" in a pandas DataFrame. These are the first 10 rows:

    0 22-Oct-2022
    1 3-Dec-2019
    2 27-Jun-2022
    3 2023
    4 15-Jul-2017
    5 2019
    6 7-Sep-2022
    7 2021
    8 30-Sep-2022
    9 17-Aug-2021

I want to convert all of those dates to a format like this, for example:

    0 2023-05-19
    1 2023-01-20
    2 ...

For the rows that contain only the year, I want the first day of that year. For example, if the original df has:

    5 2019
    7 2021

these should become:

    5 2019-01-01
    7 2021-01-01

In other words, in these cases I want to set the first day of the year while keeping the original year, not the current year.

I tried:

    df['date'] = pd.to_datetime(df['date'], errors='coerce', format='%d-%b-%Y')

However, this produces NaT values. I hope the case is clear; I would appreciate any ideas for fixing this.

Thanks.
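
The NaT values come from the year-only rows: a string such as "2023" does not match the format '%d-%b-%Y', and errors='coerce' turns every non-matching value into NaT. A minimal sketch that reproduces this, assuming the column holds strings and an English locale for the month abbreviations (the variable name `parsed` is only illustrative):

```python
import pandas as pd

# A few rows copied from the question, assumed to be stored as strings.
df = pd.DataFrame(
    {"date": ["22-Oct-2022", "3-Dec-2019", "27-Jun-2022", "2023", "15-Jul-2017"]}
)

# Same call as in the question, kept in a separate variable for inspection.
parsed = pd.to_datetime(df["date"], errors="coerce", format="%d-%b-%Y")

# Show the original strings that were coerced to NaT.
print(df.loc[parsed.isna(), "date"].tolist())
```

Only the year-only value should appear in the printed list, which is why those rows need separate handling.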

Answer 1

Score: 5

You can set the format to "mixed" (new in pandas 2.0.0, see GH50972) when calling to_datetime:

> format : str, default None
>
> "mixed", to infer the format for each element individually. This is
> risky, and you should probably use it along with dayfirst.

    df["date"] = pd.to_datetime(df["date"], format="mixed", dayfirst=True)

Or use a classic double date-parsing + fillna:

    df["date"] = (
        pd.to_datetime(df["date"], errors="coerce", format="%Y")
        .fillna(pd.to_datetime(df["date"], errors="coerce", dayfirst=True))
    )

Output:

    print(df)

            date
    0 2022-10-22
    1 2019-12-03
    2 2022-06-27
    3 2023-01-01
    4 2017-07-15
    5 2019-01-01
    6 2022-09-07
    7 2021-01-01
    8 2022-09-30
    9 2021-08-17

Answer 2

Score: 1

You'll have to update the values manually. First, you can standardize the rows that contain only the year, like this:

    condition = data['date'].str.len() == 4
    data.loc[condition, 'date'] = '1-Jan-' + data['date'].astype(str)

Then call to_datetime on the result.
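
A runnable sketch of this manual approach, assuming the column holds strings and that year-only rows are exactly four characters long. It uses `.str.len()` to measure the string length and prepends "1-Jan-" with a trailing hyphen so the result matches the '%d-%b-%Y' format; the sample data is a subset of the rows from the question.

```python
import pandas as pd

# A subset of the question's data, assumed to be stored as strings.
data = pd.DataFrame({"date": ["22-Oct-2022", "2023", "15-Jul-2017", "2019"]})

# Boolean mask for rows that hold only a four-character year.
condition = data["date"].str.len() == 4

# Turn "2023" into "1-Jan-2023" so every row has the same layout.
data.loc[condition, "date"] = "1-Jan-" + data.loc[condition, "date"]

# All rows now share one format, so a single strict parse is enough.
data["date"] = pd.to_datetime(data["date"], format="%d-%b-%Y")
print(data)
```

Since no errors='coerce' is passed in the final call, any value that still does not match the format raises an error rather than silently becoming NaT.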

huangapple
  • Posted on May 25, 2023, 07:06:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76327919.html