Getting ValueError: time data doesn't match format "%Y-%m-%d %H:%M:%S.%f%z" error
Question
I am trying to subtract start_time_ns from end_time_ns in a pandas DataFrame using:
df['time'] = pd.to_datetime(df['end_time_ns']) - pd.to_datetime(df['start_time_ns'])
Both timestamps are given in nanoseconds.
I am reading the CSV with pd.read_csv(filename, parse_dates=[2, 3], chunksize=chunksize), where columns 2 and 3 are start_time_ns and end_time_ns, respectively.
The subtraction works fine for the first chunk, but fails when I apply it to the full ~30 GB CSV file. The error I get is:
Traceback (most recent call last):
File "2rg.py", line 17, in <module>
df['time'] = pd.to_datetime(df['end_time_ns']) - pd.to_datetime(df['start_time_ns'])
File "/home/nnazarov/.local/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 1050, in to_datetime
values = convert_listlike(arg._values, format)
File "/home/nnazarov/.local/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 453, in _convert_listlike_datetimes
return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
File "/home/nnazarov/.local/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 484, in _array_strptime_with_fallback
result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors, utc=utc)
File "pandas/_libs/tslibs/strptime.pyx", line 530, in pandas._libs.tslibs.strptime.array_strptime
File "pandas/_libs/tslibs/strptime.pyx", line 351, in pandas._libs.tslibs.strptime.array_strptime
ValueError: time data "2023-06-20 20:41:11+00:00" doesn't match format "%Y-%m-%d %H:%M:%S.%f%z", at position 816780. You might want to try:
- passing `format` if your strings have a consistent format;
- passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
- passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
Here is line 816780, together with the lines before and after it for context:
NGN,NGN,2023-06-20 20:30:08.305255+00:00,2023-06-20 20:41:08.317472+00:00,131.243.51.211,144.195.208.70,49851,8801,active,UDP,503,0.000107876,"[0, 52, 46, 405, 0, 0, 0, 0]"
NGN,NGN,2023-06-20 20:40:53.903338+00:00,2023-06-20 20:41:11+00:00,2001:400:0:40::200:205,2001:400:211:81::d1,161,56640,idle,UDP,503,0.0001688016,"[0, 0, 0, 503, 0, 0, 0, 0]"
NGN,NGN,2023-06-20 20:40:53.890268+00:00,2023-06-20 20:41:10.986850+00:00,2001:400:211:81::d1,2001:400:0:40::200:205,56640,161,idle,UDP,503,4.6164e-05,"[0, 503, 0, 0, 0, 0, 0, 0]"
How can I resolve the issue?
Answer 1
Score: 2
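IIUC, you have mixed datetime formats: some values with fractional seconds and some without (as at position 816780), and possibly some with and without the UTC offset specified. A minimal reproducible example: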
import pandas as pd
print(pd.to_datetime(["2023-06-20 20:41:11+00:00",
"2023-06-20 20:41:11",
"2023-06-20 20:41:11.890268+00:00"]))
raises:
ValueError: time data "2023-06-20 20:41:11" doesn't match format "%Y-%m-%d %H:%M:%S%z", at position 1. You might want to try:
- passing `format` if your strings have a consistent format;
- passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
- passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
With pandas v2, you can combine the keywords utc=True and format="ISO8601" to avoid the error:
print(pd.__version__)
# 2.0.3
print(
pd.to_datetime(["2023-06-20 20:41:11+00:00",
"2023-06-20 20:41:11",
"2023-06-20 20:41:11.890268+00:00"],
format="ISO8601", utc=True)
)
DatetimeIndex(['2023-06-20 20:41:11+00:00',
               '2023-06-20 20:41:11+00:00',
               '2023-06-20 20:41:11.890268+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
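Applied to the chunked read from the question, this could look like the sketch below. It is only an illustration under a few assumptions: the in-memory sample stands in for the real 30 GB file, the column names col0 and col1 and the helper add_duration are hypothetical, and parse_dates is dropped from read_csv so the two timestamp columns stay as strings until they are converted explicitly.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file. Only the two timestamp
# columns are taken from the question; col0/col1 are made-up names.
SAMPLE_CSV = """\
col0,col1,start_time_ns,end_time_ns
NGN,NGN,2023-06-20 20:30:08.305255+00:00,2023-06-20 20:41:08.317472+00:00
NGN,NGN,2023-06-20 20:40:53.903338+00:00,2023-06-20 20:41:11+00:00
NGN,NGN,2023-06-20 20:40:53.890268+00:00,2023-06-20 20:41:10.986850+00:00
"""


def add_duration(chunk):
    """Parse both timestamp columns as ISO 8601 (UTC) and add their difference."""
    start = pd.to_datetime(chunk["start_time_ns"], format="ISO8601", utc=True)
    end = pd.to_datetime(chunk["end_time_ns"], format="ISO8601", utc=True)
    chunk = chunk.copy()
    chunk["time"] = end - start
    return chunk


# No parse_dates here: conversion happens per chunk inside add_duration.
reader = pd.read_csv(io.StringIO(SAMPLE_CSV), chunksize=2)
result = pd.concat(add_duration(chunk) for chunk in reader)
print(result["time"])
```

Because format="ISO8601" accepts every ISO 8601 variant, the row without fractional seconds no longer depends on which format happened to be inferred from the first value of a chunk. For the real file you would replace the io.StringIO object with the filename and process or write out each chunk instead of concatenating everything in memory.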