pandas 根据日期和月份创建掩码。

huangapple go评论157阅读模式
英文:

pandas create a mask according to only the day and the month

问题

我有以下数据框:

dates,qq
1900-01-01,1
1900-01-02,2
1900-01-03,3
1900-01-04,1
1900-01-05,2
1901-01-06,5
1901-01-01,2
1901-01-02,2
1901-01-03,1
1901-01-04,4
1901-01-05,5
1901-01-06,6
1902-01-01,7
1902-01-02,1
1902-01-03,1
1902-01-04,2
1902-01-05,4
1902-01-06,5

dates,th
01-01,1
01-02,2
01-03,2
01-04,3
01-05,3
01-06,1

我们称它们为dfr和dfr_t。

在第二个数据框dfr_t中,存储了一些阈值。

我想要将第一个数据框中的值与第二个数据框(dfr_t)中的值按照日期和月份进行比较。
正如你所注意到的,在第二个数据框中,日期列只包含月份和日期,而年份不存在。

我想知道dfr中具有特定日期和月份的值是否小于或等于第二个数据框中(dfr_t)具有相同日期和月份的阈值。

以下是预期的结果:

1900-01-01	1	01-01	1	true
1900-01-02	2	01-02	2	true
1900-01-03	3	01-03	2	false
1900-01-04	1	01-04	3	true
1900-01-05	2	01-05	3	true
1900-01-06	5	01-06	1	false
1901-02-01	2	01-01	1	false
1901-02-02	2	01-02	2	true
1901-02-03	1	01-03	2	true
1901-02-04	4	01-04	3	false
1901-02-05	5	01-05	3	false
1901-02-06	6	01-06	1	false
1903-03-01	7	01-01	1	false
1903-03-02	1	01-02	2	true
1903-01-03	1	01-03	2	true
1903-01-04	2	01-04	3	true
1903-01-05	4	01-05	3	false
1903-01-06	5	01-06	1	false

如你所见,对于日期01-01,与1900-01-01对应的值等于1(结果为True);与1901-01-01对应的值不等于1(结果为False);与1902-01-01对应的值也不等于1(结果为False)。

我是如何读取这些数据框的:

dfr = pd.read_csv('test.csv', sep=',', index_col=0, parse_dates=True)

dfr_t = pd.read_csv('treh.csv', sep=',', index_col=0)

我看到的第一个问题是:

dfr_t.index.dtype
Out[15]: dtype('O')

我无法将其转换为日期索引:

ValueError: time data "1" doesn't match format "%m-%d", at position 0. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

一种想法是复制dfr_t并将其连接起来。然而,这个选项似乎不够通用。

英文:

I have the following dataframes:

dates,qq
1900-01-01,1
1900-01-02,2
1900-01-03,3
1900-01-04,1
1900-01-05,2
1901-01-06,5
1901-01-01,2
1901-01-02,2
1901-01-03,1
1901-01-04,4
1901-01-05,5
1901-01-06,6
1902-01-01,7
1902-01-02,1
1902-01-03,1
1902-01-04,2
1902-01-05,4
1902-01-06,5

and

dates,th
01-01,1
01-02,2
01-03,2
01-04,3
01-05,3
01-06,1

let's say dfr and dfr_t.

In the second one, dfr_t, some thresholds are stored.

I would like to compare the values of the first dataframe with the ones in the second (dfr_t) according to the day and month.
As you can notice indeed in the second one the dates column has only the month and the day while the year is non present.

I would like to know where a values of dfr with a specific day and month is less or equal to the threshold value defined in the second one (dfr_t) with the same day and month.

The following is the expected outcome:

1900-01-01	1	01-01	1	true
1900-01-02	2	01-02	2	true
1900-01-03	3	01-03	2	false
1900-01-04	1	01-04	3	true
1900-01-05	2	01-05	3	true
1900-01-06	5	01-06	1	false
1901-02-01	2	01-01	1	false
1901-02-02	2	01-02	2	true
1901-02-03	1	01-03	2	true
1901-02-04	4	01-04	3	false
1901-02-05	5	01-05	3	false
1901-02-06	6	01-06	1	false
1903-03-01	7	01-01	1	false
1903-03-02	1	01-02	2	true
1903-01-03	1	01-03	2	true
1903-01-04	2	01-04	3	true
1903-01-05	4	01-05	3	false
1903-01-06	5	01-06	1	false

As you can notice, for the date 01-01 the value corresponding to 1900-01-01 is equal to 1 (result=True); the value corresponding to 1901-01-01 is not equal\less to 1 (result=False); the value corresponding to 1902-01-01 is not equal\less to 1 (result=False).

I have read those dataframe as follow:

dfr   = pd.read_csv('test.csv', sep=',',index_col=0,parse_dates=True)

dfr_t = pd.read_csv('treh.csv', sep=',',index_col=0)

The first issue that I see is the following:

dfr_t.index.dtype
Out[15]: dtype('O')

I can not convert it to dateindex

ValueError: time data "1" doesn't match format "%m-%d", at position 0. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

One idea could be duplicate dfr_t and concatenate it. This option, however, does not seem general.

答案1

得分: 4

The provided code appears to be performing a merge operation on two DataFrames, dfr and dfr_t, based on the "month-day" or "day" values. It then calculates a "flag" column based on the comparison of "qq" and "th" values. The code is presented in both cases: merging on "month-day" and merging on "day" only.

If you have any specific questions or need further assistance, please let me know.

英文:

The exact logic is unclear given the provided output, however you can probably merge on the month-day, then compare the values:

out = (dfr
  .merge(dfr_t, left_on=dfr.index.strftime('%m-%d'),
         right_index=True, how='left')
  .assign(flag=lambda d: d['qq'].le(d['th']))
)

Output:

    dates     dates_x  qq dates_y   th   flag
0   01-01  1900-01-01   1   01-01  1.0   True
1   01-02  1900-01-02   2   01-02  2.0   True
2   01-03  1900-01-03   3   01-03  2.0  False
3   01-04  1900-01-04   1   01-04  3.0   True
4   01-05  1900-01-05   2   01-05  3.0   True
5   01-06  1900-01-06   5   01-06  1.0  False
6   02-01  1900-02-01   2     NaN  NaN  False
7   02-02  1900-02-02   2     NaN  NaN  False
8   02-03  1900-02-03   1     NaN  NaN  False
9   02-04  1900-02-04   4     NaN  NaN  False
10  02-05  1900-02-05   5     NaN  NaN  False
11  02-06  1900-02-06   6     NaN  NaN  False
12  03-01  1900-03-01   7     NaN  NaN  False
13  03-02  1900-03-02   1     NaN  NaN  False
14  01-03  1900-01-03   1   01-03  2.0   True
15  01-04  1900-01-04   2   01-04  3.0   True
16  01-05  1900-01-05   4   01-05  3.0  False
17  01-06  1900-01-06   5   01-06  1.0  False

matching on the day only

out = (dfr
  .merge(dfr_t,
         left_on=pd.to_datetime(dfr['dates']).dt.strftime('%d'),
         right_on=dfr_t['dates'].str.extract(r'-(\d+)$', expand=False),
         how='left')
  .assign(flag=lambda d: d['qq'].le(d['th']))
)

Output:

   key_0     dates_x  qq dates_y  th   flag
0     01  1900-01-01   1   01-01   1   True
1     02  1900-01-02   2   01-02   2   True
2     03  1900-01-03   3   01-03   2  False
3     04  1900-01-04   1   01-04   3   True
4     05  1900-01-05   2   01-05   3   True
5     06  1900-01-06   5   01-06   1  False
6     01  1900-02-01   2   01-01   1  False
7     02  1900-02-02   2   01-02   2   True
8     03  1900-02-03   1   01-03   2   True
9     04  1900-02-04   4   01-04   3  False
10    05  1900-02-05   5   01-05   3  False
11    06  1900-02-06   6   01-06   1  False
12    01  1900-03-01   7   01-01   1  False
13    02  1900-03-02   1   01-02   2   True
14    03  1900-01-03   1   01-03   2   True
15    04  1900-01-04   2   01-04   3   True
16    05  1900-01-05   4   01-05   3  False
17    06  1900-01-06   5   01-06   1  False

huangapple
  • 本文由 发表于 2023年6月12日 21:45:21
  • 转载请务必保留本文链接:https://go.coder-hub.com/76457295.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定