扩展表格以适应R或Python中的日期范围?

huangapple go评论75阅读模式
英文:

Expanding a table to accomodate for raneg if dates in R or Python?

问题

I have two tables.

高频表1

年份 周数 利率
2006-07 1 40
2006-07 2 40
2006-07 3 40
2006-07 4 41
2006-07 5 41

低频表2

开始日期 结束日期 升数
1-Jul-06 31-Jul-06 40,000
1 Aug -06 31 Jan 07 42,000
1 Feb 07 31 Jul 07 40,000

我想要将表1中的利率和表2中的升数合并,以便低频表扩展并在范围内的所有周重复40,000升。

我应该如何在R或Python中实现这个目标?

我将非常感谢您的专业建议。

在Python中,我尝试了以下代码:

import pandas as pd
import numpy as np
low_freq_table = pd.read_csv(r"/content/low freq.csv")
low_freq_table.info()

# 列信息
#   列名                非空数   数据类型  
# 0  Tariff Start Date  20 non-null     object 
# 1  Tariff End Date    20 non-null     object 
# 2  Excise Rate        20 non-null     float64

low_freq_table['Tariff Start Date'] = pd.to_datetime(low_freq_table['Tariff Start Date'], format='%d-%b-%y')
low_freq_table['Tariff End Date'] = pd.to_datetime(low_freq_table['Tariff End Date'], format='%d-%b-%y')
expanded_table = pd.DataFrame()

for _, row in low_freq_table.iterrows():
    start_date = row['Tariff Start Date']
    end_date = row['Tariff End Date']
    Rate = row['Excise Rate']
    weeks = pd.date_range(start=start_date, end=end_date, freq='W')
    temp_table = pd.DataFrame({
        'Year': weeks.strftime('%Y-%m'),
        'Week Number': weeks.week,
        'Interest': Rate
    })
    expanded_table = expanded_table.append(temp_table, ignore_index=True)

high_freq_table = pd.read_csv(r"/content/high freq.csv")
high_freq_table.info()

# 合并表格
merged_table = pd.merge(high_freq_table, expanded_table, on=['Year', 'Week Number'], how='left')
print(merged_table)

但是在合并表格时遇到了问题,因为周数每年都会重复:

KeyError: 'Week'

我将非常感谢任何建议。

英文:

I have two tables.

A High frequency Table 1

Year Week Number Interest
2006-07 1 40
2006-07 2 40
2006-07 3 40
2006-07 4 41
2006-07 5 41

A low frequency Table 2

start date end date litres
1- Jul-06 31-Jul-06 40,000
1 Aug -06 31 Jan 07 42,000
1 Feb 07 31 Jul 07 40,000

I want to join interest and litres into Table 1, so that the low frequency table expands and duplicates litres 40,000 for all weeks in the range.

How can I do this in R or Python?

I will highly appreciate your kind and expert suggestions.

I have tried the following in Python

import pandas as pd
import numpy as np
low_freq_table=pd.read_csv(r"/content/low freq.csv")
low_freq_table.info()




#   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Tariff Start Date  20 non-null     object 
 1   Tariff End Date    20 non-null     object 
 2   Excise Rate        20 non-null     float64

low_freq_table['Tariff Start Date'] = pd.to_datetime(low_freq_table['Tariff Start Date'], format='%d-%b-%y')
low_freq_table['Tariff End Date'] = pd.to_datetime(low_freq_table['Tariff End Date'], format='%d-%b-%y')
expanded_table = pd.DataFrame()
for _, row in low_freq_table.iterrows():
    start_date = row['Tariff Start Date']
    end_date = row['Tariff End Date']
    Rate = row['Excise Rate']
    weeks = pd.date_range(start=start_date, end=end_date, freq='W')
    temp_table = pd.DataFrame({
        'Year': weeks.strftime('%Y-%m'),
        'Week Number': weeks.week,
        'Interest': Rate
    })
    expanded_table = expanded_table.append(temp_table, ignore_index=True)
high_freq_table=pd.read_csv(r"/content/high freq.csv")
high_freq_table.info()

The code works upto the expansion. But merging does not work because the week numbers repeat every year,

merged_table = pd.merge(high_freq_table, expanded_table, on=['Year','Week'], how='left')
print(merged_table)

But get the following error
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-38-666ff1fd1f5c> in <cell line: 1>()
----> 1 merged_table = pd.merge(high_freq_table, expanded_table, on=['Year','Week'], how='left')
2 print(merged_table)
3 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in _get_label_or_level_values(self, key, axis)
1848 )
1849 else:
-> 1850 raise KeyError(key)
1851
1852 # Check for duplicates
KeyError: 'Week'

I will appeciate any suggestions.

答案1

得分: 1

我将只翻译您提供的代码部分,以下是代码的翻译:

library(lubridate)
library(stringr)

# 载入高频数据
high_freq_df <- read.csv(textConnection('Year,Week Number,Interest
2006-07,1,40
2006-07,2,40
2006-07,3,40
2006-07,4,41
2006-07,5,41'))

# 清理兴趣列
high_freq_df$Interest <- as.numeric(high_freq_df$Interest)

# 清理周数列
high_freq_df$Week.Number <- as.integer(high_freq_df$Week.Number)

# 函数,接受年份和月份(例如'2007-01')以及周数,返回表示该周开始的日期
to_week_start_date <- function(ym, wk) {
  d1 <- sprintf('%s-01', ym) |> as.POSIXct()
  d2 <- d1 + months(1)
  dts <- seq.POSIXt(d1, d2, by = '7 days')
  dts[as.integer(wk)] |> 
    as.character()
}

# 在高频数据中标记周开始日期
high_freq_df$start_date <- apply(high_freq_df, 1, function(x) {
  to_week_start_date(x[1], x[2])
})

# 载入低频数据
low_freq_df <- read.csv(textConnection('start date|end date|litres
1- Jul-06|31-Jul-06|40,000
1 Aug -06|31 Jan 07|42,000
1 Feb 07|31 Jul 07|40,000'), sep = '|') |>
  `colnames<-`(c('start_date', 'end_date', 'litres'))

# 清理起始日期列
low_freq_df$start_date <- str_replace_all(low_freq_df$start_date, '\\s+', '-') |> 
  str_replace_all('\\-+', '-') |> 
  as.POSIXct(format = '%e-%b-%y') |> 
  as.character()

# 清理结束日期列
low_freq_df$end_date <- str_replace_all(low_freq_df$end_date, '\\s+', '-') |> 
  str_replace_all('\\-+', '-') |> 
  as.POSIXct(format = '%e-%b-%y') |>
  as.character()

# 清理升数列
low_freq_df$litres <- str_remove_all(low_freq_df$litres, ',') |> 
  as.numeric()

# 将高频和低频数据合并为单个数据框
merged_df <- merge(high_freq_df, low_freq_df, by = 'start_date', all = TRUE)

# 清理列名
colnames(merged_df) <- tolower(colnames(merged_df)) |> str_replace_all('\\.', '_') 

# 打印结果
print(merged_df)

希望这对您有所帮助。

英文:

I am going to make some assumptions and treat this question as being very life-like, in that it involves the processing of somewhat messy input data. We note that the date formats are inconsistent across the data so we will transform everything to date format %Y-%m-%d for the sake of consistency. We will also assume that the "Year" and "Week Number" columns in the high-frequency data can be used to impute a start date for the week in an intuitive way.

library(lubridate)
library(stringr)

# load high-frequency data
high_freq_df &lt;- read.csv(textConnection(&#39;Year,Week Number,Interest
2006-07,1,40
2006-07,2,40
2006-07,3,40
2006-07,4,41
2006-07,5,41&#39;))

# clean interest column
high_freq_df$Interest &lt;- as.numeric(high_freq_df$Interest)

# clean week number column
high_freq_df$Week.Number &lt;- as.integer(high_freq_df$Week.Number)

# function that takes year and month (e.g. &#39;2007-01&#39;) and week numbers and returns 
# the date representing the start of that week
to_week_start_date &lt;- function(ym, wk) {
  d1 &lt;- sprintf(&#39;%s-01&#39;, ym) |&gt; as.POSIXct()
  d2 &lt;- d1 + months(1)
  dts &lt;- seq.POSIXt(d1, d2, by = &#39;7 days&#39;)
  dts[ as.integer(wk) ] |&gt; 
    as.character()
}

# tag high-frequency df with week start date
high_freq_df$start_date &lt;- apply(high_freq_df, 1, function(x) {
  to_week_start_date(x[ 1 ], x[ 2 ])
})

# load low-frequency data
low_freq_df &lt;- read.csv(textConnection(&#39;start date|end date|litres
1- Jul-06|31-Jul-06|40,000
1 Aug -06|31 Jan 07|42,000
1 Feb 07|31 Jul 07|40,000&#39;), sep = &#39;|&#39;) |&gt;
  `colnames&lt;-`(c(&#39;start_date&#39;, &#39;end_date&#39;, &#39;litres&#39;))

# clean start_date column
low_freq_df$start_date &lt;- str_replace_all(low_freq_df$start_date, &#39;\\s+&#39;, &#39;-&#39;) |&gt; 
  str_replace_all(&#39;\\-+&#39;, &#39;-&#39;) |&gt; 
  as.POSIXct(format = &#39;%e-%b-%y&#39;) |&gt; 
  as.character()

# clean end_date column
low_freq_df$end_date &lt;- str_replace_all(low_freq_df$end_date, &#39;\\s+&#39;, &#39;-&#39;) |&gt; 
  str_replace_all(&#39;\\-+&#39;, &#39;-&#39;) |&gt; 
  as.POSIXct(format = &#39;%e-%b-%y&#39;) |&gt;
  as.character()

# clean litres column
low_freq_df$litres &lt;- str_remove_all(low_freq_df$litres, &#39;,&#39;) |&gt; 
  as.numeric()

# combine high and low-frequency data into single data.frame
merged_df &lt;- merge(high_freq_df, low_freq_df, by = &#39;start_date&#39;, all = TRUE)

# clean colnames
colnames(merged_df) &lt;- tolower(colnames(merged_df)) |&gt; str_replace_all(&#39;\\.&#39;, &#39;_&#39;) 

# print result
print(merged_df)
  start_date    year week_number interest   end_date litres
1 2006-07-01 2006-07           1       40 2006-07-31  40000
2 2006-07-08 2006-07           2       40       &lt;NA&gt;     NA
3 2006-07-15 2006-07           3       40       &lt;NA&gt;     NA
4 2006-07-22 2006-07           4       41       &lt;NA&gt;     NA
5 2006-07-29 2006-07           5       41       &lt;NA&gt;     NA
6 2006-08-01    &lt;NA&gt;          NA       NA 2007-01-31  42000
7 2007-02-01    &lt;NA&gt;          NA       NA 2007-07-31  40000

huangapple
  • 本文由 发表于 2023年6月1日 18:29:35
  • 转载请务必保留本文链接:https://go.coder-hub.com/76380977.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定