英文:
Expanding a table to accomodate for raneg if dates in R or Python?
问题
I have two tables.
高频表1
年份 | 周数 | 利率 |
---|---|---|
2006-07 | 1 | 40 |
2006-07 | 2 | 40 |
2006-07 | 3 | 40 |
2006-07 | 4 | 41 |
2006-07 | 5 | 41 |
低频表2
开始日期 | 结束日期 | 升数 |
---|---|---|
1-Jul-06 | 31-Jul-06 | 40,000 |
1 Aug -06 | 31 Jan 07 | 42,000 |
1 Feb 07 | 31 Jul 07 | 40,000 |
我想要将表1中的利率和表2中的升数合并,以便低频表扩展并在范围内的所有周重复40,000升。
我应该如何在R或Python中实现这个目标?
我将非常感谢您的专业建议。
在Python中,我尝试了以下代码:
import pandas as pd
import numpy as np
low_freq_table = pd.read_csv(r"/content/low freq.csv")
low_freq_table.info()
# 列信息
# 列名 非空数 数据类型
# 0 Tariff Start Date 20 non-null object
# 1 Tariff End Date 20 non-null object
# 2 Excise Rate 20 non-null float64
low_freq_table['Tariff Start Date'] = pd.to_datetime(low_freq_table['Tariff Start Date'], format='%d-%b-%y')
low_freq_table['Tariff End Date'] = pd.to_datetime(low_freq_table['Tariff End Date'], format='%d-%b-%y')
expanded_table = pd.DataFrame()
for _, row in low_freq_table.iterrows():
start_date = row['Tariff Start Date']
end_date = row['Tariff End Date']
Rate = row['Excise Rate']
weeks = pd.date_range(start=start_date, end=end_date, freq='W')
temp_table = pd.DataFrame({
'Year': weeks.strftime('%Y-%m'),
'Week Number': weeks.week,
'Interest': Rate
})
expanded_table = expanded_table.append(temp_table, ignore_index=True)
high_freq_table = pd.read_csv(r"/content/high freq.csv")
high_freq_table.info()
# 合并表格
merged_table = pd.merge(high_freq_table, expanded_table, on=['Year', 'Week Number'], how='left')
print(merged_table)
但是在合并表格时遇到了问题,因为周数每年都会重复:
KeyError: 'Week'
我将非常感谢任何建议。
英文:
I have two tables.
A High frequency Table 1
Year | Week Number | Interest |
---|---|---|
2006-07 | 1 | 40 |
2006-07 | 2 | 40 |
2006-07 | 3 | 40 |
2006-07 | 4 | 41 |
2006-07 | 5 | 41 |
A low frequency Table 2
start date | end date | litres |
---|---|---|
1- Jul-06 | 31-Jul-06 | 40,000 |
1 Aug -06 | 31 Jan 07 | 42,000 |
1 Feb 07 | 31 Jul 07 | 40,000 |
I want to join interest and litres into Table 1, so that the low frequency table expands and duplicates litres 40,000 for all weeks in the range.
How can I do this in R or Python?
I will highly appreciate your kind and expert suggestions.
I have tried the following in Python
import pandas as pd
import numpy as np
low_freq_table=pd.read_csv(r"/content/low freq.csv")
low_freq_table.info()
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Tariff Start Date 20 non-null object
1 Tariff End Date 20 non-null object
2 Excise Rate 20 non-null float64
low_freq_table['Tariff Start Date'] = pd.to_datetime(low_freq_table['Tariff Start Date'], format='%d-%b-%y')
low_freq_table['Tariff End Date'] = pd.to_datetime(low_freq_table['Tariff End Date'], format='%d-%b-%y')
expanded_table = pd.DataFrame()
for _, row in low_freq_table.iterrows():
start_date = row['Tariff Start Date']
end_date = row['Tariff End Date']
Rate = row['Excise Rate']
weeks = pd.date_range(start=start_date, end=end_date, freq='W')
temp_table = pd.DataFrame({
'Year': weeks.strftime('%Y-%m'),
'Week Number': weeks.week,
'Interest': Rate
})
expanded_table = expanded_table.append(temp_table, ignore_index=True)
high_freq_table=pd.read_csv(r"/content/high freq.csv")
high_freq_table.info()
The code works upto the expansion. But merging does not work because the week numbers repeat every year,
merged_table = pd.merge(high_freq_table, expanded_table, on=['Year','Week'], how='left')
print(merged_table)
But get the following error
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-38-666ff1fd1f5c> in <cell line: 1>()
----> 1 merged_table = pd.merge(high_freq_table, expanded_table, on=['Year','Week'], how='left')
2 print(merged_table)
3 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in _get_label_or_level_values(self, key, axis)
1848 )
1849 else:
-> 1850 raise KeyError(key)
1851
1852 # Check for duplicates
KeyError: 'Week'
I will appeciate any suggestions.
答案1
得分: 1
我将只翻译您提供的代码部分,以下是代码的翻译:
library(lubridate)
library(stringr)
# 载入高频数据
high_freq_df <- read.csv(textConnection('Year,Week Number,Interest
2006-07,1,40
2006-07,2,40
2006-07,3,40
2006-07,4,41
2006-07,5,41'))
# 清理兴趣列
high_freq_df$Interest <- as.numeric(high_freq_df$Interest)
# 清理周数列
high_freq_df$Week.Number <- as.integer(high_freq_df$Week.Number)
# 函数,接受年份和月份(例如'2007-01')以及周数,返回表示该周开始的日期
to_week_start_date <- function(ym, wk) {
d1 <- sprintf('%s-01', ym) |> as.POSIXct()
d2 <- d1 + months(1)
dts <- seq.POSIXt(d1, d2, by = '7 days')
dts[as.integer(wk)] |>
as.character()
}
# 在高频数据中标记周开始日期
high_freq_df$start_date <- apply(high_freq_df, 1, function(x) {
to_week_start_date(x[1], x[2])
})
# 载入低频数据
low_freq_df <- read.csv(textConnection('start date|end date|litres
1- Jul-06|31-Jul-06|40,000
1 Aug -06|31 Jan 07|42,000
1 Feb 07|31 Jul 07|40,000'), sep = '|') |>
`colnames<-`(c('start_date', 'end_date', 'litres'))
# 清理起始日期列
low_freq_df$start_date <- str_replace_all(low_freq_df$start_date, '\\s+', '-') |>
str_replace_all('\\-+', '-') |>
as.POSIXct(format = '%e-%b-%y') |>
as.character()
# 清理结束日期列
low_freq_df$end_date <- str_replace_all(low_freq_df$end_date, '\\s+', '-') |>
str_replace_all('\\-+', '-') |>
as.POSIXct(format = '%e-%b-%y') |>
as.character()
# 清理升数列
low_freq_df$litres <- str_remove_all(low_freq_df$litres, ',') |>
as.numeric()
# 将高频和低频数据合并为单个数据框
merged_df <- merge(high_freq_df, low_freq_df, by = 'start_date', all = TRUE)
# 清理列名
colnames(merged_df) <- tolower(colnames(merged_df)) |> str_replace_all('\\.', '_')
# 打印结果
print(merged_df)
希望这对您有所帮助。
英文:
I am going to make some assumptions and treat this question as being very life-like, in that it involves the processing of somewhat messy input data. We note that the date formats are inconsistent across the data so we will transform everything to date format %Y-%m-%d
for the sake of consistency. We will also assume that the "Year" and "Week Number" columns in the high-frequency data can be used to impute a start date for the week in an intuitive way.
library(lubridate)
library(stringr)
# load high-frequency data
high_freq_df <- read.csv(textConnection('Year,Week Number,Interest
2006-07,1,40
2006-07,2,40
2006-07,3,40
2006-07,4,41
2006-07,5,41'))
# clean interest column
high_freq_df$Interest <- as.numeric(high_freq_df$Interest)
# clean week number column
high_freq_df$Week.Number <- as.integer(high_freq_df$Week.Number)
# function that takes year and month (e.g. '2007-01') and week numbers and returns
# the date representing the start of that week
to_week_start_date <- function(ym, wk) {
d1 <- sprintf('%s-01', ym) |> as.POSIXct()
d2 <- d1 + months(1)
dts <- seq.POSIXt(d1, d2, by = '7 days')
dts[ as.integer(wk) ] |>
as.character()
}
# tag high-frequency df with week start date
high_freq_df$start_date <- apply(high_freq_df, 1, function(x) {
to_week_start_date(x[ 1 ], x[ 2 ])
})
# load low-frequency data
low_freq_df <- read.csv(textConnection('start date|end date|litres
1- Jul-06|31-Jul-06|40,000
1 Aug -06|31 Jan 07|42,000
1 Feb 07|31 Jul 07|40,000'), sep = '|') |>
`colnames<-`(c('start_date', 'end_date', 'litres'))
# clean start_date column
low_freq_df$start_date <- str_replace_all(low_freq_df$start_date, '\\s+', '-') |>
str_replace_all('\\-+', '-') |>
as.POSIXct(format = '%e-%b-%y') |>
as.character()
# clean end_date column
low_freq_df$end_date <- str_replace_all(low_freq_df$end_date, '\\s+', '-') |>
str_replace_all('\\-+', '-') |>
as.POSIXct(format = '%e-%b-%y') |>
as.character()
# clean litres column
low_freq_df$litres <- str_remove_all(low_freq_df$litres, ',') |>
as.numeric()
# combine high and low-frequency data into single data.frame
merged_df <- merge(high_freq_df, low_freq_df, by = 'start_date', all = TRUE)
# clean colnames
colnames(merged_df) <- tolower(colnames(merged_df)) |> str_replace_all('\\.', '_')
# print result
print(merged_df)
start_date year week_number interest end_date litres
1 2006-07-01 2006-07 1 40 2006-07-31 40000
2 2006-07-08 2006-07 2 40 <NA> NA
3 2006-07-15 2006-07 3 40 <NA> NA
4 2006-07-22 2006-07 4 41 <NA> NA
5 2006-07-29 2006-07 5 41 <NA> NA
6 2006-08-01 <NA> NA NA 2007-01-31 42000
7 2007-02-01 <NA> NA NA 2007-07-31 40000
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论