2023年6月1日 18:29:35go评论110阅读模式

英文:

Expanding a table to accomodate for raneg if dates in R or Python?

问题

I have two tables.

高频表1

年份	周数	利率
2006-07	1	40
2006-07	2	40
2006-07	3	40
2006-07	4	41
2006-07	5	41

低频表2

开始日期	结束日期	升数
1-Jul-06	31-Jul-06	40,000
1 Aug -06	31 Jan 07	42,000
1 Feb 07	31 Jul 07	40,000

我想要将表1中的利率和表2中的升数合并，以便低频表扩展并在范围内的所有周重复40,000升。

我应该如何在R或Python中实现这个目标？

我将非常感谢您的专业建议。

在Python中，我尝试了以下代码：

import pandas as pd
import numpy as np
low_freq_table = pd.read_csv(r"/content/low freq.csv")
low_freq_table.info()
# 列信息
#   列名                非空数   数据类型  
# 0  Tariff Start Date  20 non-null     object 
# 1  Tariff End Date    20 non-null     object 
# 2  Excise Rate        20 non-null     float64
low_freq_table['Tariff Start Date'] = pd.to_datetime(low_freq_table['Tariff Start Date'], format='%d-%b-%y')
low_freq_table['Tariff End Date'] = pd.to_datetime(low_freq_table['Tariff End Date'], format='%d-%b-%y')
expanded_table = pd.DataFrame()
for _, row in low_freq_table.iterrows():
    start_date = row['Tariff Start Date']
    end_date = row['Tariff End Date']
    Rate = row['Excise Rate']
    weeks = pd.date_range(start=start_date, end=end_date, freq='W')
    temp_table = pd.DataFrame({
        'Year': weeks.strftime('%Y-%m'),
        'Week Number': weeks.week,
        'Interest': Rate
    })
    expanded_table = expanded_table.append(temp_table, ignore_index=True)
high_freq_table = pd.read_csv(r"/content/high freq.csv")
high_freq_table.info()
# 合并表格
merged_table = pd.merge(high_freq_table, expanded_table, on=['Year', 'Week Number'], how='left')
print(merged_table)

但是在合并表格时遇到了问题，因为周数每年都会重复：

KeyError: 'Week'

我将非常感谢任何建议。

英文:

I have two tables.

A High frequency Table 1

Year	Week Number	Interest
2006-07	1	40
2006-07	2	40
2006-07	3	40
2006-07	4	41
2006-07	5	41

A low frequency Table 2

start date	end date	litres
1- Jul-06	31-Jul-06	40,000
1 Aug -06	31 Jan 07	42,000
1 Feb 07	31 Jul 07	40,000

I want to join interest and litres into Table 1, so that the low frequency table expands and duplicates litres 40,000 for all weeks in the range.

How can I do this in R or Python?

I will highly appreciate your kind and expert suggestions.

I have tried the following in Python

import pandas as pd
import numpy as np
low_freq_table=pd.read_csv(r&quot;/content/low freq.csv&quot;)
low_freq_table.info()
#   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Tariff Start Date  20 non-null     object 
 1   Tariff End Date    20 non-null     object 
 2   Excise Rate        20 non-null     float64
low_freq_table[&#39;Tariff Start Date&#39;] = pd.to_datetime(low_freq_table[&#39;Tariff Start Date&#39;], format=&#39;%d-%b-%y&#39;)
low_freq_table[&#39;Tariff End Date&#39;] = pd.to_datetime(low_freq_table[&#39;Tariff End Date&#39;], format=&#39;%d-%b-%y&#39;)
expanded_table = pd.DataFrame()
for _, row in low_freq_table.iterrows():
    start_date = row[&#39;Tariff Start Date&#39;]
    end_date = row[&#39;Tariff End Date&#39;]
    Rate = row[&#39;Excise Rate&#39;]
    weeks = pd.date_range(start=start_date, end=end_date, freq=&#39;W&#39;)
    temp_table = pd.DataFrame({
        &#39;Year&#39;: weeks.strftime(&#39;%Y-%m&#39;),
        &#39;Week Number&#39;: weeks.week,
        &#39;Interest&#39;: Rate
    })
    expanded_table = expanded_table.append(temp_table, ignore_index=True)
high_freq_table=pd.read_csv(r&quot;/content/high freq.csv&quot;)
high_freq_table.info()

The code works upto the expansion. But merging does not work because the week numbers repeat every year,

merged_table = pd.merge(high_freq_table, expanded_table, on=[&#39;Year&#39;,&#39;Week&#39;], how=&#39;left&#39;)
print(merged_table)

But get the following error
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-38-666ff1fd1f5c> in <cell line: 1>()
----> 1 merged_table = pd.merge(high_freq_table, expanded_table, on=['Year','Week'], how='left')
2 print(merged_table)
3 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in _get_label_or_level_values(self, key, axis)
1848 )
1849 else:
-> 1850 raise KeyError(key)
1851
1852 # Check for duplicates
KeyError: 'Week'

I will appeciate any suggestions.

答案1

得分: 1

我将只翻译您提供的代码部分，以下是代码的翻译：

library(lubridate)
library(stringr)
# 载入高频数据
high_freq_df <- read.csv(textConnection('Year,Week Number,Interest
2006-07,1,40
2006-07,2,40
2006-07,3,40
2006-07,4,41
2006-07,5,41'))
# 清理兴趣列
high_freq_df$Interest <- as.numeric(high_freq_df$Interest)
# 清理周数列
high_freq_df$Week.Number <- as.integer(high_freq_df$Week.Number)
# 函数，接受年份和月份（例如'2007-01'）以及周数，返回表示该周开始的日期
to_week_start_date <- function(ym, wk) {
  d1 <- sprintf('%s-01', ym) |> as.POSIXct()
  d2 <- d1 + months(1)
  dts <- seq.POSIXt(d1, d2, by = '7 days')
  dts[as.integer(wk)] |> 
    as.character()
}
# 在高频数据中标记周开始日期
high_freq_df$start_date <- apply(high_freq_df, 1, function(x) {
  to_week_start_date(x[1], x[2])
})
# 载入低频数据
low_freq_df <- read.csv(textConnection('start date|end date|litres
1- Jul-06|31-Jul-06|40,000
1 Aug -06|31 Jan 07|42,000
1 Feb 07|31 Jul 07|40,000'), sep = '|') |>
  `colnames<-`(c('start_date', 'end_date', 'litres'))
# 清理起始日期列
low_freq_df$start_date <- str_replace_all(low_freq_df$start_date, '\\s+', '-') |> 
  str_replace_all('\\-+', '-') |> 
  as.POSIXct(format = '%e-%b-%y') |> 
  as.character()
# 清理结束日期列
low_freq_df$end_date <- str_replace_all(low_freq_df$end_date, '\\s+', '-') |> 
  str_replace_all('\\-+', '-') |> 
  as.POSIXct(format = '%e-%b-%y') |>
  as.character()
# 清理升数列
low_freq_df$litres <- str_remove_all(low_freq_df$litres, ',') |> 
  as.numeric()
# 将高频和低频数据合并为单个数据框
merged_df <- merge(high_freq_df, low_freq_df, by = 'start_date', all = TRUE)
# 清理列名
colnames(merged_df) <- tolower(colnames(merged_df)) |> str_replace_all('\\.', '_') 
# 打印结果
print(merged_df)

希望这对您有所帮助。

英文:

I am going to make some assumptions and treat this question as being very life-like, in that it involves the processing of somewhat messy input data. We note that the date formats are inconsistent across the data so we will transform everything to date format %Y-%m-%d for the sake of consistency. We will also assume that the "Year" and "Week Number" columns in the high-frequency data can be used to impute a start date for the week in an intuitive way.

library(lubridate)
library(stringr)
# load high-frequency data
high_freq_df &lt;- read.csv(textConnection(&#39;Year,Week Number,Interest
2006-07,1,40
2006-07,2,40
2006-07,3,40
2006-07,4,41
2006-07,5,41&#39;))
# clean interest column
high_freq_df$Interest &lt;- as.numeric(high_freq_df$Interest)
# clean week number column
high_freq_df$Week.Number &lt;- as.integer(high_freq_df$Week.Number)
# function that takes year and month (e.g. &#39;2007-01&#39;) and week numbers and returns 
# the date representing the start of that week
to_week_start_date &lt;- function(ym, wk) {
  d1 &lt;- sprintf(&#39;%s-01&#39;, ym) |&gt; as.POSIXct()
  d2 &lt;- d1 + months(1)
  dts &lt;- seq.POSIXt(d1, d2, by = &#39;7 days&#39;)
  dts[ as.integer(wk) ] |&gt; 
    as.character()
}
# tag high-frequency df with week start date
high_freq_df$start_date &lt;- apply(high_freq_df, 1, function(x) {
  to_week_start_date(x[ 1 ], x[ 2 ])
})
# load low-frequency data
low_freq_df &lt;- read.csv(textConnection(&#39;start date|end date|litres
1- Jul-06|31-Jul-06|40,000
1 Aug -06|31 Jan 07|42,000
1 Feb 07|31 Jul 07|40,000&#39;), sep = &#39;|&#39;) |&gt;
  `colnames&lt;-`(c(&#39;start_date&#39;, &#39;end_date&#39;, &#39;litres&#39;))
# clean start_date column
low_freq_df$start_date &lt;- str_replace_all(low_freq_df$start_date, &#39;\\s+&#39;, &#39;-&#39;) |&gt; 
  str_replace_all(&#39;\\-+&#39;, &#39;-&#39;) |&gt; 
  as.POSIXct(format = &#39;%e-%b-%y&#39;) |&gt; 
  as.character()
# clean end_date column
low_freq_df$end_date &lt;- str_replace_all(low_freq_df$end_date, &#39;\\s+&#39;, &#39;-&#39;) |&gt; 
  str_replace_all(&#39;\\-+&#39;, &#39;-&#39;) |&gt; 
  as.POSIXct(format = &#39;%e-%b-%y&#39;) |&gt;
  as.character()
# clean litres column
low_freq_df$litres &lt;- str_remove_all(low_freq_df$litres, &#39;,&#39;) |&gt; 
  as.numeric()
# combine high and low-frequency data into single data.frame
merged_df &lt;- merge(high_freq_df, low_freq_df, by = &#39;start_date&#39;, all = TRUE)
# clean colnames
colnames(merged_df) &lt;- tolower(colnames(merged_df)) |&gt; str_replace_all(&#39;\\.&#39;, &#39;_&#39;) 
# print result
print(merged_df)

  start_date    year week_number interest   end_date litres
1 2006-07-01 2006-07           1       40 2006-07-31  40000
2 2006-07-08 2006-07           2       40       &lt;NA&gt;     NA
3 2006-07-15 2006-07           3       40       &lt;NA&gt;     NA
4 2006-07-22 2006-07           4       41       &lt;NA&gt;     NA
5 2006-07-29 2006-07           5       41       &lt;NA&gt;     NA
6 2006-08-01    &lt;NA&gt;          NA       NA 2007-01-31  42000
7 2007-02-01    &lt;NA&gt;          NA       NA 2007-07-31  40000

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

扩展表格以适应R或Python中的日期范围？

问题

答案1

How can I change the first element of a list?

Issue with appending a list in Python.

PyPDF4 – 导出的PDF文件大小过大

这个错误是什么？数值错误，使用基数10时int()无效文字：”

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。