扩展表格以适应R或Python中的日期范围?

huangapple go评论110阅读模式
英文:

Expanding a table to accomodate for raneg if dates in R or Python?

问题

I have two tables.

高频表1

年份 周数 利率
2006-07 1 40
2006-07 2 40
2006-07 3 40
2006-07 4 41
2006-07 5 41

低频表2

开始日期 结束日期 升数
1-Jul-06 31-Jul-06 40,000
1 Aug -06 31 Jan 07 42,000
1 Feb 07 31 Jul 07 40,000

我想要将表1中的利率和表2中的升数合并,以便低频表扩展并在范围内的所有周重复40,000升。

我应该如何在R或Python中实现这个目标?

我将非常感谢您的专业建议。

在Python中,我尝试了以下代码:

  1. import pandas as pd
  2. import numpy as np
  3. low_freq_table = pd.read_csv(r"/content/low freq.csv")
  4. low_freq_table.info()
  5. # 列信息
  6. # 列名 非空数 数据类型
  7. # 0 Tariff Start Date 20 non-null object
  8. # 1 Tariff End Date 20 non-null object
  9. # 2 Excise Rate 20 non-null float64
  10. low_freq_table['Tariff Start Date'] = pd.to_datetime(low_freq_table['Tariff Start Date'], format='%d-%b-%y')
  11. low_freq_table['Tariff End Date'] = pd.to_datetime(low_freq_table['Tariff End Date'], format='%d-%b-%y')
  12. expanded_table = pd.DataFrame()
  13. for _, row in low_freq_table.iterrows():
  14. start_date = row['Tariff Start Date']
  15. end_date = row['Tariff End Date']
  16. Rate = row['Excise Rate']
  17. weeks = pd.date_range(start=start_date, end=end_date, freq='W')
  18. temp_table = pd.DataFrame({
  19. 'Year': weeks.strftime('%Y-%m'),
  20. 'Week Number': weeks.week,
  21. 'Interest': Rate
  22. })
  23. expanded_table = expanded_table.append(temp_table, ignore_index=True)
  24. high_freq_table = pd.read_csv(r"/content/high freq.csv")
  25. high_freq_table.info()
  26. # 合并表格
  27. merged_table = pd.merge(high_freq_table, expanded_table, on=['Year', 'Week Number'], how='left')
  28. print(merged_table)

但是在合并表格时遇到了问题,因为周数每年都会重复:

  1. KeyError: 'Week'

我将非常感谢任何建议。

英文:

I have two tables.

A High frequency Table 1

Year Week Number Interest
2006-07 1 40
2006-07 2 40
2006-07 3 40
2006-07 4 41
2006-07 5 41

A low frequency Table 2

start date end date litres
1- Jul-06 31-Jul-06 40,000
1 Aug -06 31 Jan 07 42,000
1 Feb 07 31 Jul 07 40,000

I want to join interest and litres into Table 1, so that the low frequency table expands and duplicates litres 40,000 for all weeks in the range.

How can I do this in R or Python?

I will highly appreciate your kind and expert suggestions.

I have tried the following in Python

  1. import pandas as pd
  2. import numpy as np
  3. low_freq_table=pd.read_csv(r"/content/low freq.csv")
  4. low_freq_table.info()
  5. # Column Non-Null Count Dtype
  6. --- ------ -------------- -----
  7. 0 Tariff Start Date 20 non-null object
  8. 1 Tariff End Date 20 non-null object
  9. 2 Excise Rate 20 non-null float64
  10. low_freq_table['Tariff Start Date'] = pd.to_datetime(low_freq_table['Tariff Start Date'], format='%d-%b-%y')
  11. low_freq_table['Tariff End Date'] = pd.to_datetime(low_freq_table['Tariff End Date'], format='%d-%b-%y')
  12. expanded_table = pd.DataFrame()
  13. for _, row in low_freq_table.iterrows():
  14. start_date = row['Tariff Start Date']
  15. end_date = row['Tariff End Date']
  16. Rate = row['Excise Rate']
  17. weeks = pd.date_range(start=start_date, end=end_date, freq='W')
  18. temp_table = pd.DataFrame({
  19. 'Year': weeks.strftime('%Y-%m'),
  20. 'Week Number': weeks.week,
  21. 'Interest': Rate
  22. })
  23. expanded_table = expanded_table.append(temp_table, ignore_index=True)
  24. high_freq_table=pd.read_csv(r"/content/high freq.csv")
  25. high_freq_table.info()

The code works upto the expansion. But merging does not work because the week numbers repeat every year,

  1. merged_table = pd.merge(high_freq_table, expanded_table, on=['Year','Week'], how='left')
  2. print(merged_table)

But get the following error
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-38-666ff1fd1f5c> in <cell line: 1>()
----> 1 merged_table = pd.merge(high_freq_table, expanded_table, on=['Year','Week'], how='left')
2 print(merged_table)
3 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in _get_label_or_level_values(self, key, axis)
1848 )
1849 else:
-> 1850 raise KeyError(key)
1851
1852 # Check for duplicates
KeyError: 'Week'

I will appeciate any suggestions.

答案1

得分: 1

我将只翻译您提供的代码部分,以下是代码的翻译:

  1. library(lubridate)
  2. library(stringr)
  3. # 载入高频数据
  4. high_freq_df <- read.csv(textConnection('Year,Week Number,Interest
  5. 2006-07,1,40
  6. 2006-07,2,40
  7. 2006-07,3,40
  8. 2006-07,4,41
  9. 2006-07,5,41'))
  10. # 清理兴趣列
  11. high_freq_df$Interest <- as.numeric(high_freq_df$Interest)
  12. # 清理周数列
  13. high_freq_df$Week.Number <- as.integer(high_freq_df$Week.Number)
  14. # 函数,接受年份和月份(例如'2007-01')以及周数,返回表示该周开始的日期
  15. to_week_start_date <- function(ym, wk) {
  16. d1 <- sprintf('%s-01', ym) |> as.POSIXct()
  17. d2 <- d1 + months(1)
  18. dts <- seq.POSIXt(d1, d2, by = '7 days')
  19. dts[as.integer(wk)] |>
  20. as.character()
  21. }
  22. # 在高频数据中标记周开始日期
  23. high_freq_df$start_date <- apply(high_freq_df, 1, function(x) {
  24. to_week_start_date(x[1], x[2])
  25. })
  26. # 载入低频数据
  27. low_freq_df <- read.csv(textConnection('start date|end date|litres
  28. 1- Jul-06|31-Jul-06|40,000
  29. 1 Aug -06|31 Jan 07|42,000
  30. 1 Feb 07|31 Jul 07|40,000'), sep = '|') |>
  31. `colnames<-`(c('start_date', 'end_date', 'litres'))
  32. # 清理起始日期列
  33. low_freq_df$start_date <- str_replace_all(low_freq_df$start_date, '\\s+', '-') |>
  34. str_replace_all('\\-+', '-') |>
  35. as.POSIXct(format = '%e-%b-%y') |>
  36. as.character()
  37. # 清理结束日期列
  38. low_freq_df$end_date <- str_replace_all(low_freq_df$end_date, '\\s+', '-') |>
  39. str_replace_all('\\-+', '-') |>
  40. as.POSIXct(format = '%e-%b-%y') |>
  41. as.character()
  42. # 清理升数列
  43. low_freq_df$litres <- str_remove_all(low_freq_df$litres, ',') |>
  44. as.numeric()
  45. # 将高频和低频数据合并为单个数据框
  46. merged_df <- merge(high_freq_df, low_freq_df, by = 'start_date', all = TRUE)
  47. # 清理列名
  48. colnames(merged_df) <- tolower(colnames(merged_df)) |> str_replace_all('\\.', '_')
  49. # 打印结果
  50. print(merged_df)

希望这对您有所帮助。

英文:

I am going to make some assumptions and treat this question as being very life-like, in that it involves the processing of somewhat messy input data. We note that the date formats are inconsistent across the data so we will transform everything to date format %Y-%m-%d for the sake of consistency. We will also assume that the "Year" and "Week Number" columns in the high-frequency data can be used to impute a start date for the week in an intuitive way.

  1. library(lubridate)
  2. library(stringr)
  3. # load high-frequency data
  4. high_freq_df &lt;- read.csv(textConnection(&#39;Year,Week Number,Interest
  5. 2006-07,1,40
  6. 2006-07,2,40
  7. 2006-07,3,40
  8. 2006-07,4,41
  9. 2006-07,5,41&#39;))
  10. # clean interest column
  11. high_freq_df$Interest &lt;- as.numeric(high_freq_df$Interest)
  12. # clean week number column
  13. high_freq_df$Week.Number &lt;- as.integer(high_freq_df$Week.Number)
  14. # function that takes year and month (e.g. &#39;2007-01&#39;) and week numbers and returns
  15. # the date representing the start of that week
  16. to_week_start_date &lt;- function(ym, wk) {
  17. d1 &lt;- sprintf(&#39;%s-01&#39;, ym) |&gt; as.POSIXct()
  18. d2 &lt;- d1 + months(1)
  19. dts &lt;- seq.POSIXt(d1, d2, by = &#39;7 days&#39;)
  20. dts[ as.integer(wk) ] |&gt;
  21. as.character()
  22. }
  23. # tag high-frequency df with week start date
  24. high_freq_df$start_date &lt;- apply(high_freq_df, 1, function(x) {
  25. to_week_start_date(x[ 1 ], x[ 2 ])
  26. })
  27. # load low-frequency data
  28. low_freq_df &lt;- read.csv(textConnection(&#39;start date|end date|litres
  29. 1- Jul-06|31-Jul-06|40,000
  30. 1 Aug -06|31 Jan 07|42,000
  31. 1 Feb 07|31 Jul 07|40,000&#39;), sep = &#39;|&#39;) |&gt;
  32. `colnames&lt;-`(c(&#39;start_date&#39;, &#39;end_date&#39;, &#39;litres&#39;))
  33. # clean start_date column
  34. low_freq_df$start_date &lt;- str_replace_all(low_freq_df$start_date, &#39;\\s+&#39;, &#39;-&#39;) |&gt;
  35. str_replace_all(&#39;\\-+&#39;, &#39;-&#39;) |&gt;
  36. as.POSIXct(format = &#39;%e-%b-%y&#39;) |&gt;
  37. as.character()
  38. # clean end_date column
  39. low_freq_df$end_date &lt;- str_replace_all(low_freq_df$end_date, &#39;\\s+&#39;, &#39;-&#39;) |&gt;
  40. str_replace_all(&#39;\\-+&#39;, &#39;-&#39;) |&gt;
  41. as.POSIXct(format = &#39;%e-%b-%y&#39;) |&gt;
  42. as.character()
  43. # clean litres column
  44. low_freq_df$litres &lt;- str_remove_all(low_freq_df$litres, &#39;,&#39;) |&gt;
  45. as.numeric()
  46. # combine high and low-frequency data into single data.frame
  47. merged_df &lt;- merge(high_freq_df, low_freq_df, by = &#39;start_date&#39;, all = TRUE)
  48. # clean colnames
  49. colnames(merged_df) &lt;- tolower(colnames(merged_df)) |&gt; str_replace_all(&#39;\\.&#39;, &#39;_&#39;)
  50. # print result
  51. print(merged_df)
  1. start_date year week_number interest end_date litres
  2. 1 2006-07-01 2006-07 1 40 2006-07-31 40000
  3. 2 2006-07-08 2006-07 2 40 &lt;NA&gt; NA
  4. 3 2006-07-15 2006-07 3 40 &lt;NA&gt; NA
  5. 4 2006-07-22 2006-07 4 41 &lt;NA&gt; NA
  6. 5 2006-07-29 2006-07 5 41 &lt;NA&gt; NA
  7. 6 2006-08-01 &lt;NA&gt; NA NA 2007-01-31 42000
  8. 7 2007-02-01 &lt;NA&gt; NA NA 2007-07-31 40000

huangapple
  • 本文由 发表于 2023年6月1日 18:29:35
  • 转载请务必保留本文链接:https://go.coder-hub.com/76380977.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定