如何将CSV文件中的几列复制/连接到R中的现有SQL表格中?

huangapple go评论65阅读模式
英文:

How to copy/join few columns from a csv file to an existing SQL table in R?

问题

我在SQL中创建了一个名为"SQL_table"的表,并使用下面提供的代码将各种现有的.csv文件内容复制到R中的"SQL_table"中。

代码:

library(RPostgres)
library(DBI)

# 步骤2:创建包含所有文件的数据框
file_names1 <- dir("D:/Data/", full.names = TRUE, recursive = T) # 指定文件所在路径
my_data_frame <- do.call(rbind, lapply(file_names1, read.csv))

# 步骤3:使用RPostgres建立R和PostgreSQL连接
dsn_database = "...."   # 指定您的数据库名称
dsn_hostname = "localhost"  # 指定主机名
dsn_port = "...."                # 指定端口号,例如98939
dsn_uid = "...."         # 指定用户名,例如"admin"
dsn_pwd = "...."         # 指定密码,例如"xxx"

tryCatch({
  drv <- dbDriver("Postgres")
  print("连接到数据库中...")
  connec <- dbConnect(drv, 
                      dbname = dsn_database,
                      host = dsn_hostname, 
                      port = dsn_port,
                      user = dsn_uid, 
                      password = dsn_pwd)
  print("数据库已连接!")
},
error=function(cond) {
  print("无法连接到数据库。")
})

# 步骤4:使用RPostgres运行
dbWriteTable(connec, "SQL_table", my_data_frame, create = TRUE)
dbAppendTable(connec, "SQL_table", my_data_frame, create = TRUE)
dbReadTable(connec, "SQL_table")

我的SQL_table的结构如下:

Y_m_d_Time  Location_ID  Popul  Literacy  Mortality  
2001-01-01 00:00:00  NAM  5000  77  8.8
2002-01-01 00:00:00  NAM  8700  58  7.7
2003-01-01 00:00:00  NAM  3410  98  9.8
..
..
1990-01-01 00:00:00  KEN  2000  87  9.8
1991-01-01 00:00:00  KEN  3200  88  9.7
1991-01-01 00:00:00  KEN  4910  78  8.8
..
..
2010-01-01 00:00:00  PUB  6000  97  9.8
2011-01-01 00:00:00  PUB  4200  89  9.7
2012-01-01 00:00:00  PUB  5910  88  8.8
..
..
..

我有另一个.csv文件"CSV_1",我想要从"CSV_1"复制一些列(纬度,经度,高度,起始日期,结束日期)到"SQL_table"中。两个表之间的公共列是"SQL_table"中的Location_ID和"CSV_1"中的City。

CSV文件的结构如下:

City  Latitude  Longitude  Altitude  Start_Date  End_Date  No. of Events  Event_Type
NAM  35  79  218  3/1/2001    10  Flood
KEN  30  81  129  2/1/1990  5/31/1999  5  Earthquake
PUB  22  76  220  1/1/2010    11  Landslide
..
..

我的期望输出如下:

Y_m_d_Time  Location_ID  Popul  Literacy  Mortality  Latitude  Longitude  Altitude  Start_Date  End_Date
2001-01-01 00:00:00  NAM  5000  77  8.8  35  79  218  3/1/2001
2002-01-01 00:00:00  NAM  8700  58  7.7  35  79  218  3/1/2001
2003-01-01 00:00:00  NAM  3410  98  9.8  35  79  218  3/1/2001
..
..
1990-01-01 00:00:00  KEN  2000  87  9.8  30  81  129  2/1/1990  5/31/1999
1991-01-01 00:00:00  KEN  3200  88  9.7  30  81  129  2/1/1990  5/31/1999
1991-01-01 00:00:00  KEN  4910  78  8.8  30  81  129  2/1/1990  5/31/1999
..
..
2010-01-01 00:00:00  PUB  6000  97  9.8  22  76  220  1/1/2010
2011-01-01 00:00:00  PUB  4200  89  9.7  22  76  220  1/1/2010
2012-01-01 00:00:00  PUB  5910  88  8.8  22  76  220  1/1/2010
..
..
..

请问是否有人可以帮助我扩展R中的代码以获得所需的结果?

英文:

I have a created a table "SQL_table" in SQL and copied contents from various existing .csv files to "SQL_table" in R using the code given below.

Code:

library(RPostgres)
library(DBI)

#Step 2: make a dataframe of all files
file_names1 &lt;- dir(&quot;D:/Data/&quot;, full.names = TRUE, recursive = T) #where you have your files
my_data_frame &lt;- do.call(rbind,lapply(file_names1,read.csv))

#Step 3: Establish R &amp; PostgreSQL Connection using RPostgres
dsn_database = &quot;....&quot;   # Specify the name of your Database
dsn_hostname = &quot;localhost&quot;  # Specify host name
dsn_port = &quot;....&quot;                # Specify your port number. e.g. 98939
dsn_uid = &quot;....&quot;         # Specify your username. e.g. &quot;admin&quot;
dsn_pwd = &quot;....&quot;         # Specify your password. e.g. &quot;xxx&quot;

tryCatch({
  drv &lt;- dbDriver(&quot;Postgres&quot;)
  print(&quot;Connecting to Database…&quot;)
  connec &lt;- dbConnect(drv, 
                      dbname = dsn_database,
                      host = dsn_hostname, 
                      port = dsn_port,
                      user = dsn_uid, 
                      password = dsn_pwd)
  print(&quot;Database Connected!&quot;)
},
error=function(cond) {
  print(&quot;Unable to connect to Database.&quot;)
})

#Step 4: Run using RPostgres
dbWriteTable(connec, &quot;SQL_table&quot;, my_data_frame, create = TRUE)
dbAppendTable(connec, &quot;SQL_table&quot;, my_data_frame, create = TRUE)
dbReadTable(connec, &quot;SQL_table&quot;)

The structure of my SQL_table is as follows:

Y_m_d_Time  Location_ID  Popul  Literacy  Mortality  
2001-01-01 00:00:00  NAM  5000  77  8.8
2002-01-01 00:00:00  NAM  8700  58  7.7
2003-01-01 00:00:00  NAM  3410  98  9.8
..
..
1990-01-01 00:00:00  KEN  2000  87  9.8
1991-01-01 00:00:00  KEN  3200  88  9.7
1991-01-01 00:00:00  KEN  4910  78  8.8
..
..
2010-01-01 00:00:00  PUB  6000  97  9.8
2011-01-01 00:00:00  PUB  4200  89  9.7
2012-01-01 00:00:00  PUB  5910  88  8.8
..
..
..

I have another .csv file "CSV_1" and I want to copy few columns (Latitude, Longitude, Altitude, Start_Date, End_Date) from "CSV_1" to "SQL_table" using R. The common column between both the tables is Location_ID in "SQL_table" and City in "CSV_1".

The structure of CSV file is as follows:

City  Latitude  Longitude  Altitude  Start_Date  End_Date  No. of Events  Event_Type
NAM  35  79  218  3/1/2001    10  Flood
KEN  30  81  129  2/1/1990  5/31/1999  5  Earthquake
PUB  22  76  220  1/1/2010    11  Landslide
..
..

My desired output is as follows:

Y_m_d_Time  Location_ID  Popul  Literacy  Mortality  Latitude  Longitude  Altitude  Start_Date  End_Date
2001-01-01 00:00:00  NAM  5000  77  8.8  35  79  218  3/1/2001
2002-01-01 00:00:00  NAM  8700  58  7.7  35  79  218  3/1/2001
2003-01-01 00:00:00  NAM  3410  98  9.8  35  79  218  3/1/2001
..
..
1990-01-01 00:00:00  KEN  2000  87  9.8  30  81  129  2/1/1990  5/31/1999
1991-01-01 00:00:00  KEN  3200  88  9.7  30  81  129  2/1/1990  5/31/1999
1991-01-01 00:00:00  KEN  4910  78  8.8  30  81  129  2/1/1990  5/31/1999
..
..
2010-01-01 00:00:00  PUB  6000  97  9.8  22  76  220  1/1/2010
2011-01-01 00:00:00  PUB  4200  89  9.7  22  76  220  1/1/2010
2012-01-01 00:00:00  PUB  5910  88  8.8  22  76  220  1/1/2010
..
..
..

Could anyone please help me how I can extend my code in R to obtain the desired results.

答案1

得分: 1

假设您的变量Location_ID来自flights表的originCity来自airports表的faa。如果您同意使用{dplyr},您可以简单地执行left_join()

con <- DBI::dbConnect(duckdb::duckdb())

dbplyr::copy_nycflights13(con)
#> 创建表格: airlines
#> 创建表格: airports
#> 创建表格: flights
#> 创建表格: planes
#> 创建表格: weather

flights <- dplyr::tbl(con, "flights")
airports <- dplyr::tbl(con, "airports")

flights |>
  dplyr::left_join(airports, by = c("origin" = "faa"))
#> # 源:   SQL [?? x 26]
#> # 数据库: DuckDB 0.8.1 [root@Darwin 22.5.0:R 4.3.1/:memory:]
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     2    14     1045           1056       -11     1204           1230
#>  2  2013     2    14     1048           1049        -1     1412           1400
#>  3  2013     2    14     1048           1040         8     1333           1337
#>  4  2013     2    14     1051           1100        -9     1424           1424
#>  5  2013     2    14     1051           1100        -9     1203           1214
#>  6  2013     2    14     1057           1100        -3     1408           1420
#>  7  2013     2    14     1057           1100        -3     1244           1300
#>  8  2013     2    14     1057           1059        -2     1353           1441
#>  9  2013     2    14     1058           1056         2     1211           1220
#> 10  2013     2    14     1058           1100        -2     1337           1338
#> # 更多行
#> # 18个更多的变量: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>, name <chr>, lat <dbl>,
#> #   lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>

DBI::dbDisconnect(con, shutdown = TRUE)

请注意,上述代码是用R语言编写的,它使用{dplyr}和数据库连接来执行左连接操作。如果您有任何其他问题或需要进一步的帮助,请随时提出。

英文:

Suppose your variable Location_ID is origin from the flights table and City is faa from the airports table. If you agreed to use {dplyr}, you could simply do a left_join().

con &lt;- DBI::dbConnect(duckdb::duckdb())

dbplyr::copy_nycflights13(con)
#&gt; Creating table: airlines
#&gt; Creating table: airports
#&gt; Creating table: flights
#&gt; Creating table: planes
#&gt; Creating table: weather

flights &lt;- dplyr::tbl(con, &quot;flights&quot;)
airports &lt;- dplyr::tbl(con, &quot;airports&quot;)

flights |&gt;
  dplyr::left_join(airports, by = c(&quot;origin&quot; = &quot;faa&quot;))
#&gt; # Source:   SQL [?? x 26]
#&gt; # Database: DuckDB 0.8.1 [root@Darwin 22.5.0:R 4.3.1/:memory:]
#&gt;     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt;    &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;
#&gt;  1  2013     2    14     1045           1056       -11     1204           1230
#&gt;  2  2013     2    14     1048           1049        -1     1412           1400
#&gt;  3  2013     2    14     1048           1040         8     1333           1337
#&gt;  4  2013     2    14     1051           1100        -9     1424           1424
#&gt;  5  2013     2    14     1051           1100        -9     1203           1214
#&gt;  6  2013     2    14     1057           1100        -3     1408           1420
#&gt;  7  2013     2    14     1057           1100        -3     1244           1300
#&gt;  8  2013     2    14     1057           1059        -2     1353           1441
#&gt;  9  2013     2    14     1058           1056         2     1211           1220
#&gt; 10  2013     2    14     1058           1100        -2     1337           1338
#&gt; # ℹ more rows
#&gt; # ℹ 18 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;,
#&gt; #   tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; #   hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, name &lt;chr&gt;, lat &lt;dbl&gt;,
#&gt; #   lon &lt;dbl&gt;, alt &lt;dbl&gt;, tz &lt;dbl&gt;, dst &lt;chr&gt;, tzone &lt;chr&gt;

DBI::dbDisconnect(con, shutdown = TRUE)

huangapple
  • 本文由 发表于 2023年7月24日 19:54:37
  • 转载请务必保留本文链接:https://go.coder-hub.com/76754233.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定