“R” 和 SQL 用于查询大型 JSON 数据

huangapple go评论67阅读模式
英文:

"R" and SQL to query large JSON data

问题

我有一个大的JSON文件(383,255KB)。它包含街道名称、地址和纬度/经度值。
我认为我说得对,这是WGS84世界大地测量系统84文件格式。

我只想在“R”中查询这些数据。
我正在考虑使用SQL,因为在这种情况下它很高效,但我不知道有哪些包/方法可以采用这种方法。

我想要做的是:

  select Latitude, Longitude From JSONData Where Address = "160" and LFNAME = "Transit Rd"

JSON数据文件的结构(以及一些相关的数据块)如下所示:

{
	"type" : "FeatureCollection",
	"name" : "ADDRESS_POINT_WGS84",
	"features" : [
		{
			"type" : "Feature",
			"geometry" : {
				"type" : "Point",
				"coordinates" : [ -79.539538708, 43.586921565 ]
			},
			"properties" : {
				"GEO_ID" : 5729529,
				"LINK" : 5729516,
				"MAINT_STAG" : "REGULAR",
				"ADDRESS" : "399",
				"LFNAME" : "Lake Promenade",
				"LO_NUM" : 399,
				"LONUMSUF" : null,
				"HINUM" : null,
				"HINUMSUF" : null,
				"ARC_SIDE" : "L",
				"DISTANCE" : 4.15,
				"FCODE" : 115001,
				"FCODE_DES" : "Unknown",
				"CLASS" : "Land",
				"NAME" : null,
				"X" : 301607.151,
				"Y" : 4827443.931,
				"LONGITUDE" : -79.5395419864,
				"LATITUDE" : 43.5869301259,
				"OBJECTID" : 1491585,
				"MUN_NAME" : "Etobicoke",
				"WARD_NAME" : "Etobicoke-Lakeshore"
			}
		},
        ...
	]
}

我将查找成千上万个城市地址的纬度/经度值,因此将数据存储在内存中是有意义的。

我还没有完全阅读这篇文章:https://themockup.blog/posts/2020-05-22-parsing-json-in-r-with-jsonlite/

我已经阅读了关于“矩形化”的文章:https://tidyr.tidyverse.org/articles/rectangle.html。
我尝试了一些方法,但对我来说似乎过于复杂。

我一定是走错了路,老实说,我不知道该采取什么方法。
我只是要确定地址的纬度/经度值。
一旦我获得了“项目”所需的纬度/经度值,我将在地图上显示这些项目。

难道没有更好的方法吗?

感谢您的任何提示/建议/指导。

Richard

英文:

I have a large JSON file (383,255KB). It contains street names, addresses, and lat/long values.
I think I'm right in saying this is a WGS84 World Geodetic System 84 file format.

All I want to do is query this data, in "R".
I'm thinking in SQL, because it is efficient in this context, but
I am not aware of any package/method to take this approach.

I would like to:

  select Latitude, Longitude From JSONData Where Address = "160" and LFNAME = "Transit Rd"

The structure (and a few relevant data blocks) of the JSON data file are shown here:

{
	"type" : "FeatureCollection",
	"name" : "ADDRESS_POINT_WGS84",
	"features" : [
		{
			"type" : "Feature",
			"geometry" : {
				"type" : "Point",
				"coordinates" : [ -79.539538708, 43.586921565 ]
			},
			"properties" : {
				"GEO_ID" : 5729529,
				"LINK" : 5729516,
				"MAINT_STAG" : "REGULAR",
				"ADDRESS" : "399",
				"LFNAME" : "Lake Promenade",
				"LO_NUM" : 399,
				"LONUMSUF" : null,
				"HINUM" : null,
				"HINUMSUF" : null,
				"ARC_SIDE" : "L",
				"DISTANCE" : 4.15,
				"FCODE" : 115001,
				"FCODE_DES" : "Unknown",
				"CLASS" : "Land",
				"NAME" : null,
				"X" : 301607.151,
				"Y" : 4827443.931,
				"LONGITUDE" : -79.5395419864,
				"LATITUDE" : 43.5869301259,
				"OBJECTID" : 1491585,
				"MUN_NAME" : "Etobicoke",
				"WARD_NAME" : "Etobicoke-Lakeshore"
			}
		},
		{
			"type" : "Feature",
			"geometry" : {
				"type" : "Point",
				"coordinates" : [ -79.540371034, 43.5882630139 ]
			},
			"properties" : {
				"GEO_ID" : 9950585,
				"LINK" : 9950578,
				"MAINT_STAG" : "REGULAR",
				"ADDRESS" : "7",
				"LFNAME" : "Hilo Rd",
				"LO_NUM" : 7,
				"LONUMSUF" : null,
				"HINUM" : null,
				"HINUMSUF" : null,
				"ARC_SIDE" : "L",
				"DISTANCE" : 43.77,
				"FCODE" : 115001,
				"FCODE_DES" : "Unknown",
				"CLASS" : "Land",
				"NAME" : null,
				"X" : 301540.011,
				"Y" : 4827592.989,
				"LONGITUDE" : -79.5403743125,
				"LATITUDE" : 43.5882715763,
				"OBJECTID" : 1491586,
				"MUN_NAME" : "Etobicoke",
				"WARD_NAME" : "Etobicoke-Lakeshore"
			}
		},
		{
			"type" : "Feature",
			"geometry" : {
				"type" : "Point",
				"coordinates" : [ -79.5390269742, 43.587174988 ]
			},
			"properties" : {
				"GEO_ID" : 5729496,
				"LINK" : 5729468,
				"MAINT_STAG" : "REGULAR",
				"ADDRESS" : "387",
				"LFNAME" : "Lake Promenade",
				"LO_NUM" : 387,
				"LONUMSUF" : null,
				"HINUM" : null,
				"HINUMSUF" : null,
				"ARC_SIDE" : "L",
				"DISTANCE" : 97.58,
				"FCODE" : 115001,
				"FCODE_DES" : "Unknown",
				"CLASS" : "Land",
				"NAME" : null,
				"X" : 301648.488,
				"Y" : 4827472.065,
				"LONGITUDE" : -79.5390302568,
				"LATITUDE" : 43.5871835468,
				"OBJECTID" : 1491591,
				"MUN_NAME" : "Etobicoke",
				"WARD_NAME" : "Etobicoke-Lakeshore"
			}
		},
		{
			"type" : "Feature",
			"geometry" : {
				"type" : "Point",
				"coordinates" : [ -79.3880422943, 43.6830929049 ]
			},
			"properties" : {
				"GEO_ID" : 850466,
				"LINK" : 20002604,
				"MAINT_STAG" : "REGULAR",
				"ADDRESS" : "1",
				"LFNAME" : "Ottawa St",
				"LO_NUM" : 1,
				"LONUMSUF" : null,
				"HINUM" : null,
				"HINUMSUF" : null,
				"ARC_SIDE" : "R",
				"DISTANCE" : 3.99,
				"FCODE" : 115001,
				"FCODE_DES" : "Unknown",
				"CLASS" : "Land",
				"NAME" : null,
				"X" : 313826.457,
				"Y" : 4838133.322,
				"LONGITUDE" : -79.388042294,
				"LATITUDE" : 43.683092902,
				"OBJECTID" : 4675851,
				"MUN_NAME" : "former Toronto",
				"WARD_NAME" : "University-Rosedale"
			}
		}
	]
}

I will be looking up lat/long values for thousands of city addresses, so it makes sense to have the data in memory.

I haven't completely read this article: https://themockup.blog/posts/2020-05-22-parsing-json-in-r-with-jsonlite/

I've read about "Rectangling" here: https://tidyr.tidyverse.org/articles/rectangle.html.
I monkeyed about, it seems too complicated to me.

I must be going about this the wrong way, and I honestly don't know what approach to take.
I'm just determining lat/long values for addresses.
Once I have the lat/long values for the "items" I'm tracking, I will show those items on a map.

Isn't there another better way?

Thank you for any tips/advice/pointers.

Richard

答案1

得分: 1

以下是您提供的代码的翻译部分:

# 导入jsonlite和tidyverse库
library(jsonlite)
library(tidyverse)

# 将给定的字符串存储在变量中
json_string <- '{ "type" : "FeatureCollection", "name" : "ADDRESS_POINT_WGS84", "features" : [ { "type" : "Feature", "geometry" : { "type" : "Point", "coordinates" : [ -79.539538708, 43.586921565 ] }, "properties" : { "GEO_ID" : 5729529, "LINK" : 5729516, "MAINT_STAG" : "REGULAR", "ADDRESS" : "399", "LFNAME" : "Lake Promenade", "LO_NUM" : 399, "LONUMSUF" : null, "HINUM" : null, "HINUMSUF" : null, "ARC_SIDE" : "L", "DISTANCE" : 4.15, "FCODE" : 115001, "FCODE_DES" : "Unknown", "CLASS" : "Land", "NAME" : null, "X" : 301607.151, "Y" : 4827443.931, "LONGITUDE" : -79.5395419864, "LATITUDE" : 43.5869301259, "OBJECTID" : 1491585, "MUN_NAME" : "Etobicoke", "WARD_NAME" : "Etobicoke-Lakeshore" } } ] }'

# 将字符串转换为JSON对象
json_data <- jsonlite::fromJSON(json_string)
my_df <- json_data$features
my_df <- my_df %>%
  unnest_wider(col="properties")
names(my_df)
#>  [1] "type"       "geometry"   "GEO_ID"     "LINK"       "MAINT_STAG"
#>  [6] "ADDRESS"    "LFNAME"     "LO_NUM"     "LONUMSUF"  
#> [11] "HINUM"      "HINUMSUF"   "ARC_SIDE"   "DISTANCE"   "FCODE"     
#> [16] "FCODE_DES"  "CLASS"      "NAME"       "X"          "Y"         
#> [21] "LONGITUDE"  "LATITUDE"   "OBJECTID"   "MUN_NAME"   "WARD_NAME"

my_df %>%
  filter(ADDRESS=="399") %>%
  filter(LFNAME=="Lake Promenade")
#> # A tibble: 1 × 24
#>   type    geometry$type  GEO_ID   LINK MAINT_STAG ADDRESS LFNAME LO_NUM LONUMSUF
#>   <chr>   <chr>           <int>  <int> <chr>      <chr>   <chr>   <int> <lgl>   
#> 1 Feature Point         5729529 5.73e6 REGULAR    399     Lake …    399 NA      
#> # ℹ 16 more variables: geometry$coordinates <list>, HINUM <lgl>,
#> #   HINUMSUF <lgl>, ARC_SIDE <chr>, DISTANCE <dbl>, FCODE <int>,
#> #   FCODE_DES <chr>, CLASS <chr>, NAME <lgl>, X <dbl>, Y <dbl>,
#> #   LONGITUDE <dbl>, LATITUDE <dbl>, OBJECTID <int>, MUN_NAME <chr>,
#> #   WARD_NAME <chr>

希望这有帮助!如果您需要任何进一步的翻译,请随时告诉我。

英文:
library(jsonlite)
library(tidyverse)

# Store the given string in a variable
json_string &lt;- &#39;{ &quot;type&quot; : &quot;FeatureCollection&quot;, &quot;name&quot; : &quot;ADDRESS_POINT_WGS84&quot;, &quot;features&quot; : [ { &quot;type&quot; : &quot;Feature&quot;, &quot;geometry&quot; : { &quot;type&quot; : &quot;Point&quot;, &quot;coordinates&quot; : [ -79.539538708, 43.586921565 ] }, &quot;properties&quot; : { &quot;GEO_ID&quot; : 5729529, &quot;LINK&quot; : 5729516, &quot;MAINT_STAG&quot; : &quot;REGULAR&quot;, &quot;ADDRESS&quot; : &quot;399&quot;, &quot;LFNAME&quot; : &quot;Lake Promenade&quot;, &quot;LO_NUM&quot; : 399, &quot;LONUMSUF&quot; : null, &quot;HINUM&quot; : null, &quot;HINUMSUF&quot; : null, &quot;ARC_SIDE&quot; : &quot;L&quot;, &quot;DISTANCE&quot; : 4.15, &quot;FCODE&quot; : 115001, &quot;FCODE_DES&quot; : &quot;Unknown&quot;, &quot;CLASS&quot; : &quot;Land&quot;, &quot;NAME&quot; : null, &quot;X&quot; : 301607.151, &quot;Y&quot; : 4827443.931, &quot;LONGITUDE&quot; : -79.5395419864, &quot;LATITUDE&quot; : 43.5869301259, &quot;OBJECTID&quot; : 1491585, &quot;MUN_NAME&quot; : &quot;Etobicoke&quot;, &quot;WARD_NAME&quot; : &quot;Etobicoke-Lakeshore&quot; } } ] }&#39;

# Convert the string to a JSON object
json_data &lt;- jsonlite::fromJSON(json_string)
my_df &lt;- json_data$features
my_df &lt;- my_df %&gt;% 
  unnest_wider(col=&quot;properties&quot;)
names(my_df)
#&gt;  [1] &quot;type&quot;       &quot;geometry&quot;   &quot;GEO_ID&quot;     &quot;LINK&quot;       &quot;MAINT_STAG&quot;
#&gt;  [6] &quot;ADDRESS&quot;    &quot;LFNAME&quot;     &quot;LO_NUM&quot;     &quot;LONUMSUF&quot;   &quot;HINUM&quot;     
#&gt; [11] &quot;HINUMSUF&quot;   &quot;ARC_SIDE&quot;   &quot;DISTANCE&quot;   &quot;FCODE&quot;      &quot;FCODE_DES&quot; 
#&gt; [16] &quot;CLASS&quot;      &quot;NAME&quot;       &quot;X&quot;          &quot;Y&quot;          &quot;LONGITUDE&quot; 
#&gt; [21] &quot;LATITUDE&quot;   &quot;OBJECTID&quot;   &quot;MUN_NAME&quot;   &quot;WARD_NAME&quot;

my_df %&gt;% 
  filter(ADDRESS==&quot;399&quot;) %&gt;% 
  filter(LFNAME==&quot;Lake Promenade&quot;)
#&gt; # A tibble: 1 &#215; 24
#&gt;   type    geometry$type  GEO_ID   LINK MAINT_STAG ADDRESS LFNAME LO_NUM LONUMSUF
#&gt;   &lt;chr&gt;   &lt;chr&gt;           &lt;int&gt;  &lt;int&gt; &lt;chr&gt;      &lt;chr&gt;   &lt;chr&gt;   &lt;int&gt; &lt;lgl&gt;   
#&gt; 1 Feature Point         5729529 5.73e6 REGULAR    399     Lake …    399 NA      
#&gt; # ℹ 16 more variables: geometry$coordinates &lt;list&gt;, HINUM &lt;lgl&gt;,
#&gt; #   HINUMSUF &lt;lgl&gt;, ARC_SIDE &lt;chr&gt;, DISTANCE &lt;dbl&gt;, FCODE &lt;int&gt;,
#&gt; #   FCODE_DES &lt;chr&gt;, CLASS &lt;chr&gt;, NAME &lt;lgl&gt;, X &lt;dbl&gt;, Y &lt;dbl&gt;,
#&gt; #   LONGITUDE &lt;dbl&gt;, LATITUDE &lt;dbl&gt;, OBJECTID &lt;int&gt;, MUN_NAME &lt;chr&gt;,
#&gt; #   WARD_NAME &lt;chr&gt;

<sup>Created on 2023-05-10 with reprex v2.0.2</sup>

答案2

得分: 0

提供的文件示例是有效的 GeoJSON 文件,空间包(如 sf)可以直接处理它。sf 对象基本上是一个带有特殊几何列的数据框,因此您可以主要使用标准的数据整理工具和方法来处理属性,无论是基本 R 还是 dplyr / Tidyverse。如果您真的想要的话,也可以使用 sf 进行 SQL 查询(仅在读取文件时可用,因此每个查询都会从磁盘读取整个数据集到内存,至少在存储为 json 时是这样的)。或者使用 sqldf 包在数据框上使用 SQL,尽管这样您会错过对 sf 工具的直接访问 - https://r-spatial.github.io/sf/

library(sf)
library(dplyr)

# 读取示例数据集,保存为 sample.geojson
sample_sf <- st_read("sample.geojson", quiet = TRUE, as_tibble = TRUE)
sample_sf
#> Simple feature collection with 4 features and 22 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -79.54037 ymin: 43.58692 xmax: -79.38804 ymax: 43.68309
#> Geodetic CRS:  WGS 84
#> # A tibble: 4 × 23
#>    GEO_ID     LINK MAINT_STAG ADDRESS LFNAME      LO_NUM LONUMSUF HINUM HINUMSUF
#>     <int>    <int> <chr>      <chr>   <chr>        <int> <chr>    <chr> <chr>   
#> 1 5729529  5729516 REGULAR    399     Lake Prome…    399 <NA>     <NA>  <NA>    
#> 2 9950585  9950578 REGULAR    7       Hilo Rd          7 <NA>     <NA>  <NA>    
#> 3 5729496  5729468 REGULAR    387     Lake Prome…    387 <NA>     <NA>  <NA>    
#> 4  850466 20002604 REGULAR    1       Ottawa St        1 <NA>     <NA>  <NA>    
#> # ℹ 14 more variables: ARC_SIDE <chr>, DISTANCE <dbl>,
#> #   FCODE <int>, FCODE_DES <chr>, CLASS <chr>, NAME <chr>, X <dbl>, Y <dbl>,
#> #   LONGITUDE <dbl>, LATITUDE <dbl>, OBJECTID <int>, MUN_NAME <chr>,
#> #   WARD_NAME <chr>, geometry <POINT [°]>

# 使用 mapview 过滤和绘制结果
sample_sf %>%
  filter(ADDRESS == "7", LFNAME == "Hilo Rd") %>%
  mapview::mapview()

“R” 和 SQL 用于查询大型 JSON 数据

# 在空间数据集上使用 SQL 查询:
st_read("sample.geojson", quiet = TRUE, as_tibble = TRUE, 
        query = "select Latitude, Longitude From ADDRESS_POINT_WGS84 Where ADDRESS = '7'")
#> Simple feature collection with 1 feature and 2 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -79.54037 ymin: 43.58826 xmax: -79.54037 ymax: 43.58826
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 3
#>   Latitude Longitude             geometry
#>      <dbl>     <dbl>          <POINT [°]>
#> 1     43.6     -79.5 (-79.54037 43.58826)

# 对于较大的数据集,您可能希望考虑比 json 更紧凑的存储选项,例如 GeoPackage,它实际上是一个 SQLite 文件。
st_write(sample_sf, "sample.gpkg") 

# 或者只是 R rds:
readr::write_rds(sample_sf, "sample.rds")

# 尽管对于这个小示例,生成的 gpkg 大小超过了 json 的大小:
list.files() %>%  file.info() %>% `[`(1)
#>                 size
#> sample.geojson  3537
#> sample.gpkg    98304
#> sample.rds      3411

# 读取时,只需替换文件名;sf 会根据扩展名猜测类型/驱动程序
sample_g <- st_read("sample.gpkg", quiet = TRUE, as_tibble = TRUE)

创建于2023-05-10,使用 reprex v2.0.2

英文:

Provided file sample is valid GeoJSON and spatial packages, like sf, can handle it directly. sf object is basically a dataframe with special geometry column(s), so you mostly use your standard data wrangling tools and methods to deal with attributes, be it base R or dplyr / Tidyverse. If you really want to, you can also use SQL queries with sf (only available when reading a file, so every single query reads the whole dataset from disk to memory, at least when it's stored as json). Or use sqldf package to use SQL on dataframes, though you'd then miss direct access to sf tools - https://r-spatial.github.io/sf/

library(sf)
library(dplyr)

# read sample dataset, saved as sample.geojson
sample_sf &lt;- st_read(&quot;sample.geojson&quot;, quiet = TRUE, as_tibble = TRUE)
sample_sf
#&gt; Simple feature collection with 4 features and 22 fields
#&gt; Geometry type: POINT
#&gt; Dimension:     XY
#&gt; Bounding box:  xmin: -79.54037 ymin: 43.58692 xmax: -79.38804 ymax: 43.68309
#&gt; Geodetic CRS:  WGS 84
#&gt; # A tibble: 4 &#215; 23
#&gt;    GEO_ID     LINK MAINT_STAG ADDRESS LFNAME      LO_NUM LONUMSUF HINUM HINUMSUF
#&gt;     &lt;int&gt;    &lt;int&gt; &lt;chr&gt;      &lt;chr&gt;   &lt;chr&gt;        &lt;int&gt; &lt;chr&gt;    &lt;chr&gt; &lt;chr&gt;   
#&gt; 1 5729529  5729516 REGULAR    399     Lake Prome…    399 &lt;NA&gt;     &lt;NA&gt;  &lt;NA&gt;    
#&gt; 2 9950585  9950578 REGULAR    7       Hilo Rd          7 &lt;NA&gt;     &lt;NA&gt;  &lt;NA&gt;    
#&gt; 3 5729496  5729468 REGULAR    387     Lake Prome…    387 &lt;NA&gt;     &lt;NA&gt;  &lt;NA&gt;    
#&gt; 4  850466 20002604 REGULAR    1       Ottawa St        1 &lt;NA&gt;     &lt;NA&gt;  &lt;NA&gt;    
#&gt; # ℹ 14 more variables: ARC_SIDE &lt;chr&gt;, DISTANCE &lt;dbl&gt;, FCODE &lt;int&gt;,
#&gt; #   FCODE_DES &lt;chr&gt;, CLASS &lt;chr&gt;, NAME &lt;chr&gt;, X &lt;dbl&gt;, Y &lt;dbl&gt;,
#&gt; #   LONGITUDE &lt;dbl&gt;, LATITUDE &lt;dbl&gt;, OBJECTID &lt;int&gt;, MUN_NAME &lt;chr&gt;,
#&gt; #   WARD_NAME &lt;chr&gt;, geometry &lt;POINT [&#176;]&gt;

# filter and plot results with mapview
sample_sf %&gt;%
  filter(ADDRESS == &quot;7&quot;,LFNAME == &quot;Hilo Rd&quot;) %&gt;%
  mapview::mapview()

“R” 和 SQL 用于查询大型 JSON 数据<!-- -->

# use SQL query on spatial dataset:  
st_read(&quot;sample.geojson&quot;, quiet = TRUE, as_tibble = TRUE, 
        query = &quot;select Latitude, Longitude From ADDRESS_POINT_WGS84 Where ADDRESS = &#39;7&#39;&quot;)
#&gt; Simple feature collection with 1 feature and 2 fields
#&gt; Geometry type: POINT
#&gt; Dimension:     XY
#&gt; Bounding box:  xmin: -79.54037 ymin: 43.58826 xmax: -79.54037 ymax: 43.58826
#&gt; Geodetic CRS:  WGS 84
#&gt; # A tibble: 1 &#215; 3
#&gt;   Latitude Longitude             geometry
#&gt;      &lt;dbl&gt;     &lt;dbl&gt;          &lt;POINT [&#176;]&gt;
#&gt; 1     43.6     -79.5 (-79.54037 43.58826)

# You might want to consider more compact storage options than json 
# for larger datasets, for example GeoPackage, which is actually an SQLite file.
st_write(sample_sf, &quot;sample.gpkg&quot;) 

# or just R rds:
readr::write_rds(sample_sf, &quot;sample.rds&quot;)

# Though for this small sample, the resulting gpkg size exceeds json&#39;s:
list.files() %&gt;%  file.info() %&gt;% `[`(1)
#&gt;                 size
#&gt; sample.geojson  3537
#&gt; sample.gpkg    98304
#&gt; sample.rds      3411

# For reading, just replace the filename; sf guesses the type / driver
# from extension
sample_g &lt;- st_read(&quot;sample.gpkg&quot;, quiet = TRUE, as_tibble = TRUE)

<sup>Created on 2023-05-10 with reprex v2.0.2</sup>

huangapple
  • 本文由 发表于 2023年5月10日 14:12:59
  • 转载请务必保留本文链接:https://go.coder-hub.com/76215357.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定