Interpolating values in xarray using non-indexed coordinates
Question
I'm trying to fetch a time series for a single geographical point from the Google ERA5 Reanalysis data. The dataset is as follows:
import xarray
data = xarray.open_zarr(
    'gs://gcp-public-data-arco-era5/co/single-level-reanalysis.zarr/',
    chunks={'time': 48},
    consolidated=True,
)
print("Model wind dataset size {:.1f} TiB".format(data.nbytes/(1024**4)))
print(data)
Model wind dataset size 28.0 TiB
<xarray.Dataset>
Dimensions:              (time: 374016, values: 542080)
Coordinates:
    depthBelowLandLayer  float64 ...
    entireAtmosphere     float64 ...
    latitude             (values) float64 dask.array<chunksize=(542080,), meta=np.ndarray>
    longitude            (values) float64 dask.array<chunksize=(542080,), meta=np.ndarray>
    number               int64 ...
    step                 timedelta64[ns] ...
    surface              float64 ...
  * time                 (time) datetime64[ns] 1979-01-01 ... 2021-08-31T23:0...
    valid_time           (time) datetime64[ns] dask.array<chunksize=(48,), meta=np.ndarray>
Dimensions without coordinates: values
Data variables: (12/38)
    cape                 (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    d2m                  (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    hcc                  (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    istl1                (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    istl2                (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    istl3                (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    ...                   ...
    tsn                  (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    u10                  (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    u100                 (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    v10                  (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    v100                 (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    z                    (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
Attributes:
    Conventions:               CF-1.7
    GRIB_centre:               ecmf
    GRIB_centreDescription:    European Centre for Medium-Range Weather Forec...
    GRIB_edition:              1
    GRIB_subCentre:            0
    history:                   2022-09-23T18:56 GRIB to CDM+CF via cfgrib-0.9...
    institution:               European Centre for Medium-Range Weather Forec...
    pangeo-forge:inputs_hash:  5f4378143e9f42402424280b63472752da3aa79179b53b...
    pangeo-forge:recipe_hash:  0c3415923e347ce9dac9dc5c6d209525f4d45d799bd25b...
    pangeo-forge:version:      0.9.1
What is the best way to interpolate a time series at a single geographical point?
Methods like .sel and .interp don't work:
data['cape'].interp(dict(latitude=60, longitude=20))
ValueError: Dimensions {'longitude', 'latitude'} do not exist. Expected one or more of Frozen({'values': 542080, 'time': 374016})
Answer 1
Score: 1
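sel and interp won't work because the data are not located on a structured, regular grid: latitude and longitude are plain coordinates attached to the single unstructured dimension values, not dimensions in their own right. If you scatter-plot the lon/lat coordinates, you will get:
[Image: scatter plot of the longitude/latitude coordinates]
You have to deal with these unstructured coordinates. One way is to take the nearest-neighbor value, for instance with code like this: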
import numpy as np
import xarray as xr
from sklearn.neighbors import NearestNeighbors
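class NearestInterpolator:
    def __init__(self, ds, x='longitude', y='latitude'):
        # Build the neighbor search once from the (longitude, latitude) pairs.
        coords = np.c_[ds[x].values, ds[y].values]
        self.nn = NearestNeighbors().fit(coords)

    def interpolate(self, ds, coords, values='values'):
        # Find the index of the closest grid point for each (lon, lat) query.
        index = self.nn.kneighbors(X=np.atleast_2d(coords), n_neighbors=1, return_distance=False).ravel()
        # Select those points along the unstructured 'values' dimension.
        return ds.isel({values: index})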
ds = xr.open_zarr("gs://gcp-public-data-arco-era5/co/single-level-reanalysis.zarr",
                      chunks={'time': 48},
                      consolidated=True)
ni = NearestInterpolator(ds=ds)
ds_interpolated = ni.interpolate(ds, [[2, 47], [5, 50]])
ds_interpolated['d2m']
>>> <xarray.DataArray 'd2m' (time: 374016, values: 2)>
>>> dask.array<getitem, shape=(374016, 2), dtype=float32, chunksize=(48, 2), chunktype=numpy.ndarray>
>>> Coordinates:
>>>     depthBelowLandLayer  float64 ...
>>>     entireAtmosphere     float64 ...
>>>     latitude             (values) float64 dask.array<chunksize=(2,), meta=np.ndarray>
>>>     longitude            (values) float64 dask.array<chunksize=(2,), meta=np.ndarray>
>>>     number               int64 ...
>>>     step                 timedelta64[ns] ...
>>>     surface              float64 ...
>>>   * time                 (time) datetime64[ns] 1979-01-01 ... 2021-08-31T23:0...
>>>     valid_time           (time) datetime64[ns] dask.array<chunksize=(48,), meta=np.ndarray>
>>> Dimensions without coordinates: values
This code is not perfect: it measures distance directly in longitude/latitude degrees, so it doesn't understand that data near longitude 0 and data near longitude 360 are spatially close, but it works. If you want to go further with linear interpolation, you would have to compute a Delaunay triangulation, which can be expensive for these 542,080 coordinates.
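One possible way around the longitude wrap-around, sketched here with NumPy only, is to convert every longitude/latitude pair to a 3D unit vector and pick the grid point with the largest dot product, which corresponds to the smallest great-circle distance. The helper names lonlat_to_unit_vectors and nearest_index_on_sphere are hypothetical, and the four-point grid is made up for illustration; it is not taken from ERA5.

```python
import numpy as np


def lonlat_to_unit_vectors(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to 3D unit vectors on the sphere."""
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    return np.stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        axis=-1,
    )


def nearest_index_on_sphere(grid_lon, grid_lat, query_lon, query_lat):
    """Return, for each query point, the index of the closest grid point.

    The largest dot product between unit vectors means the smallest
    great-circle distance, so longitudes 359.8 and 0.0 count as neighbours.
    """
    grid = lonlat_to_unit_vectors(grid_lon, grid_lat)  # shape (n_grid, 3)
    query = lonlat_to_unit_vectors(
        np.atleast_1d(query_lon), np.atleast_1d(query_lat)
    )  # shape (n_query, 3)
    return np.argmax(query @ grid.T, axis=1)


# Tiny made-up grid (hypothetical values, not real ERA5 coordinates).
grid_lon = np.array([0.0, 120.0, 240.0, 359.0])
grid_lat = np.array([50.0, 50.0, 50.0, 50.0])

# The first query sits at longitude 359.8: on the sphere it is closer to
# longitude 0.0 (index 0) than to 359.0 (index 3).
idx = nearest_index_on_sphere(grid_lon, grid_lat, [359.8, 119.0], [50.0, 50.0])
print(idx)  # [0 1]
```

With the real dataset, the resulting indices could be passed to ds.isel(values=idx) in the same way as in the class above. The brute-force dot product is fine for a handful of query points; for many queries, a KD-tree built on the same 3D vectors (for example scipy.spatial.cKDTree) would likely scale better.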