Python - Converting a dataframe with columns x, y and a variable "A" into a netCDF file

Question

My (simplified) data structure is as follows:

> x = [1,1,2,2,3,3,4,4,...n,n]

> y = [1,2,1,2,1,2,1,2,...1,2]

> A = [7,5,6,5,4,6,2,5,...4,3]

"A" is a variable linked to the coordinates x and y. The dataframe consists of three columns, and the rows are ordered top down: starting with x = 1 and y = 1, going down to y = y_max, then x = 2 with y from 1 to y_max, then x = 3, and so on. So this is 2-dimensional data, and each value of "A" has its x and y coordinate values in the same row of my dataframe.

However, when I convert this directly to netCDF with

> Data.to_netcdf("filename.nc")

I get a massive number of x and y values (the dimension ends up being an index from 1 to n). For example, if my x coordinate goes from 1 to 5 like 1,1,1,2,2,2,3,3,3,4,4,4,5,5,5, the netCDF will have 15 x-coordinates, while I would like it to have only 5. The same happens with the y-coordinates. I have tried many other approaches but have not ended up with anything useful.

I would like to have a netCDF with "A" as a variable and x and y as dimensions, without the coordinate values being repeated multiple times. My real dataset has more than a hundred x values and nearly a hundred y values, so every x value is repeated once per y value and vice versa.
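To make the symptom concrete, here is a minimal sketch. It assumes `Data` was created from the dataframe without setting an index first (for example via `DataFrame.to_xarray()`); the question does not show that step, so this is a guess at the cause rather than a reconstruction of the original code.

```python
import pandas as pd

# Small stand-in for the real data (same layout as in the question)
df = pd.DataFrame({
    "x": [1, 1, 2, 2, 3, 3, 4, 4],
    "y": [1, 2, 1, 2, 1, 2, 1, 2],
    "A": [7, 5, 6, 5, 4, 6, 2, 5],
})

# With the default RangeIndex, the row number is the only dimension;
# x, y and A all become 1-D data variables along it.
flat = df.to_xarray()
print(flat.sizes)
print(list(flat.data_vars))
```

Here x and y are stored as full-length variables rather than as coordinates, which matches the repeated values described above.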

Edit:

Here is the original code, as requested by the answerer @mozway:

import pandas as pd

S_2017 = pd.read_csv("S_2017.csv")

# Sum column 3 over a grid of 0.1-degree cells (124 in longitude, 45 in latitude)
EachValue = []
for i in range(124):
    Lon_min = 19.3 + i*0.1
    Lon_max = Lon_min + 0.1
    for j in range(45):
        S_2017_Analyze = S_2017
        Lat_max = 64.2 - j*0.1
        Lat_min = Lat_max - 0.1
        # Keep only the rows inside the current cell (column 1 = longitude, column 2 = latitude)
        S_2017_Analyze = S_2017_Analyze[S_2017_Analyze.iloc[:,1] >= Lon_min]
        S_2017_Analyze = S_2017_Analyze[S_2017_Analyze.iloc[:,1] <= Lon_max]
        S_2017_Analyze = S_2017_Analyze[S_2017_Analyze.iloc[:,2] >= Lat_min]
        S_2017_Analyze = S_2017_Analyze[S_2017_Analyze.iloc[:,2] <= Lat_max]
        S_Sum_2017 = S_2017_Analyze.iloc[:,3].sum()
        # One output row per cell, labelled by the cell's lower-left corner
        Pixel_S_2017 = [round(Lat_min,2),round(Lon_min,2),S_Sum_2017]
        EachValue.append(Pixel_S_2017)
DataFrame = pd.DataFrame(EachValue,columns=["Latitude","Longitude","S_Sum_2017"])
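As a side note, the nested loop above filters the whole table once per cell. A vectorized alternative is sketched below; it is not the code from the question. It assumes the same column order (index 1 = longitude, 2 = latitude, 3 = value), uses a small made-up `points` frame in place of S_2017.csv, and differs in one respect: a point lying exactly on a cell edge is counted in one cell only, whereas the `>=`/`<=` filters above count it in both neighbouring cells.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for S_2017.csv
points = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "lon": [19.35, 19.36, 19.45, 31.65],
    "lat": [64.15, 64.12, 59.75, 59.72],
    "value": [1.0, 2.0, 3.0, 4.0],
})
lon = points.iloc[:, 1].to_numpy()
lat = points.iloc[:, 2].to_numpy()
val = points.iloc[:, 3].to_numpy()

n_lon, n_lat, step = 124, 45, 0.1

# Integer cell index of every point (i along longitude, j along latitude)
i = np.floor((lon - 19.3) / step).astype(int)
j = np.floor((64.2 - lat) / step).astype(int)
inside = (i >= 0) & (i < n_lon) & (j >= 0) & (j < n_lat)

# Accumulate the values cell by cell; np.add.at handles repeated indices
grid = np.zeros((n_lat, n_lon))
np.add.at(grid, (j[inside], i[inside]), val[inside])

# Label each cell by its lower edge, as the loop does with Lat_min / Lon_min
lat_min = np.round(64.2 - step * (np.arange(n_lat) + 1), 2)
lon_min = np.round(19.3 + step * np.arange(n_lon), 2)
binned = (
    pd.DataFrame(grid,
                 index=pd.Index(lat_min, name="Latitude"),
                 columns=pd.Index(lon_min, name="Longitude"))
    .stack()
    .rename("S_Sum_2017")
    .reset_index()
)
```

`binned` has the same three columns as `DataFrame` above, one row per cell, so it can be passed to `set_index(["Latitude", "Longitude"])` in the same way.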

And here is the solution by @mozway, which I applied:

import xarray as xr

# The names passed to set_index must match the column names defined here
S_2017 = pd.DataFrame({'Lat':S_2017.iloc[:,0],
                       'Lon':S_2017.iloc[:,1],
                       'Variable':S_2017.iloc[:,2],
                       })
xr.Dataset.from_dataframe(S_2017.set_index(["Lat","Lon"])).to_netcdf("S_2017.nc")

Answer 1

Score: 1

IIUC, you could set the x/y as index, convert to xarray and then to netCDF:

import pandas as pd
import xarray as xr

df = pd.DataFrame({'x': [1,1,2,2,3,3,4,4],
                   'y': [1,2,1,2,1,2,1,2],
                   'A': [7,5,6,5,4,6,2,5],
                   })

xr.Dataset.from_dataframe(df.set_index(['x', 'y'])).to_netcdf('filename.nc')

Dataset:

<xarray.Dataset>
Dimensions:  (x: 4, y: 2)
Coordinates:
  * x        (x) int32 1 2 3 4
  * y        (y) int32 1 2
Data variables:
    A        (x, y) int32 ...

Underlying A:

array([[7, 5],
       [6, 5],
       [4, 6],
       [2, 5]])
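
The result can also be checked in memory before writing the file. The sketch below reuses the example frame from this answer; the second part, where one row is dropped, is an added illustration of what happens when the x/y grid is incomplete and is not something the question's data is known to contain.

```python
import numpy as np
import pandas as pd
import xarray as xr

df = pd.DataFrame({'x': [1, 1, 2, 2, 3, 3, 4, 4],
                   'y': [1, 2, 1, 2, 1, 2, 1, 2],
                   'A': [7, 5, 6, 5, 4, 6, 2, 5]})

# The MultiIndex levels become the dimensions x and y
ds = xr.Dataset.from_dataframe(df.set_index(['x', 'y']))
print(ds.sizes)
print(ds['A'].values)

# If an (x, y) combination is missing, the grid is still the full
# product of x and y; the missing cell is filled with NaN.
ds_gap = xr.Dataset.from_dataframe(df.iloc[:-1].set_index(['x', 'y']))
print(ds_gap['A'].sel(x=4, y=2).item())
```

Because NaN needs a floating-point dtype, `A` is no longer an integer array in the second case, which is worth keeping in mind if the netCDF file has to store integers.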

huangapple
  • Posted on July 13, 2023, 19:27:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76678861.html