Python - Converting a dataframe with columns x, y and a variable "A" into a netCDF file
Question
My (simplified) data structure is as follows:
> x = [1,1,2,2,3,3,4,4,...n,n]
> y = [1,2,1,2,1,2,1,2,...1,2]
> A = [7,5,6,5,4,6,2,5,...4,3]
"A" is a variable linked to the coordinates x and y. The dataframe consists of three columns, and the rows are ordered top-down: starting at x = 1, y runs from 1 to y_max, then x = 2 with y again from 1 to y_max, then x = 3, and so on. So this is two-dimensional data: each value of A has its x and y coordinates in the same row of the dataframe.
However, when I convert this directly to netCDF with
> Data.to_netcdf("filename.nc")
I get a massive number of x and y values (the dimension ends up being an index from 1 to n). For example, if my x coordinate goes from 1 to 5 as 1,1,1,2,2,2,3,3,3,4,4,4,5,5,5, the netCDF file has 15 x coordinates, while I would like it to have only 5. The same happens with the y coordinates. I have tried many other approaches, but none produced anything useful.
I would like a netCDF file with "A" as a variable and x and y as dimensions, without their values being repeated. My real dataset has more than a hundred x values and nearly a hundred y values, so every x value is repeated once per y value and vice versa.
Edit:
Here is the original code, as requested by the answerer @mozway:
import pandas as pd
S_2017 = pd.read_csv("S_2017.csv")
EachValue = []
# Walk a 124 x 45 grid of 0.1-wide longitude/latitude cells.
for i in range(124):
    Lon_min = 19.3 + i*0.1
    Lon_max = Lon_min + 0.1
    for j in range(45):
        S_2017_Analyze = S_2017
        Lat_max = 64.2 - j*0.1
        Lat_min = Lat_max - 0.1
        # Keep only the rows that fall inside the current cell
        # (column 1 is compared with longitude, column 2 with latitude).
        S_2017_Analyze = S_2017_Analyze[S_2017_Analyze.iloc[:,1] >= Lon_min]
        S_2017_Analyze = S_2017_Analyze[S_2017_Analyze.iloc[:,1] <= Lon_max]
        S_2017_Analyze = S_2017_Analyze[S_2017_Analyze.iloc[:,2] >= Lat_min]
        S_2017_Analyze = S_2017_Analyze[S_2017_Analyze.iloc[:,2] <= Lat_max]
        # Sum column 3 over the cell and store one row per cell.
        S_Sum_2017 = S_2017_Analyze.iloc[:,3].sum()
        Pixel_S_2017 = [round(Lat_min,2),round(Lon_min,2),S_Sum_2017]
        EachValue.append(Pixel_S_2017)
DataFrame = pd.DataFrame(EachValue,columns=["Latitude","Longitude","S_Sum_2017"])
And here is the solution by @mozway, as I applied it:
import xarray as xr
S_2017 = pd.DataFrame({'Lat': S_2017.iloc[:,0],
                       'Lon': S_2017.iloc[:,1],
                       'Variable': S_2017.iloc[:,2],
                       })
# The index names must match the column names defined above.
xr.Dataset.from_dataframe(S_2017.set_index(["Lat","Lon"])).to_netcdf("S_2017.nc")
Answer 1
Score: 1
IIUC, you could set x and y as the index, convert to an xarray Dataset, and then write it to netCDF:
import pandas as pd
import xarray as xr
df = pd.DataFrame({'x': [1,1,2,2,3,3,4,4],
                   'y': [1,2,1,2,1,2,1,2],
                   'A': [7,5,6,5,4,6,2,5],
                   })
xr.Dataset.from_dataframe(df.set_index(['x', 'y'])).to_netcdf('filename.nc')
Dataset:
<xarray.Dataset>
Dimensions:  (x: 4, y: 2)
Coordinates:
  * x        (x) int32 1 2 3 4
  * y        (y) int32 1 2
Data variables:
    A        (x, y) int32 ...
Underlying A:
array([[7, 5],
       [6, 5],
       [4, 6],
       [2, 5]])
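To see why the index matters, here is a minimal sketch comparing the two conversions on the same toy data. It assumes the flat result in the question came from converting the frame with its default row index; the names `flat` and `grid` are only illustrative. It uses `DataFrame.to_xarray()`, which needs xarray installed and should be equivalent to `xr.Dataset.from_dataframe` here.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 1, 2, 2, 3, 3, 4, 4],
    "y": [1, 2, 1, 2, 1, 2, 1, 2],
    "A": [7, 5, 6, 5, 4, 6, 2, 5],
})

# Default RangeIndex: one dimension named "index", one entry per row,
# with x, y and A all stored as data variables along it.
flat = df.to_xarray()

# (x, y) MultiIndex: each level becomes its own dimension,
# holding only the unique coordinate values.
grid = df.set_index(["x", "y"]).to_xarray()

flat_sizes = dict(flat.sizes)
grid_sizes = dict(grid.sizes)
print(flat_sizes)  # {'index': 8}
print(grid_sizes)  # {'x': 4, 'y': 2}

# Values can now be looked up by coordinate.
print(int(grid["A"].sel(x=3, y=2)))  # 6

# grid.to_netcdf("filename.nc") would then write the gridded file
# (this needs a netCDF backend such as netCDF4 or scipy).
```

One thing to watch for with real data: the conversion expects each (x, y) pair to appear only once, so duplicated pairs would need to be aggregated first (for example with a groupby and sum).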