Create xarray Dataset with observations and averages that has combined index
Question
Suppose I have the following dataarray containing observations for different locations over time:
import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(42)
data = xr.DataArray(
    np.random.randint(1, 100, (36, 3)),
    dims=("time", "location"),
    coords={
        "time": pd.date_range("2022-01-01", periods=36, freq="10D"),
        "location": ["A", "B", "C"],
    },
    name="observations",
)
I then calculate the monthly average and combine it with the observations into a dataset:
monthly_avg = data.groupby("time.month").mean()
data = data.to_dataset()
data["average"] = monthly_avg
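This gives me a dataset in which observations has dimensions (time, location) and average has dimensions (month, location).
How can I set the indices correctly (if possible) so that when I run: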
data.sel(time="2022-01-01")
I get a subset of the dataset for one time, all locations and one monthly average (which corresponds to the selected time)?
At the moment, running this returns all twelve monthly averages for the selected timestep.
Conversely, when I run
data.sel(month=1)
I'd like a subset with only the timesteps that are in January.
Answer 1
Score: 2
To make the selection return what you want, I would first compute the monthly averages and repeat them so that they match the original time dimension.
Then I would create a multi-index, so that you can select by either a specific date or a month.
# set up test data
import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(42)
data = xr.DataArray(
    np.random.randint(1, 100, (36, 3)),
    dims=("time", "location"),
    coords={
        "time": pd.date_range("2022-01-01", periods=36, freq="10D"),
        "location": ["A", "B", "C"],
    },
    name="observations",
)

# compute the monthly averages (shape: 12 x 3) and repeat the matching row
# for every timestep with a list comprehension (months are 1-based)
data = data.to_dataset()
monthly_avg = data.groupby("time.month").mean()["observations"].values
data["average"] = (
    ("time", "location"),
    np.array([monthly_avg[i - 1, :] for i in data.time.dt.month.values]),
)

# add the month and create the multi-index
data["month"] = data.time.dt.month
data = data.set_index(day_month=["time", "month"])
You can then run the selection to get what you want.
print(data.sel(time="2022-01-01"))
<xarray.Dataset>
Dimensions:       (location: 3, month: 1)
Coordinates:
  * location      (location) <U1 'A' 'B' 'C'
  * month         (month) int64 1
    time          <U10 '2022-01-01'
Data variables:
    observations  (month, location) int64 52 93 15
    average       (month, location) float64 70.5 82.25 33.75
print(data.sel(month=1))
<xarray.Dataset>
Dimensions:       (location: 3, time: 4)
Coordinates:
  * location      (location) <U1 'A' 'B' 'C'
  * time          (time) datetime64[ns] 2022-01-01 2022-01-11 ... 2022-01-31
    month         int64 1
Data variables:
    observations  (time, location) int64 52 93 15 72 61 21 83 87 75 75 88 24
    average       (time, location) float64 70.5 82.25 33.75 ... 70.5 82.25 33.7
Note that the second command returns the same monthly average repeated for every timestep in January.
There may be a better way to set up the multi-index.
In case you haven't done so before, have a look at the pandas multi-index documentation: https://pandas.pydata.org/docs/user_guide/advanced.html
or at stack/unstack in xarray: https://xarray.pydata.org/en/v0.7.2/reshaping.html#stack-and-unstack
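As a possible alternative to the multi-index, here is a minimal sketch that keeps time as the only index and attaches month as an ordinary coordinate along it. It is not the approach used above: it relies on xarray's vectorized label indexing to broadcast the group means back onto the time axis, and the names obs, avg_on_time and ds are only illustrative. With this layout the month cannot be selected with data.sel(month=1); a boolean mask on time is used instead.

```python
import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(42)
obs = xr.DataArray(
    np.random.randint(1, 100, (36, 3)),
    dims=("time", "location"),
    coords={
        "time": pd.date_range("2022-01-01", periods=36, freq="10D"),
        "location": ["A", "B", "C"],
    },
    name="observations",
)

# Group means, indexed by month (1..12) and location.
monthly_avg = obs.groupby("time.month").mean()

# Look up each timestep's month in the group means. Because the indexer is a
# DataArray along "time", the result is laid out along "time" as well and
# carries a non-index "month" coordinate.
avg_on_time = monthly_avg.sel(month=obs.time.dt.month)

ds = obs.to_dataset()
ds["average"] = avg_on_time

# One timestep: all locations plus the single matching monthly average.
one_day = ds.sel(time="2022-01-01")

# All timesteps in January, selected with a boolean mask on "time".
january = ds.sel(time=ds["time"].dt.month == 1)

print(one_day)
print(january)
```

The trade-off is that the monthly average is still stored once per timestep, as in the multi-index version, but the dataset keeps a plain datetime index on time.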