2023-05-10 18:10:03

Create xarray Dataset with observations and averages that has combined index

Question

Suppose I have the following dataarray containing observations for different locations over time:

import numpy as np
import pandas as pd
import xarray as xr
np.random.seed(42)
data = xr.DataArray(
    np.random.randint(1, 100, (36, 3)),
    dims=("time", "location"),
    coords={
        "time": pd.date_range("2022-01-01", periods=36, freq="10D"),
        "location": ["A", "B", "C"]
    },
    name="observations"
)

and now I calculate the monthly average and combine it with the observations to a dataset:

monthly_avg = data.groupby("time.month").mean()
data = data.to_dataset()
data["average"] = monthly_avg

giving me

How can I set the indices correctly (if possible) so that when I run:

data.sel(time="2022-01-01")

I get a subset of the dataset for one time, all locations and one monthly average (which corresponds to the selected time)?

At the moment when I run this I get

returning all monthly averages for the timestep.

Conversely, when I run

data.sel(month=1)

I'd like a subset with only the timesteps that are in January.

Answer 1

Score: 2

To make the selections return what you want, I would first compute the monthly averages and repeat them to match the original time dimension.
Then I would create a multi-index, so that you can select either a specific date or a month.

# set up test data
import numpy as np
import pandas as pd
import xarray as xr
np.random.seed(42)
data = xr.DataArray(
    np.random.randint(1, 100, (36, 3)),
    dims=("time", "location"),
    coords={
        "time": pd.date_range("2022-01-01", periods=36, freq="10D"),
        "location": ["A", "B", "C"]
    },
    name="observations"
)
# compute the monthly averages and repeat them along time with a list comprehension
data = data.to_dataset()
monthly_avg = data.groupby("time.month").mean()['observations'].values
data['average'] = (('time', 'location'), np.array([monthly_avg[i-1, :] for i in data.time.dt.month]))
# add the month and create the multi-index
data['month'] = data.time.dt.month
data = data.set_index(day_month=['time', 'month'])

You can then run the selection to get what you want.

print(data.sel(time="2022-01-01"))
<xarray.Dataset>
Dimensions:       (location: 3, month: 1)
Coordinates:
  * location      (location) <U1 'A' 'B' 'C'
  * month         (month) int64 1
    time          <U10 '2022-01-01'
Data variables:
    observations  (month, location) int64 52 93 15
    average       (month, location) float64 70.5 82.25 33.75

print(data.sel(month=1))
<xarray.Dataset>
Dimensions:       (location: 3, time: 4)
Coordinates:
  * location      (location) <U1 'A' 'B' 'C'
  * time          (time) datetime64[ns] 2022-01-01 2022-01-11 ... 2022-01-31
    month         int64 1
Data variables:
    observations  (time, location) int64 52 93 15 72 61 21 83 87 75 75 88 24
    average       (time, location) float64 70.5 82.25 33.75 ... 70.5 82.25 33.7

This gives repeated values for the second command.
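
As a side note on the repeat step: the list comprehension can probably be replaced by xarray's vectorized label selection. This is only a minimal sketch on the same test data; the names `obs`, `via_sel` and `via_loop` are illustrative and not part of the code above.

```python
import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(42)
obs = xr.DataArray(
    np.random.randint(1, 100, (36, 3)),
    dims=("time", "location"),
    coords={
        "time": pd.date_range("2022-01-01", periods=36, freq="10D"),
        "location": ["A", "B", "C"],
    },
    name="observations",
)

# monthly means, indexed by month number (1..12)
monthly_avg = obs.groupby("time.month").mean()

# look up the monthly mean for the month of every timestamp;
# the indexer has dimension "time", so the result is aligned with "time"
via_sel = monthly_avg.sel(month=obs.time.dt.month).transpose("time", "location")

# the list-comprehension version, for comparison
via_loop = np.array(
    [monthly_avg.values[m - 1, :] for m in obs.time.dt.month.values]
)

print(via_sel.dims)
print(np.allclose(via_sel.values, via_loop))
```

The `m - 1` lookup in the comparison relies on all twelve months being present in the data, whereas the `sel` version looks months up by label.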

There may be a better way to set up the multi-index.
If you haven't worked with these before, have a look at the pandas MultiIndex documentation: https://pandas.pydata.org/docs/user_guide/advanced.html
or at stack/unstack in xarray: https://xarray.pydata.org/en/v0.7.2/reshaping.html#stack-and-unstack
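
One possible alternative that avoids the multi-index altogether, sketched under the assumption that filtering January with a boolean mask is an acceptable substitute for `data.sel(month=1)`: keep `time` as the only index, store the averages along `time`, and select months through the `time.dt.month` accessor. The names `ds`, `one_day` and `january` are illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(42)
obs = xr.DataArray(
    np.random.randint(1, 100, (36, 3)),
    dims=("time", "location"),
    coords={
        "time": pd.date_range("2022-01-01", periods=36, freq="10D"),
        "location": ["A", "B", "C"],
    },
    name="observations",
)

ds = obs.to_dataset()
# monthly means repeated along "time", so both variables share the same dimensions
ds["average"] = obs.groupby("time.month").mean().sel(month=obs.time.dt.month)

# one timestep: all locations plus the average of the month it falls in
one_day = ds.sel(time="2022-01-01")

# all timesteps in January, selected with a boolean mask along "time"
january_mask = (ds["time"].dt.month == 1).values
january = ds.isel(time=january_mask)

print(one_day)
print(january)
```

With this layout the average is still repeated for every January timestep, as in the multi-index version, but no extra index has to be maintained.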
