SciPy's binned_statistic_2d returning 3d statistic
Question
I am using SciPy's binned_statistic_2d() function to determine the binned averages of data in an xarray Dataset. My usage is shown below:
LTSbins = list(cloudcorrLTSw.LTS.values)[::5]
OMEGAbins = list(cloudcorrLTSw.OMEGA_700.values)[::5]
teststat = binned_statistic_2d(cloudcorrLTSw.LTS, cloudcorrLTSw.OMEGA_700, cloudcorrLTSw.CLD_RHO,
                               statistic=np.nanmean, bins=[LTSbins, OMEGAbins])
where cloudcorrLTSw.LTS, cloudcorrLTSw.OMEGA_700, and cloudcorrLTSw.CLD_RHO are 138-entry vectors; the former two are dimensions of the xarray dataset and the latter is a variable in the dataset on LTS-OMEGA_700 space. The LTS, OMEGA_700, and CLD_RHO data come from a time series derived from spatial data and have been transformed into their current state. As a result, there are as many CLD_RHO values as there are coordinate values, so the data is sparse, with many NaNs per row and column. This is why I use np.nanmean as the statistic instead of the built-in mean.
Per the SciPy documentation page for this function, it is intended to return (among other things) an ndarray of shape (nx, ny), where nx and ny are determined by the bins kwarg in the binned_statistic_2d function call. What I'm getting, however, is a 138x27x27 array (27x27 because the described binning yields 28 edge values per axis, hence 27 bins), with the first dimension being completely full of NaNs. As a result, I must pass teststat.statistic through np.nanmean again to remove that surplus dimension, an operation that doesn't take much time but makes me nervous about interfering with the data. It's probably wise to interpolate the sparse data so that the plot doesn't look like this, but that's another question in and of itself. So, before I solve that: is this the intended output of binned_statistic_2d()?
Updated image to follow some3128's suggestion:
Updated image following some3128's answer using statistic = 'count':
I had 'forced' the data I was using into LTS-omega space by creating an xarray object with LTS and omega as the coordinates and CLD_RHO as the variable on those axes. Using the original data (from before I made the new dataset), I was able to get this image, which seems much more reasonable. I think I can count this as solved.
Answer 1
Score: 1
It looks like the issue lies either in how you're calling the function or in the characteristics of your input data. One possible culprit is how you've defined the bins using the LTSbins and OMEGAbins lists; it's worth confirming that these lists accurately represent the bin edges you intend to work with.
Furthermore, it's a good idea to verify that the arrays cloudcorrLTSw.LTS, cloudcorrLTSw.OMEGA_700, and cloudcorrLTSw.CLD_RHO have the shapes you expect and are consistent with your binning strategy. If any of these arrays has an unexpected shape or contains NaN values, that could be causing the unexpected behavior you're seeing in binned_statistic_2d.
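One way to run that shape check is sketched below. It uses made-up random arrays (x, y, values_1d and values_2d are hypothetical stand-ins, not the asker's data). It relies on the documented behaviour that values may either have the same shape as x or be a list of sequences that each have that shape, in which case the statistic is computed on each sequence independently. If CLD_RHO is in fact stored as a 2-D variable over the two coordinates (an assumption here), this would account for a leading dimension of length 138 in the result.

```python
import numpy as np
from scipy.stats import binned_statistic_2d

rng = np.random.default_rng(0)
n = 20  # small stand-in for the 138 entries in the question

# Hypothetical sample coordinates, one pair per observation
x = rng.uniform(0, 1, n)
y = rng.uniform(0, 1, n)

# Case 1: values has the same shape as x and y
values_1d = rng.normal(size=n)
res_1d = binned_statistic_2d(x, y, values_1d, statistic='mean', bins=4)

# Case 2: values is 2-D; each row is treated as a separate set of values,
# so the statistic gains a leading dimension of length n
values_2d = rng.normal(size=(n, n))
res_2d = binned_statistic_2d(x, y, values_2d, statistic='mean', bins=4)

print('inputs:', x.shape, y.shape, values_1d.shape, values_2d.shape)
print('1-D values ->', res_1d.statistic.shape)
print('2-D values ->', res_2d.statistic.shape)
```

Printing the .shape of the three real inputs in the same way should show quickly whether the values array has an extra dimension.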
Answer 2
Score: 1
I synthesised some test data (with 50% NaNs in rho) and ran it through the binning function. Results and code below.
Data:
Output from binning function visualised using a heatmap:
When I binned the data, I set the statistic to "count". This provides a sanity check on the number of observations in each bracket. The binning function also returns the edge values of the bins, which I've overlaid onto the plot. The colouring reflects the density of the raw data on the scatter plot, so it looks like it's making sense.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import binned_statistic_2d

np.random.seed(0)

# Mock data
n_pts = 138
lts1 = np.random.uniform(-0.4, 1, n_pts // 2)
lts2 = np.random.randn(n_pts // 2) * 0.1 - 0.1
lts = np.empty(n_pts)
lts[::2] = lts1
lts[1::2] = lts2
hpa = np.random.uniform(-0.001, 0.0015, n_pts)

# rho has roughly 50% NaNs
rho = np.random.randn(n_pts) + np.where(np.random.uniform(size=n_pts) > 0.5, np.nan, 0)

# View the data
plt.scatter(lts, hpa, c=rho, label='rho')
plt.ylabel('hpa')
plt.xlabel('lts')
plt.legend()
plt.show()

# 2D binning
# hpa is the "first dimension" (row index), which the function refers to as "x";
# on the plots, this is actually the y axis
statistic, hpa_edge, lts_edge, binnumber = binned_statistic_2d(
    hpa, lts, rho, statistic='count', bins=9
)

# Plot
sns.heatmap(statistic, annot=True,
            xticklabels=np.round(lts_edge, 1),
            yticklabels=np.round(hpa_edge * 1e3, 1),
            cmap='plasma')
plt.gca().set_ylabel('hpa * 1e3 bin')
plt.gca().set_xlabel('lts bin')
plt.gca().invert_yaxis()
plt.show()
I think it'd be worth seeing if you can run your data through the steps above.
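If the values sit on a 2-D grid over the two coordinates rather than in 1-D arrays, one possible way to get them into the 1-D form used above is sketched here. This is an assumption about the data layout, not something confirmed by the question, and lts_coord, omega_coord and rho_grid are hypothetical mock arrays. The sketch expands the coordinates with np.meshgrid, flattens everything, drops the NaN samples, and then uses the built-in 'mean' statistic.

```python
import numpy as np
from scipy.stats import binned_statistic_2d

rng = np.random.default_rng(1)

# Hypothetical gridded data: rho_grid[i, j] belongs to (lts_coord[i], omega_coord[j])
lts_coord = np.linspace(-0.4, 1.0, 30)
omega_coord = np.linspace(-0.001, 0.0015, 30)
rho_grid = rng.normal(size=(30, 30))
rho_grid[rng.uniform(size=rho_grid.shape) > 0.5] = np.nan  # roughly 50% NaNs

# Expand the 1-D coordinates to the grid shape, then flatten all three arrays
lts_2d, omega_2d = np.meshgrid(lts_coord, omega_coord, indexing='ij')
lts_flat = lts_2d.ravel()
omega_flat = omega_2d.ravel()
rho_flat = rho_grid.ravel()

# Keep only the samples that actually carry a value
valid = ~np.isnan(rho_flat)

result = binned_statistic_2d(
    lts_flat[valid], omega_flat[valid], rho_flat[valid],
    statistic='mean', bins=6
)

# With the 'mean' statistic, bins that receive no samples come out as NaN
print(result.statistic.shape)
```

Dropping NaNs before binning avoids a second pass with np.nanmean over the result. Passing statistic=np.nanmean on the unfiltered 1-D arrays is an alternative, at the cost of calling a Python function once per bin.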