How to subsample a pandas df so that its variable distribution fits another distribution?
Question
I have two astronomical data tables, df_jpas and df_gaia. They are catalogues of galaxies that contain, among other things, the redshift z of each galaxy. I can plot the redshift distributions of the two catalogues, and they look like this:
[Figure: redshift distributions of df_jpas and df_gaia]
What I want now is a subsampled df_jpas whose distribution of z is as close as possible to that of df_gaia within the range 0.8 < z < 2.3, that is, I want:
[Figure: desired redshift distribution of the subsampled df_jpas]
How do I do this?
Answer 1
Score: 1
Here is a solution.
Let's first cut the dataframes into the desired z-range:
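import numpy as np
import matplotlib.pyplot as plt
left_z_edge, right_z_edge = 0.8, 2.3
stepsize = 0.02
# .copy() so that a column can be added later without a SettingWithCopyWarning
df_jpas = df_jpas[(df_jpas.z > left_z_edge) & (df_jpas.z < right_z_edge)].copy()
df_gaia = df_gaia[(df_gaia.z > left_z_edge) & (df_gaia.z < right_z_edge)].copy()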
Next, we want to calculate the distributions (or histograms) of these dataframes:
bins = np.arange(left_z_edge, right_z_edge + stepsize, step=stepsize)  # shared bin edges for both catalogues
jpas_hist, jpas_bin_edges = np.histogram(df_jpas.z, bins=bins)
jpas_bin_centers = (jpas_bin_edges + stepsize/2)[:-1]  # bin centers, used later instead of the bin edges
gaia_hist, gaia_bin_edges = np.histogram(df_gaia.z, bins=bins)
gaia_bin_centers = (gaia_bin_edges + stepsize/2)[:-1]
Now comes the critical part of the code. Dividing gaia_hist by jpas_hist gives, for each z-bin, the fraction of df_jpas galaxies that should be kept, and this fraction is the probability we will use for subsampling. It is a valid probability only in bins where df_gaia has no more galaxies than df_jpas:
jpas_occup_prob = gaia_hist/jpas_hist
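The plain ratio is undefined where a df_jpas bin is empty and exceeds 1 where df_gaia has more objects than df_jpas. A minimal sketch of a guarded version follows; the helper name keep_probability and the counts are made up for illustration and are not part of the original answer.

```python
import numpy as np

def keep_probability(target_hist, source_hist):
    """Per-bin probability of keeping a source object (hypothetical helper)."""
    target = np.asarray(target_hist, dtype=float)
    source = np.asarray(source_hist, dtype=float)
    prob = np.zeros_like(target)
    # Divide only where the source bin is non-empty; empty bins keep probability 0.
    np.divide(target, source, out=prob, where=source > 0)
    # A ratio above 1 means the source has too few objects in that bin;
    # the most that can be done there is to keep all of them.
    return np.clip(prob, 0.0, 1.0)

# Made-up counts for four bins (illustrative only).
gaia_counts = np.array([10, 5, 0, 8])
jpas_counts = np.array([40, 5, 3, 0])
probs = keep_probability(gaia_counts, jpas_counts)
print(probs)
```

If some bins hit the clip, dividing every raw ratio by the largest one keeps the shape of the target distribution, at the cost of a smaller subsample.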
Next, we create a function to apply to the df_jpas dataframe. It fills an additional column with a flag that says whether a particular galaxy should be "activated" (kept, 1) or dropped (0) to produce the desired distribution:
def activate_QSO(z_val):
    idx = (np.abs(jpas_bin_centers - z_val)).argmin()  # index of the bin center closest to the z of the current QSO
    ocup_prob = jpas_occup_prob[idx]  # occupation probability of that bin
    activation_flag = int(np.random.random() < ocup_prob)  # activate (1) or not (0) this QSO with that probability
    return activation_flag

df_jpas['activation_flag'] = df_jpas['z'].apply(activate_QSO)
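For large catalogues, a row-by-row apply can be slow. The same idea can be sketched in vectorized form. This is an alternative under stated assumptions, not the answer's own code: the function name subsample_to_target, the seed parameter, and the synthetic data are hypothetical, and np.linspace is used for the bin edges because NumPy recommends it over np.arange for non-integer steps.

```python
import numpy as np
import pandas as pd

def subsample_to_target(df, target_z, bins, seed=0):
    """Return the rows of df kept so that df.z roughly follows target_z (sketch)."""
    rng = np.random.default_rng(seed)
    source_hist, _ = np.histogram(df["z"], bins=bins)
    target_hist, _ = np.histogram(target_z, bins=bins)
    # Guarded per-bin keep probability, clipped to [0, 1].
    prob = np.zeros(len(bins) - 1)
    np.divide(target_hist, source_hist, out=prob, where=source_hist > 0)
    prob = np.clip(prob, 0.0, 1.0)
    # Bin index of every row; rows outside the bin range
    # (including a value exactly on the last edge) get probability 0.
    idx = np.digitize(df["z"].to_numpy(), bins) - 1
    inside = (idx >= 0) & (idx < len(prob))
    row_prob = np.where(inside, prob[np.clip(idx, 0, len(prob) - 1)], 0.0)
    # One uniform random draw per row decides whether it is kept.
    keep = rng.random(len(df)) < row_prob
    return df[keep]

# Synthetic stand-ins for the two catalogues (illustrative only).
rng_demo = np.random.default_rng(42)
df_source = pd.DataFrame({"z": rng_demo.uniform(0.8, 2.3, 20000)})
target = rng_demo.normal(1.5, 0.2, 3000)
target = target[(target > 0.8) & (target < 2.3)]
bins = np.linspace(0.8, 2.3, 76)  # 75 bins of width 0.02
df_sub = subsample_to_target(df_source, target, bins)
print(len(df_source), len(target), len(df_sub))
```

With the real data, the call would presumably look like subsample_to_target(df_jpas, df_gaia.z, bins). Fixing the seed makes the subsample reproducible between runs.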
Using this flag, we can plot all galaxies that have 1 in this column, which gives us the desired distribution:
plt.hist(df_jpas[df_jpas.activation_flag==1].z, bins=100, alpha=0.5, label='jpas mock, subsampled')
plt.hist(df_gaia.z, bins=100, alpha=0.5, label='GAIA QSO')
plt.ylabel('N(z)')
plt.xlabel('z')
plt.legend()
plt.show()
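Besides comparing the histograms by eye, the agreement can be checked numerically. A minimal sketch, assuming the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) is an acceptable measure; the helper name ks_distance and the synthetic samples are illustrative and not part of the original answer.

```python
import numpy as np

def ks_distance(sample_a, sample_b):
    """Maximum gap between the empirical CDFs of two 1-D samples."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Synthetic samples (illustrative only): two from the same distribution, one shifted.
rng = np.random.default_rng(1)
same_1 = rng.normal(1.5, 0.2, 5000)
same_2 = rng.normal(1.5, 0.2, 5000)
shifted = rng.normal(1.8, 0.2, 5000)
d_same = ks_distance(same_1, same_2)
d_shifted = ks_distance(same_1, shifted)
print(d_same, d_shifted)
```

Applied to the catalogues, ks_distance(df_jpas[df_jpas.activation_flag == 1].z, df_gaia.z) should come out small if the subsampling worked, and noticeably smaller than the value for the unsampled df_jpas.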