2023-05-21 00:54:56

How to subsample a pandas df so that its variable distribution fits another distribution?

Question

I have two astronomical data tables, df_jpas and df_gaia. They are catalogues of galaxies that contain, among other things, the redshifts z of the galaxies. I can plot the redshift distributions of the two catalogues, and they look like this:

What I want now is to create a subsampled df_jpas whose distribution of z is as close as possible to that of df_gaia within the range 0.8 < z < 2.3. That is, I want:

How do I do this?

Answer 1

Score: 1

Here is a solution.

Let's first cut the dataframes into the desired z-range:

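import numpy as np
import matplotlib.pyplot as plt

left_z_edge, right_z_edge = 0.8, 2.3
stepsize = 0.02

# .copy() avoids pandas' SettingWithCopyWarning when a column is added to df_jpas later
df_jpas = df_jpas[(df_jpas.z > left_z_edge) & (df_jpas.z < right_z_edge)].copy()
df_gaia = df_gaia[(df_gaia.z > left_z_edge) & (df_gaia.z < right_z_edge)].copy()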

Next, we want to calculate the distributions (or histograms) of these dataframes:

bin_edges = np.arange(left_z_edge, right_z_edge + stepsize, step=stepsize)  # same edges for both catalogues

jpas_hist, jpas_bin_edges = np.histogram(df_jpas.z, bins=bin_edges)
jpas_bin_centers = (jpas_bin_edges + stepsize/2)[:-1]  # bin centers, used later instead of the bin edges

gaia_hist, gaia_bin_edges = np.histogram(df_gaia.z, bins=bin_edges)
gaia_bin_centers = (gaia_bin_edges + stepsize/2)[:-1]
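As a side note, np.arange with a floating-point step can occasionally produce one edge more or fewer than expected because of rounding. The sketch below shows one possible alternative that builds the same kind of bins with np.linspace. The names n_bins, bin_edges and bin_centers are introduced here for illustration and are not part of the answer's code.

```python
import numpy as np

left_z_edge, right_z_edge = 0.8, 2.3
stepsize = 0.02

# Number of bins implied by the step size (rounded to guard against float error).
n_bins = int(round((right_z_edge - left_z_edge) / stepsize))

# linspace includes both end points, so there are n_bins + 1 edges.
bin_edges = np.linspace(left_z_edge, right_z_edge, n_bins + 1)

# Midpoint of each bin: one value fewer than the number of edges.
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
```

Passing bin_edges to np.histogram for both catalogues keeps the two histograms on identical bins.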

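The critical part comes next: dividing gaia_hist by jpas_hist gives, for each z-bin, the fraction of df_jpas objects that must be kept so that the bin ends up with as many objects as the same bin in df_gaia. This ratio is the probability we will use for subsampling. It is a valid probability only in bins where jpas_hist is non-zero and at least as large as gaia_hist: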

jpas_occup_prob = gaia_hist/jpas_hist
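The plain division above yields inf or nan in bins where jpas_hist is zero, and values above 1 where df_gaia has more objects than df_jpas. A minimal sketch of one way to guard against both cases follows. The bin counts are made-up numbers for illustration, and shape_only_prob is a hypothetical extra, not something the answer uses.

```python
import numpy as np

# Made-up bin counts standing in for gaia_hist and jpas_hist.
gaia_hist = np.array([5, 20, 45, 0, 8])
jpas_hist = np.array([50, 40, 30, 10, 0])

# Divide only where the denominator is non-zero; empty jpas bins get ratio 0.
ratio = np.divide(gaia_hist, jpas_hist,
                  out=np.zeros(len(jpas_hist), dtype=float),
                  where=jpas_hist > 0)

# A probability cannot exceed 1, so bins where gaia outnumbers jpas are capped.
jpas_occup_prob = np.clip(ratio, 0.0, 1.0)

# Optional: rescale so the largest ratio is exactly 1. This matches the shape
# of the gaia distribution rather than its absolute counts.
shape_only_prob = ratio / ratio.max()
```

Capping at 1 means the affected bins stay under-populated relative to df_gaia. Rescaling by the maximum ratio avoids that at the cost of a smaller subsample.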

Next, we create a function to apply to the df_jpas dataframe. It fills an additional column with a flag that says whether a particular galaxy should be "activated" (kept) or dropped in order to give the desired distribution:

def activate_QSO(z_val):
    idx = (np.abs(jpas_bin_centers - z_val)).argmin()  # index of the bin center closest to the z of the current QSO
    ocup_prob = jpas_occup_prob[idx]  # probability of keeping an object in that bin
    activation_flag = int(np.random.random() < ocup_prob)  # 1 = keep, 0 = drop, drawn with the probability above
    return activation_flag

df_jpas['activation_flag'] = df_jpas['z'].apply(activate_QSO)
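For large catalogues, the row-by-row apply can be slow. The same idea can be sketched in vectorized form with np.digitize, which looks up the bin of every z at once. The data below are random stand-ins (an assumption, not the real catalogues), and the seeded generator rng is only there to make the sketch reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded generator, an assumption for reproducibility

# Random stand-ins for the objects built in the answer.
bin_edges = np.linspace(0.8, 2.3, 76)
jpas_occup_prob = rng.uniform(0.0, 1.0, size=len(bin_edges) - 1)
df_jpas = pd.DataFrame({"z": rng.uniform(0.8, 2.3, size=10_000)})

# Index of the bin each z falls into: 0 .. n_bins - 1.
bin_idx = np.digitize(df_jpas["z"].to_numpy(), bin_edges) - 1
bin_idx = np.clip(bin_idx, 0, len(jpas_occup_prob) - 1)  # guard values on the outer edges

# One uniform draw per row, compared with the keep probability of its bin.
keep = rng.random(len(df_jpas)) < jpas_occup_prob[bin_idx]
df_jpas["activation_flag"] = keep.astype(int)

# The subsampled catalogue itself.
df_jpas_sub = df_jpas[keep]
```

This assigns each object to the bin that contains it, which gives the same bin as the closest-center lookup except for values lying exactly on a bin edge.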

Using this flag, we can plot all galaxies that have 1 in this column, which gives the desired distribution:

plt.hist(df_jpas[df_jpas.activation_flag==1].z, bins=100, alpha=0.5, label='jpas mock, subsampled')
plt.hist(df_gaia.z, bins=100, alpha=0.5, label='GAIA QSO')
plt.ylabel('N(z)')
plt.xlabel('z')
plt.legend()
plt.show()
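Besides comparing the plots by eye, the match can be checked numerically. The following is a rough end-to-end sketch on synthetic redshifts: a flat sample plays the role of df_jpas and a peaked one the role of df_gaia. Both samples, their sizes and the deviation measure max_rel_diff are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic redshifts (not real catalogue data):
# a large flat "jpas" sample and a smaller peaked "gaia" sample.
z_jpas = rng.uniform(0.8, 2.3, size=200_000)
z_gaia = rng.normal(1.5, 0.25, size=20_000)
z_gaia = z_gaia[(z_gaia > 0.8) & (z_gaia < 2.3)]

edges = np.linspace(0.8, 2.3, 76)
jpas_hist, _ = np.histogram(z_jpas, bins=edges)
gaia_hist, _ = np.histogram(z_gaia, bins=edges)

# Keep probability per bin, guarded against empty bins and ratios above 1.
prob = np.clip(gaia_hist / np.maximum(jpas_hist, 1), 0.0, 1.0)

# Vectorized accept/reject step.
idx = np.clip(np.digitize(z_jpas, edges) - 1, 0, len(prob) - 1)
z_sub = z_jpas[rng.random(len(z_jpas)) < prob[idx]]

sub_hist, _ = np.histogram(z_sub, bins=edges)

# Largest per-bin difference, relative to the total gaia count.
max_rel_diff = np.abs(sub_hist - gaia_hist).max() / gaia_hist.sum()
print(f"kept {len(z_sub)} of {len(z_jpas)}; max per-bin deviation = {max_rel_diff:.4f}")
```

Because the subsampling is random, the two histograms agree only up to statistical noise, and the agreement improves as the df_jpas sample gets larger relative to df_gaia.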
