August 9, 2023, 08:50:17

How can I show the distribution of the columns of a dataset, ordering the images in a specific way?

Question

I am trying to show the distribution of the columns of a dataset that has these variables: ['precip', 'pressureChange', 'pressureMeanSeaLevel', 'relativeHumidity', 'snow', 'temperature', 'temperatureDewPoint', 'temperatureFeelsLike', 'uvIndex', 'visibility', 'windDirection', 'windSpeed']. Each variable has one or more measures, and each measure is a column of the dataset:

{
'precip': ['meanPrecip', 'minPrecip', 'maxPrecip'],
'pressureChange': ['meanPressurechange', 'minPressurechange', 'maxPressurechange'],
'pressureMeanSeaLevel': ['meanPressuremeansealevel', 'minPressuremeansealevel', 'maxPressuremeansealevel'],
'relativeHumidity': ['meanRelativehumidity', 'minRelativehumidity', 'maxRelativehumidity'],
'snow': ['meanSnow', 'minSnow', 'maxSnow'],
'temperature': ['meanTemperature', 'minTemperature', 'maxTemperature'],
'temperatureDewPoint': ['meantemperatureDewPoint', 'mintemperatureDewPoint', 'maxtemperatureDewPoint'],
'temperatureFeelsLike': ['meanTemperaturefeelslike', 'minTemperaturefeelslike', 'maxTemperaturefeelslike'],
'uvIndex': ['modeUvindex', 'maxUvindex', 'minUvindex'],
'visibility': ['meanVisibility', 'minVisibility', 'maxVisibility'],
'windDirection': ['meanWinddirection'],
'windSpeed': ['meanWindspeed', 'minWindspeed', 'maxWindspeed']
}

I want to plot them where each row of graphs is a variable and each column of graphs is a measurement.

I tried this:

variable_types = ['precip', 'pressureChange', 'pressureMeanSeaLevel', 'relativeHumidity', 'snow', 'temperature', 'temperatureDewPoint', 'temperatureFeelsLike', 'uvIndex', 'visibility', 'windDirection', 'windSpeed']
variable_columns = {
    'precip': ['meanPrecip', 'minPrecip', 'maxPrecip'],
    'pressureChange': ['meanPressurechange', 'minPressurechange', 'maxPressurechange'],
    'pressureMeanSeaLevel': ['meanPressuremeansealevel', 'minPressuremeansealevel', 'maxPressuremeansealevel'],
    'relativeHumidity': ['meanRelativehumidity', 'minRelativehumidity', 'maxRelativehumidity'],
    'snow': ['meanSnow', 'minSnow', 'maxSnow'],
    'temperature': ['meanTemperature', 'minTemperature', 'maxTemperature'],
    'temperatureDewPoint': ['meanTemperaturedewpoint', 'minTemperaturedewpoint', 'maxTemperaturedewpoint'],
    'temperatureFeelsLike': ['meanTemperaturefeelslike', 'minTemperaturefeelslike', 'maxTemperaturefeelslike'],
    'uvIndex': ['modeUvindex', 'maxUvindex', 'minUvindex'],
    'visibility': ['meanVisibility', 'minVisibility', 'maxVisibility'],
    'windDirection': ['meanWinddirection'],
    'windSpeed': ['meanWindspeed', 'minWindspeed', 'maxWindspeed']
}
for i, var_type in enumerate(variable_types):
    columns = variable_columns[var_type]
    print(columns)
    fig, axes = plt.subplots(nrows=len(variable_types), ncols=len(columns), figsize=(5, 5))
    axes = axes.flat

    data = METEO_COMPLETO[columns]

    ordered_columns = sorted(columns, key=lambda x: x.split('_')[0])
    print(ordered_columns)
    data[ordered_columns].hist(ax=axes[i], bins=20, alpha=0.7, edgecolor='black', color='skyblue')
    axes[i].set_title(f'Distribución de {var_type}')
    axes[i].set_xlabel('Valor')
    axes[i].set_ylabel('Frecuencia')
    axes[i].legend(ordered_columns)
    plt.tight_layout()
    plt.show()

But I got this result (last 4 variables shown):

Another possible option is to display the available measures in the same graph with different colors to see the distributions of the minimum, mean and maximum value of each variable as a whole.

Answer 1

Score: 1

For me it is easier to keep the grid layout of such a plot arrangement in mind by nesting two for loops: one for the rows of the plot (the variables) and one for the columns of the plot (which I have named "statistics").

However, the whole plot gets too large to format nicely, so I guess a single figure is not the best way to present this data.
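If a single figure is still wanted, one thing that may help with the size problem is deriving the figure size from the grid dimensions. This is only a minimal sketch: the helper name `grid_figsize` and the per-cell sizes (3 by 2 inches) are assumptions for illustration, not part of the original answer.

```python
def grid_figsize(n_rows, n_cols, cell_width=3.0, cell_height=2.0):
    """Return a (width, height) tuple in inches for an n_rows x n_cols grid.

    cell_width and cell_height are arbitrary illustrative defaults.
    """
    return (n_cols * cell_width, n_rows * cell_height)


# 12 variables (rows) and 3 statistics (columns), as in the question
figsize = grid_figsize(12, 3)
print(figsize)
```

The resulting tuple could be passed as `figsize` to `plt.subplots` in place of the fixed `(5, 5)` used in the code that follows.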

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

variable_columns = {
    "precip": ["meanPrecip", "minPrecip", "maxPrecip"],
    "pressureChange": ["meanPressurechange", "minPressurechange", "maxPressurechange"],
    "pressureMeanSeaLevel": ["meanPressuremeansealevel", "minPressuremeansealevel", "maxPressuremeansealevel"],
    "relativeHumidity": ["meanRelativehumidity", "minRelativehumidity", "maxRelativehumidity"],
    "snow": ["meanSnow", "minSnow", "maxSnow"],
    "temperature": ["meanTemperature", "minTemperature", "maxTemperature"],
    "temperatureDewPoint": ["meanTemperaturedewpoint", "minTemperaturedewpoint", "maxTemperaturedewpoint"],
    "temperatureFeelsLike": ["meanTemperaturefeelslike", "minTemperaturefeelslike", "maxTemperaturefeelslike"],
    "uvIndex": ["meanUvindex", "maxUvindex", "minUvindex"],
    "visibility": ["meanVisibility", "minVisibility", "maxVisibility"],
    "windDirection": ["meanWinddirection"],
    "windSpeed": ["meanWindspeed", "minWindspeed", "maxWindspeed"],
}
# generate fake data
meteo_data = {}
for cols in variable_columns.values():
    for col in cols:
        meteo_data[col] = np.random.randn(1000)
METEO_COMPLETO = pd.DataFrame(meteo_data)
# statistics list to get correct plot grid and column name
statistics = ["max", "mean", "min"]
fig, axes = plt.subplots(
    nrows=len(variable_columns), ncols=len(statistics), figsize=(5, 5)
)
# loop over zipped plot rows and variable names
for i_row, (axes_row, variable) in enumerate(zip(axes, variable_columns.keys())):
    # loop over zipped particular plots in row and respective statistic name
    for ax, stat in zip(axes_row, statistics):
        # str.title() upper-cases the first letter and lower-cases the rest
        meteo_column = stat + variable.title()
        try:
            data = METEO_COMPLETO[meteo_column]
        except KeyError:
            # statistic not found (e.g. maxWinddirection)
            # => hide the axis and go to the next plot
            ax.axis("off")
            continue
        ax.hist(data, label=meteo_column)
        ax.set_title(f"Distribución de {variable}")
        ax.set_xlabel("Valor")
        ax.set_ylabel("Frecuencia")
        ax.legend()  # uses the label passed to ax.hist above
        if i_row == 0:
            # the first row shows the statistic name as a column header instead
            ax.set_title(stat)
plt.tight_layout()
plt.show()
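
The question also mentions another option: showing the available measures of each variable in the same graph with different colors. A rough sketch of that idea follows. The function name `plot_overlaid`, the shortened `example_columns` mapping, the random data, and the sizes chosen are all assumptions made for illustration, so this is one possible way to do it rather than a definitive implementation.

```python
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_overlaid(df, variable_columns, ncols=3, bins=20):
    """Draw one subplot per variable, overlaying its available statistics."""
    nrows = math.ceil(len(variable_columns) / ncols)
    fig, axes = plt.subplots(
        nrows=nrows, ncols=ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False
    )
    flat_axes = axes.ravel()
    for ax, (variable, columns) in zip(flat_axes, variable_columns.items()):
        # Ignore column names that are missing from the DataFrame.
        present = [col for col in columns if col in df.columns]
        for col in present:
            # Each hist call takes the next color of the default color cycle.
            ax.hist(df[col].dropna(), bins=bins, alpha=0.5, label=col)
        ax.set_title(variable)
        ax.set_xlabel("Value")
        ax.set_ylabel("Frequency")
        if present:
            ax.legend(fontsize="small")
    # Hide any unused cells at the end of the grid.
    for ax in flat_axes[len(variable_columns):]:
        ax.axis("off")
    fig.tight_layout()
    return fig, axes


# Fake data for illustration only (hypothetical, shortened column mapping).
example_columns = {
    "temperature": ["meanTemperature", "minTemperature", "maxTemperature"],
    "windDirection": ["meanWinddirection"],
}
rng = np.random.default_rng(0)
example_df = pd.DataFrame(
    {col: rng.normal(size=200) for cols in example_columns.values() for col in cols}
)
fig, axes = plot_overlaid(example_df, example_columns, ncols=2)
# plt.show()  # uncomment to display the figure interactively
```

With the real data, the call would presumably be `plot_overlaid(METEO_COMPLETO, variable_columns)`. Since each variable needs only one subplot, the 12 variables fit in a 4 x 3 grid instead of 12 x 3, which may also ease the size problem.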
