May 18, 2023 04:50:23

How to replace NaN values with the corresponding month and hour mean value

Question

I'm trying to replace the NaN values in a DataFrame with the mean value for the corresponding month and hour of that DataFrame.

Let's say I have a DataFrame of generic measurements where, at random, some rows and columns have no measurement. The DataFrame's first column is a datetime column with hourly frequency.

I've created another DataFrame that holds the mean value for every hour of each month, but I can't replace the NaN values of the first DataFrame with the corresponding mean.

First, let's create a generic DataFrame like the one described:

import pandas as pd
import numpy as np
p = 0.1 
columns = ['A','B','C','D','E','F','G','H','I','J']
size = 1000
df = pd.DataFrame(np.random.randint(0,100,size=(size,len(columns))), columns= columns)
mask = np.random.choice([True,False] , size= df.shape, p=[p,1-p])
df = df.mask(mask)
df.insert(0, 'date' ,pd.date_range('2000-01-01 00:00' , periods= size, freq = 'H'))

Then let's create the DataFrame with the mean values:

mean_df = df.groupby([df.date.dt.month , df.date.dt.hour]).mean()
mean_df.index.set_names(['month' , 'hour'],inplace=True)
mean_df.reset_index(inplace=True)

I can do it for one column, but I couldn't get it to work for all the columns:

empty = np.where(df['A'].isna())[0].tolist()
r = df.columns.get_loc('A')       # position of 'A' in df
m = mean_df.columns.get_loc('A')  # position of 'A' in mean_df (shifted by the month and hour columns)
for a in empty:
    ts = df.iat[a, 0]  # timestamp of the row being filled
    row = np.where((mean_df.month == ts.month) & (mean_df.hour == ts.hour))[0][0]
    df.iat[a, r] = mean_df.iat[row, m]

Answer 1

Score: 0

I guess the easiest approach is iterating over every column:

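for c in columns:
    # Row positions where column c is missing
    empty = np.where(df[c].isna())[0].tolist()
    r = df.columns.get_loc(c)       # position of c in df
    m = mean_df.columns.get_loc(c)  # position of c in mean_df (shifted by the month and hour columns)
    for a in empty:
        ts = df.iat[a, 0]  # timestamp of the row being filled
        row = np.where((mean_df.month == ts.month) & (mean_df.hour == ts.hour))[0][0]
        df.iat[a, r] = mean_df.iat[row, m]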
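
For what it's worth, the per-cell lookup can probably be done as a single join instead of nested Python loops. The following is only a sketch under my own assumptions: the data is rebuilt with a seeded generator so the example stands alone, the date range is built with a one-hour `pd.Timedelta` step, and the names `keys`, `lookup` and `filled` are hypothetical, not from the question.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (seeded, so it is repeatable).
rng = np.random.default_rng(0)
columns = list("ABCDEFGHIJ")
size = 1000
df = pd.DataFrame(rng.integers(0, 100, size=(size, len(columns))), columns=columns)
df = df.mask(rng.random(df.shape) < 0.1)  # roughly 10% of the cells become NaN
df.insert(0, "date", pd.date_range("2000-01-01 00:00", periods=size, freq=pd.Timedelta(hours=1)))

# Mean table keyed by (month, hour), equivalent to mean_df in the question.
month = df["date"].dt.month.rename("month")
hour = df["date"].dt.hour.rename("hour")
mean_df = df.groupby([month, hour])[columns].mean().reset_index()

# Left-join every row's (month, hour) onto the mean table.
# A left merge keeps the row order of the left frame, so row i of
# `lookup` holds the means that belong to row i of `df`.
keys = pd.DataFrame({"month": month, "hour": hour})
lookup = keys.merge(mean_df, on=["month", "hour"], how="left")
lookup.index = df.index  # merge returns a fresh index, so restore df's labels

# fillna with a DataFrame aligns on index and columns and only touches NaN cells.
filled = df.copy()
filled[columns] = df[columns].fillna(lookup[columns])
```

Only the missing cells change; a cell would stay NaN only if its whole (month, hour) group were missing in that column.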

Answer 2

Score: 0

Neither pretty nor fast, but here you go:

df['month'] = df['date'].dt.month
df['hour'] = df['date'].dt.hour
means = mean_df.set_index(['month', 'hour'])  # build the lookup table once instead of once per row
def func(x, means):
    # Replace each NaN in row x with the mean for that row's (month, hour)
    return pd.Series([means.loc[(int(x['month']), int(x['hour'])), c] if np.isnan(x[c]) else x[c] for c in x.index], index=x.index)
df = df.set_index('date').apply(lambda x: func(x, means), axis=1).drop(columns=['month', 'hour']).reset_index()
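
If speed matters, a vectorized route that skips the separate mean table may be worth trying: `groupby(...).transform("mean")` returns the group mean aligned row for row with the original frame, which can be passed straight to `fillna`. This is a sketch, not a tested drop-in: the seeded data, the one-hour `pd.Timedelta` step, and the names `group_means` and `filled` are my own assumptions.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (seeded, so it is repeatable).
rng = np.random.default_rng(0)
columns = list("ABCDEFGHIJ")
size = 1000
df = pd.DataFrame(rng.integers(0, 100, size=(size, len(columns))), columns=columns)
df = df.mask(rng.random(df.shape) < 0.1)  # roughly 10% of the cells become NaN
df.insert(0, "date", pd.date_range("2000-01-01 00:00", periods=size, freq=pd.Timedelta(hours=1)))

# Group keys: month and hour of each timestamp, renamed to avoid duplicate names.
month = df["date"].dt.month.rename("month")
hour = df["date"].dt.hour.rename("hour")

# transform("mean") gives a frame with df's index and the selected columns,
# where every cell holds the mean of its (month, hour) group; NaNs are skipped.
group_means = df.groupby([month, hour])[columns].transform("mean")

# fillna with a DataFrame aligns on index and columns and only touches NaN cells.
filled = df.copy()
filled[columns] = df[columns].fillna(group_means)
```

The means here are computed from the original, unfilled data, which matches what `mean_df` holds in the question.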
