How to replace NaN values with the corresponding month and hour mean value

Question

I'm trying to replace the NaN values in a DataFrame with the mean value for the corresponding month and hour of that DataFrame.

Let's say I have a DataFrame of generic measurements where, at random, some rows and columns have no measurement. The DataFrame's first column is a datetime stamp with hourly frequency.

I've created another DataFrame that holds the mean value for every hour of each month, but I can't replace the NaN values of the first DataFrame with the corresponding mean.

First, let's create a generic DataFrame like the one described:

import pandas as pd
import numpy as np

p = 0.1  # probability that any given cell is NaN
columns = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
size = 1000
df = pd.DataFrame(np.random.randint(0, 100, size=(size, len(columns))), columns=columns)
mask = np.random.choice([True, False], size=df.shape, p=[p, 1 - p])
df = df.mask(mask)
df.insert(0, 'date', pd.date_range('2000-01-01 00:00', periods=size, freq='H'))

Then let's create the DataFrame with the mean values:

mean_df = df.groupby([df.date.dt.month , df.date.dt.hour]).mean()
mean_df.index.set_names(['month' , 'hour'],inplace=True)
mean_df.reset_index(inplace=True)

I can do it for one column, but I couldn't do it for all the columns:

empty = np.where(df['A'].isna())[0].tolist()

for a in empty:
    # Column positions differ between df and mean_df, so look up each one.
    r = df.columns.get_loc('A')
    m = mean_df.columns.get_loc('A')
    row = np.where((mean_df.month == df.iat[a, 0].month) & (mean_df.hour == df.iat[a, 0].hour))[0][0]
    df.iat[a, r] = mean_df.iat[row, m]

Answer 1

Score: 0

I guess the easiest approach is iterating over every column:

for c in columns:
    empty = np.where(df[c].isna())[0].tolist()
    r = df.columns.get_loc(c)       # column position in df
    m = mean_df.columns.get_loc(c)  # column position in mean_df (month and hour come first)

    for a in empty:
        row = np.where((mean_df.month == df.iat[a, 0].month) & (mean_df.hour == df.iat[a, 0].hour))[0][0]
        df.iat[a, r] = mean_df.iat[row, m]
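
As a hedged aside, the per-column loop can probably be avoided altogether. The sketch below is not part of the original answer: it rebuilds a similar DataFrame with a seeded generator, computes the (month, hour) mean for every row with `groupby(...).transform("mean")`, and passes the result to `fillna`. The names `group_means` and `filled` are made up for illustration, and the hourly step is given as a `pd.Timedelta` only to sidestep frequency-alias differences between pandas versions.

```python
import numpy as np
import pandas as pd

# Rebuild a DataFrame like the one in the question (seeded for repeatability).
rng = np.random.default_rng(0)
columns = list("ABCDEFGHIJ")
size = 1000
df = pd.DataFrame(
    rng.integers(0, 100, size=(size, len(columns))).astype(float),
    columns=columns,
)
df = df.mask(rng.random(df.shape) < 0.1)  # roughly 10% of cells become NaN
df.insert(0, "date", pd.date_range("2000-01-01", periods=size, freq=pd.Timedelta(hours=1)))

# Grouping keys: calendar month and hour of day of each row.
keys = [df["date"].dt.month.rename("month"), df["date"].dt.hour.rename("hour")]

# transform("mean") returns a frame shaped like df[columns], where every cell
# holds the mean of its (month, hour) group for that column (NaNs are skipped).
group_means = df.groupby(keys)[columns].transform("mean")

# fillna aligns on index and column labels, so only the NaN cells are replaced.
filled = df.copy()
filled[columns] = df[columns].fillna(group_means)
```

If every value in some (month, hour) group is missing for a column, the group mean is itself NaN and those cells stay NaN, so a fallback (for example the column mean) may still be needed.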

Answer 2

Score: 0

Neither pretty nor fast, but here you go:

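df['month'] = df['date'].dt.month
df['hour'] = df['date'].dt.hour

# Index the means by (month, hour) once, instead of once per row.
means = mean_df.set_index(['month', 'hour'])

def func(x, means):
    # Replace each NaN in row x with the mean for that row's (month, hour).
    group = means.loc[(int(x['month']), int(x['hour']))]
    return pd.Series([group[c] if np.isnan(x[c]) else x[c] for c in x.index], index=x.index)

df = df.set_index('date').apply(lambda x: func(x, means), axis=1).drop(columns=['month', 'hour']).reset_index()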
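
For comparison, here is a rough sketch of a merge-based variant that reuses a lookup table of means, in the spirit of `mean_df` from the question. It is an assumption-laden illustration rather than the answer's method: the data is regenerated with a seeded generator, `mean_df` is built from the numeric columns only, and `keyed`, `merged` and `filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Rebuild a DataFrame like the one in the question (seeded for repeatability).
rng = np.random.default_rng(1)
columns = list("ABCDEFGHIJ")
size = 1000
df = pd.DataFrame(
    rng.integers(0, 100, size=(size, len(columns))).astype(float),
    columns=columns,
)
df = df.mask(rng.random(df.shape) < 0.1)  # roughly 10% of cells become NaN
df.insert(0, "date", pd.date_range("2000-01-01", periods=size, freq=pd.Timedelta(hours=1)))

# Lookup table with one row per (month, hour) and one mean per column.
month = df["date"].dt.month.rename("month")
hour = df["date"].dt.hour.rename("hour")
mean_df = df.groupby([month, hour])[columns].mean().reset_index()

# Attach the matching means to every row; a left merge keeps the row order of df.
keyed = df.assign(month=month, hour=hour)
merged = keyed.merge(mean_df, on=["month", "hour"], how="left", suffixes=("", "_mean"))

# Keep observed values and take the merged mean only where the value is NaN.
# to_numpy() makes the replacement positional, independent of index labels.
filled = df.copy()
for c in columns:
    filled[c] = df[c].where(df[c].notna(), merged[c + "_mean"].to_numpy())
```

The merge does one join for the whole table instead of one lookup per row, which is usually the main cost of the row-wise `apply` above.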

huangapple
  • Published on May 18, 2023, 04:50:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76276107.html