Calculate the median of every n rows in a pandas DataFrame

Question

I have the following df:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "value": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
})

I need to calculate the median of every n rows. Ideally, I would like a function that takes a pd.Series and n as arguments. For n=2, the function should return:

median
15
35
55
75
95

For n=3, it should return:

median
20
50
80
100

In this example with n=3, the last returned value is 100 because only one row is left over. My real dataset has thousands of rows, and I want to set n to 10 or 20. The last value should therefore be the median of whatever rows remain, that is, the final len(df) % n rows.

I included a similar function below for reference, taken from a linked post. It calculates the mean in the manner I described, but I need to tweak it to calculate the median.

def find_mean(col, rows):
    """
    col: pd.Series
    rows: number of rows per group
    """
    if isinstance(col, pd.Series):
        col = col.to_numpy()
    # Number of leftover rows that do not fill a complete group
    mod = col.shape[0] % rows

    if mod != 0:
        exclude = col[-mod:]
        keep = col[: len(col) - mod]
        # Mean of each complete group, then the mean of the leftover rows
        out = keep.reshape(-1, rows).mean(axis=1)
        out = np.hstack((out, exclude.mean()))
    else:
        out = col.reshape(-1, rows).mean(axis=1)
    return out
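
For reference, the tweak described above could look like the following sketch. It keeps the reshape approach of find_mean and swaps the mean for np.median. The name find_median is only an illustration and does not come from the linked post. It assumes a 1-D numeric input and an integer rows.

```python
import numpy as np
import pandas as pd


def find_median(col, rows):
    """
    col: pd.Series or 1-D NumPy array
    rows: number of rows per group (integer)
    Returns a NumPy array with one median per group.
    (find_median is a hypothetical name used for illustration.)
    """
    if isinstance(col, pd.Series):
        col = col.to_numpy()
    # Number of leftover rows that do not fill a complete group
    mod = col.shape[0] % rows
    keep = col[: col.shape[0] - mod]
    # Median of each complete group of `rows` values
    out = np.median(keep.reshape(-1, rows), axis=1)
    if mod != 0:
        # Append the median of the leftover rows as the last value
        out = np.hstack((out, np.median(col[-mod:])))
    return out


values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
print(find_median(values, 2))  # medians: 15, 35, 55, 75, 95
print(find_median(values, 3))  # medians: 20, 50, 80, 100
```

The result is a float array, because np.median averages the two middle values when a group has an even number of rows.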

Answer 1

Score: 3

You can use groupby:

N = 3
df.groupby(np.arange(len(df))//N)['value'].median()

Output:

0     20
1     50
2     80
3    100
Name: value, dtype: int64
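
Since the question asks for a function that takes a pd.Series and n, the groupby one-liner can be wrapped as in the sketch below. The name median_every_n is made up for illustration. The grouping key is built from row positions, so it should not depend on what the Series index looks like. The dtype of the result may differ between pandas versions.

```python
import numpy as np
import pandas as pd


def median_every_n(col: pd.Series, n: int) -> pd.Series:
    """Median of each consecutive block of n rows; the last block may be shorter.

    (median_every_n is a hypothetical helper name, not a pandas function.)
    """
    # Integer division turns positions 0, 1, 2, ... into block labels:
    # for n = 3 the labels are 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, ...
    labels = np.arange(len(col)) // n
    return col.groupby(labels).median()


df = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
print(median_every_n(df["value"], 2).tolist())  # medians: 15, 35, 55, 75, 95
print(median_every_n(df["value"], 3).tolist())  # medians: 20, 50, 80, 100
```

The leftover rows need no special handling here: they simply receive the last block label and form a smaller group of their own.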

huangapple
  • Posted on March 7, 2023 at 22:42:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75663401.html