Calculate median values for every n rows in a pandas DataFrame

Question

I have the following df:

    import pandas as pd

    df = pd.DataFrame({
        "value": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    })

I need to calculate the median of every n rows. Ideally, I would write a function that takes a pd.Series and n as arguments. So, if n=2, the function should return:

    median
    15
    35
    55
    75
    95

If n=3, it should return:

    median
    20
    50
    80
    100

In this example with n=3, the last returned value is 100 because only one row is left over. In my real dataset, however, the df has thousands of rows and I want to set n to 10 or 20. The last value should then be the median of whatever rows remain, i.e. the final len(df) % n rows.

For reference, I have included a similar function below, taken from a linked post. It calculates the mean in the manner I described, but I need to tweak it to calculate the median.

    import numpy as np
    import pandas as pd

    def find_mean(col, rows):
        """
        col: pd.Series
        rows: number of rows
        """
        if isinstance(col, pd.Series):
            col = col.to_numpy()
        # Number of leftover rows that do not fill a complete block
        mod = col.shape[0] % rows
        if mod != 0:
            exclude = col[-mod:]
            keep = col[: len(col) - mod]
            out = keep.reshape((int(keep.shape[0]/rows), int(rows))).mean(1)
            # Append the mean of the leftover rows as the last value
            out = np.hstack((out, exclude.mean()))
        else:
            out = col.reshape((int(col.shape[0]/rows), int(rows))).mean(1)
        return out
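One way the function above might be adapted to medians is sketched here. It keeps the same reshape approach and swaps the `.mean(1)` calls for `np.median`; the name `find_median` is hypothetical and not from the original post, and the sketch assumes `rows` is a positive integer.

```python
import numpy as np
import pandas as pd


def find_median(col, rows):
    """Median of each consecutive block of `rows` values.

    The final block is shorter when len(col) is not a multiple of `rows`.
    """
    if isinstance(col, pd.Series):
        col = col.to_numpy()
    # Number of leftover values that do not fill a complete block
    mod = col.shape[0] % rows
    # Complete blocks: reshape to (n_blocks, rows), then take row-wise medians
    keep = col[: col.shape[0] - mod]
    out = np.median(keep.reshape((keep.shape[0] // rows, rows)), axis=1)
    if mod != 0:
        # Median of the leftover values becomes the last entry
        out = np.hstack((out, np.median(col[-mod:])))
    return out


values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
print(find_median(values, 2))
print(find_median(values, 3))
```

With the sample data, the two calls should reproduce the expected medians listed in the question for n=2 and n=3.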

Answer 1

Score: 3

You can use groupby:

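    import numpy as np
    N = 3
    # Positions 0..len(df)-1 integer-divided by N give one group label per block of N rows
    df.groupby(np.arange(len(df))//N)['value'].median()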

Output:

    0     20
    1     50
    2     80
    3    100
    Name: value, dtype: int64
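Since the question asks for a function that takes a pd.Series and n, the groupby approach could be wrapped as in the sketch below. The function name `median_every_n` and the `"median"` result name are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd


def median_every_n(col: pd.Series, n: int) -> pd.Series:
    """Median of each consecutive block of n rows; the last block may be shorter."""
    # Rows 0..n-1 get label 0, rows n..2n-1 get label 1, and so on
    groups = np.arange(len(col)) // n
    return col.groupby(groups).median().rename("median")


df = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
print(median_every_n(df["value"], 2))
print(median_every_n(df["value"], 3))
```

Because the labels come from row positions rather than the index, this should work regardless of what index the Series carries, and the leftover rows simply form a smaller final group.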

huangapple
  • Posted on March 7, 2023 at 22:42:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75663401.html