Building a new pd.DataFrame with statistics from custom functions

Question

I am trying to create some summary statistics for text data that I am working with, namely the average length of the text columns in my DataFrame.

I am working with two columns: short and long

import pandas as pd

data = {
    'short': ['Fruit', 'Vehicle', 'Animal', 'City'],
    'long': ['An edible object that is usually sweet and grows on trees or plants.', 
             'A mode of transportation that is used to move people or goods from one place to another.', 
             'A living organism that typically feeds on organic matter and has the ability to move.', 
             'A large and populous settlement, usually the seat of government or important cultural institutions.']
}

df = pd.DataFrame(data)

For that I have written a function that calculates that information for a single column:

def mean_chars_in_col(data: pd.DataFrame, 
                      col: str):
    return data[col].str.len().mean().round(2)

This gives me the needed information for a single column:

mean_chars_in_col(df, "short")

## 5.5

How can I extend this so that I get the same information for every column in the DataFrame? Ideally the output would be a new DataFrame with two columns, colname and mean_chars, containing one row per column of the original DataFrame.

Would it also be possible to add further functions, each of which writes its result to its own column in that output?

#expected output, ideally a new dataframe
colname, mean_chars, mean_length
short, 5.5, 1
long, 85, 15.25

Thank you.

Answer 1

Score: 3

You can pass a list of functions to DataFrame.apply, then transpose the result and set the new column names from a list:

def mean_chars_in_col(data: pd.Series):
    return data.str.len().mean()

def mean_length_in_col(data: pd.Series):
    return data.str.split().str.len().mean()


# assign to a new name so the original df stays available for the later examples
out = (df.apply([mean_chars_in_col, mean_length_in_col])
         .T
         .set_axis(['mean_chars', 'mean_length'], axis=1))
print(out)
       mean_chars  mean_length
short         5.5         1.00
long         85.0        14.75
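
To make the two statistics concrete, here is a plain-Python sketch of what each expression computes for the short column. Note that mean_length counts whitespace-separated words, not characters. The variable names are illustrative only.

```python
# Plain-Python equivalent of the two pandas expressions, for the 'short' column.
short = ['Fruit', 'Vehicle', 'Animal', 'City']

# data.str.len().mean(): average number of characters per entry
mean_chars = sum(len(s) for s in short) / len(short)

# data.str.split().str.len().mean(): average number of whitespace-separated words
mean_length = sum(len(s.split()) for s in short) / len(short)

print(mean_chars, mean_length)  # 5.5 1.0
```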

Alternatively, change the function so that it returns a Series containing all the statistics:

def mean_chars_in_col(data: pd.Series):
    a = data.str.len().mean()
    b = data.str.split().str.len().mean()
    
    return pd.Series([a, b], index=['mean_chars','mean_length'])

out = df.apply(mean_chars_in_col).T
print(out)
       mean_chars  mean_length
short         5.5         1.00
long         85.0        14.75

To get the original column names as a regular colname column, add rename_axis and reset_index:

def mean_chars_in_col(data: pd.Series):
    a = data.str.len().mean()
    b = data.str.split().str.len().mean()
    
    return pd.Series([a, b], index=['mean_chars','mean_length'])

out = df.apply(mean_chars_in_col).T.rename_axis('colname').reset_index()
print(out)
  colname  mean_chars  mean_length
0   short         5.5         1.00
1    long        85.0        14.75
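
If more statistics are likely to be added over time, one possible variation is to keep them in a dict that maps the output column name to a function taking a Series. The sketch below is not part of the original answer: the helper name summarize_columns, the extra max_chars statistic, and the shortened sample data are all assumptions made for illustration.

```python
import pandas as pd

def summarize_columns(df, stats):
    """Hypothetical helper: one row per column of df, one column per statistic."""
    columns_data = {"colname": list(df.columns)}
    for name, func in stats.items():
        # apply each statistic to every column, in the original column order
        columns_data[name] = [func(df[col]) for col in df.columns]
    return pd.DataFrame(columns_data)

# shortened sample data, for illustration only
df = pd.DataFrame({
    "short": ["Fruit", "Vehicle"],
    "long": ["An edible object.", "A mode of transportation."],
})

# adding a statistic only requires a new entry in this dict
stats = {
    "mean_chars": lambda s: s.str.len().mean(),
    "mean_length": lambda s: s.str.split().str.len().mean(),
    "max_chars": lambda s: s.str.len().max(),
}

summary = summarize_columns(df, stats)
print(summary)
```

The dict keys become the column names of the result, so no separate set_axis call is needed.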

Answer 2

Score: 1

You can use DataFrame.apply, but to do that you will have to modify your function so that it takes a Series directly as its argument:

data = {
    'short': ['Fruit', 'Vehicle', 'Animal', 'City'],
    'long': ['An edible object that is usually sweet and grows on trees or plants.', 
             'A mode of transportation that is used to move people or goods from one place to another.', 
             'A living organism that typically feeds on organic matter and has the ability to move.', 
             'A large and populous settlement, usually the seat of government or important cultural institutions.']
}

df = pd.DataFrame(data)

# Notice that the function now only takes one argument, 
# which will be a pd.Series

def mean_chars_in_col(col):
    return col.str.len().mean().round(2)

# then use apply to iterate through each column and pass each column into the function

>>> df.apply(mean_chars_in_col)
short     5.5
long     85.0
dtype: float64
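
The result above is a Series indexed by column name. If the two-column DataFrame from the question is wanted, one way to get there (a sketch, not part of the original answer) is to name the index and reset it. The sample data in the long column is shortened here purely for illustration, and the built-in round is used in place of the .round method.

```python
import pandas as pd

# 'long' holds shortened placeholder text, for illustration only
df = pd.DataFrame({
    "short": ["Fruit", "Vehicle", "Animal", "City"],
    "long": ["one two", "three four five", "six", "seven eight nine ten"],
})

def mean_chars_in_col(col):
    # col is a pd.Series holding one column of df
    return round(col.str.len().mean(), 2)

out = (df.apply(mean_chars_in_col)       # Series indexed by column name
         .rename_axis("colname")         # name the index
         .reset_index(name="mean_chars"))  # index becomes a regular column
print(out)
```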





huangapple
  • Posted on 2023-04-19 18:01:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76053201.html