April 19, 2023, 18:01:55


Building a new pd.dataframe with statistics from own functions

Question


I am trying to create some summary statistics for text data that I am working with, namely the average length of the text columns in my DataFrame.

I am working with two columns: short and long.

import pandas as pd
data = {
    'short': ['Fruit', 'Vehicle', 'Animal', 'City'],
    'long': ['An edible object that is usually sweet and grows on trees or plants.',
             'A mode of transportation that is used to move people or goods from one place to another.',
             'A living organism that typically feeds on organic matter and has the ability to move.',
             'A large and populous settlement, usually the seat of government or important cultural institutions.']
}
df = pd.DataFrame(data)

For that I have written a function that calculates that information for a single column:

def mean_chars_in_col(data: pd.DataFrame, 
                      col: str):
    return data[col].str.len().mean().round(2)

This gives me the needed information for a single column:

mean_chars_in_col(df, "short")
## 5.5

How can I extend this so that I get the same information for every column in the DataFrame? Ideally the output is a new DataFrame with two columns, colname and mean_chars, where each column of the original DataFrame becomes a row in colname and its result the corresponding row in mean_chars.

Would it also be possible to add further functions later, each writing its output to its own column?

#expected output, ideally a new dataframe
colname, mean_chars, mean_length
short, 5.5, 1
long, 85, 15.25

Thank you.

Answer 1

Score: 3


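You can pass a list of functions to DataFrame.apply, then transpose the result and set the new column names from a list. The result is assigned to out so that the original df stays available for the alternatives below:

def mean_chars_in_col(data: pd.Series):
    # average number of characters per entry
    return data.str.len().mean()
def mean_length_in_col(data: pd.Series):
    # average number of whitespace-separated words per entry
    return data.str.split().str.len().mean()
out = (df.apply([mean_chars_in_col, mean_length_in_col])
         .T
         .set_axis(['mean_chars', 'mean_length'], axis=1))
print(out)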
       mean_chars  mean_length
short         5.5         1.00
long         85.0        14.75

Or change the function so that it returns a Series holding both statistics:

def mean_chars_in_col(data: pd.Series):
    a = data.str.len().mean()
    b = data.str.split().str.len().mean()

    return pd.Series([a, b], index=['mean_chars', 'mean_length'])
out = df.apply(mean_chars_in_col).T
print(out)
       mean_chars  mean_length
short         5.5         1.00
long         85.0        14.75

To turn the index into a regular colname column, chain rename_axis and reset_index:

def mean_chars_in_col(data: pd.Series):
    a = data.str.len().mean()
    b = data.str.split().str.len().mean()

    return pd.Series([a, b], index=['mean_chars', 'mean_length'])
out = df.apply(mean_chars_in_col).T.rename_axis('colname').reset_index()
print(out)
  colname  mean_chars  mean_length
0   short         5.5         1.00
1    long        85.0        14.75
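
On the follow-up about adding more functions later: one possible pattern is to keep the functions in a dict keyed by the output column name and build one row per source column. The sketch below only illustrates that idea and is not taken from the answer above. The names STATS, mean_words, max_chars and summarize are made up for the example, the demo data is shortened, and every column is assumed to hold strings.

```python
import pandas as pd


def mean_chars(col: pd.Series) -> float:
    """Average number of characters per entry."""
    return float(col.str.len().mean())


def mean_words(col: pd.Series) -> float:
    """Average number of whitespace-separated words per entry."""
    return float(col.str.split().str.len().mean())


def max_chars(col: pd.Series) -> int:
    """Length of the longest entry (an extra statistic, for illustration)."""
    return int(col.str.len().max())


# Hypothetical registry: output column name -> function taking a Series.
# Adding a statistic means adding one entry here.
STATS = {
    "mean_chars": mean_chars,
    "mean_length": mean_words,
    "max_chars": max_chars,
}


def summarize(df: pd.DataFrame, stats=STATS) -> pd.DataFrame:
    """Return one row per column of df and one column per statistic."""
    rows = [
        {"colname": name, **{stat: fn(df[name]) for stat, fn in stats.items()}}
        for name in df.columns
    ]
    return pd.DataFrame(rows)


demo = pd.DataFrame({
    "short": ["Fruit", "City"],
    "long": ["An edible object", "A large settlement"],
})
summary = summarize(demo)
print(summary)
```

Building the rows explicitly keeps the row order equal to the column order of the input and avoids the transpose step.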

Answer 2

Score: 1


You can use DataFrame.apply, but you will have to modify your function so that it takes a Series directly as its argument:

data = {
    'short': ['Fruit', 'Vehicle', 'Animal', 'City'],
    'long': ['An edible object that is usually sweet and grows on trees or plants.',
             'A mode of transportation that is used to move people or goods from one place to another.',
             'A living organism that typically feeds on organic matter and has the ability to move.',
             'A large and populous settlement, usually the seat of government or important cultural institutions.']
}
df = pd.DataFrame(data)
# Notice that the function now only takes one argument, 
# which will be a pd.Series
def mean_chars_in_col(col):
    return col.str.len().mean().round(2)
# then use apply to iterate over the columns and pass each one to the function
>>> df.apply(mean_chars_in_col)
short     5.5
long     85.0
dtype: float64
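
Both answers assume that every column contains strings; on a numeric column the .str accessor raises an AttributeError. A possible guard, sketched here under that assumption, is to apply the function only to string-like columns. text_columns is a hypothetical helper and the n column is invented for the demonstration; the check itself uses pandas.api.types.is_string_dtype.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def mean_chars_in_col(col: pd.Series) -> float:
    """Average number of characters per entry, rounded to two decimals."""
    return round(float(col.str.len().mean()), 2)


def text_columns(df: pd.DataFrame) -> list:
    """Names of the columns that pandas reports as string-like."""
    return [name for name in df.columns if is_string_dtype(df[name])]


demo = pd.DataFrame({
    "short": ["Fruit", "Vehicle", "Animal", "City"],
    "n": [1, 2, 3, 4],  # numeric column that should be skipped
})

# Select the text columns first, then apply the function column by column.
result = demo[text_columns(demo)].apply(mean_chars_in_col)
print(result)
```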
