2023-05-25 22:07:47

How to create a new dataframe with one row per unique value and summarize the results?

Question

I have a dataframe df that has 3 columns of unique data; the rest are result columns with no headers. The ID is always tied to the item and cost. For example, Apple always has an ID of 12 and a cost of 5.

print(df)
    ID   Item   cost   3   4  
0   12   Apple   5     Y   N
1   12   Apple   5     N   N
2   12   Apple   5     Y   N
3   12   Apple   5     Y   Y
4   15   Orange  6     N   Y
5   15   Orange  6     N   N
6   15   Orange  6     N   Y
7   15   Orange  6     N   Y
8   21   Lemon   6     Y   NaN
9   51   Grape   6     Y   N
10  21   Lemon   6     Y   NaN

I want to create a similar dataframe with exactly one row per unique Item. Each remaining column should summarize the Y and N values for that item. If at least one Y appears in column 3 for Apple, the result is Y. Only if all values are N is the result N. If there are only NaN values, the result is NaN.

Here is what I expect the resulting dataframe, df2, to look like:

print(df2)
    ID   Item   cost   3   4  
0   12   Apple   5     Y   Y
1   15   Orange  6     N   Y
2   21   Lemon   6     Y   NaN
3   51   Grape   6     Y   N

I used this code to try to create df2, but it raises KeyError: False on the YCount line.


df2 = df.drop_duplicates(subset=['ID', 'Item'], keep='first')
# copy df to df2 with ID, Item, cost duplicates removed
df2[df2.columns[3:]] = ''
# clear column 3 to 4
for i in df2["Item"].unique():
    for x in range(3, len(df2.columns)):
        YCount = (df["Item" == i].df.iloc[:, x] == 'Y').sum()    # count number of Y corresponding to the item
        NCount = (df["Item" == i].df.iloc[:, x] == 'N').sum()    # count number of N corresponding to the item
        if YCount > 0:
            df2.iloc[:, x] = "Y"                                # if more than zero Y appears, put Y
        elif YCount + NCount == 0:
            df2.iloc[:, x] = ""                                 # if total Y and N is 0, put NaN
        elif YCount == 0 and NCount > 0:
            df2.iloc[:, x] = "N"                                # if Y=0 and N more than 0, put N

Answer 1

Score: 0


What you want to do is very easy with booleans, so just replace 'Y' with True and 'N' with False, then perform a groupby.max:

(df
 .replace({'Y': True, 'N': False})
 .groupby(['ID', 'Item', 'cost'], as_index=False)
 .max()
 .replace({True: 'Y', False: 'N'})
)

Note that 'Y' is lexicographically sorted after 'N', so you could also use:

out = df.groupby(['ID', 'Item', 'cost'], as_index=False).max()

but this only works with these particular values. This will also be less robust if you have a mix of 'Y'/'N'/NaN.
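For the mixed 'Y'/'N'/NaN case, one possible alternative is to spell the rule out as a custom aggregation function instead of relying on max. This is only a sketch, not part of the original answer: the helper name summarize is hypothetical, the sample frame is rebuilt from the question, and the unnamed result columns are assumed to be labelled with the strings "3" and "4".

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; the column labels "3" and "4"
# are an assumption, since the original result columns have no headers.
df = pd.DataFrame({
    "ID":   [12, 12, 12, 12, 15, 15, 15, 15, 21, 51, 21],
    "Item": ["Apple"] * 4 + ["Orange"] * 4 + ["Lemon", "Grape", "Lemon"],
    "cost": [5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6],
    "3":    ["Y", "N", "Y", "Y", "N", "N", "N", "N", "Y", "Y", "Y"],
    "4":    ["N", "N", "N", "Y", "Y", "N", "Y", "Y", np.nan, "N", np.nan],
})


def summarize(values: pd.Series):
    """Hypothetical helper: 'Y' if any 'Y', else 'N' if any 'N', else NaN."""
    if values.eq("Y").any():
        return "Y"
    if values.eq("N").any():
        return "N"
    return np.nan


# One row per (ID, Item, cost); every other column is reduced by summarize.
out = df.groupby(["ID", "Item", "cost"], as_index=False).agg(summarize)
print(out)
```

On the sample data this should give the same four rows as the approaches above. It is slower than the built-in max because the function runs once per group and column, but the rule for each case is explicit.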

Output:

   ID    Item  cost  3    4
0  12   Apple     5  Y    Y
1  15  Orange     6  N    Y
2  21   Lemon     6  Y  NaN
3  51   Grape     6  Y    N
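
As for the KeyError: False in the question, a likely cause is that "Item" == i compares two plain strings and evaluates to the bool False, so df["Item" == i] becomes df[False], a lookup for a column named False. The sketch below reproduces that on a small made-up frame and shows the mask form that was probably intended; the data and the names mask, y_count and n_count are illustrative only.

```python
import pandas as pd

# Small made-up frame, just enough to reproduce the error.
df = pd.DataFrame({
    "Item": ["Apple", "Apple", "Orange"],
    "3": ["Y", "N", "N"],
})

i = "Apple"

# "Item" == i is a string comparison that yields False, so this line
# asks pandas for a column labelled False and raises KeyError.
error_message = None
try:
    df["Item" == i]
except KeyError as err:
    error_message = repr(err)
print(error_message)

# Comparing the column itself builds a boolean mask that selects rows.
mask = df["Item"] == i
y_count = (df.loc[mask, "3"] == "Y").sum()
n_count = (df.loc[mask, "3"] == "N").sum()
print(y_count, n_count)
```

Two further points about the original loop, in case it is kept: a DataFrame has no .df attribute, and df2.iloc[:, x] = "Y" assigns to the whole column rather than to the row of the current item.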
