How to create a new dataframe with unique value on lesser rows and summarize result?

Question

I have a dataframe df with three identifying columns (ID, Item, cost); the rest are result columns with no headers. The ID is always tied to the same item and cost: for example, Apple always has ID 12 and cost 5.

print(df)
    ID   Item   cost   3   4  
0   12   Apple   5     Y   N
1   12   Apple   5     N   N
2   12   Apple   5     Y   N
3   12   Apple   5     Y   Y
4   15   Orange  6     N   Y
5   15   Orange  6     N   N
6   15   Orange  6     N   Y
7   15   Orange  6     N   Y
8   21   Lemon   6     Y   NaN
9   51   Grape   6     Y   N
10  21   Lemon   6     Y   NaN

I want to create a similar dataframe with one row per unique Item. Each remaining column should collapse that item's Y and N values into a single value: if at least one Y appears in column 3 for Apple, the result is Y. Only if all values are N is the result N. If there are only NaN values, the result is NaN.

Here is what I expect the resulting dataframe, df2, to look like:

print(df2)
    ID   Item   cost   3   4  
0   12   Apple   5     Y   Y
1   15   Orange  6     N   Y
2   21   Lemon   6     Y   NaN
3   51   Grape   6     Y   N

I tried to create df2 with the code below, but it raises KeyError: False on the YCount line.


df2 = df.drop_duplicates(subset=['ID', 'Item'], keep='first')   
#copy df to df2 with ID, Item, cost duplicates removed
df2[df2.columns[3:]] = ''
#clear column 3 to 4

for i in df2["Item"].unique():
    for x in range(3, len(df2.columns)):
        YCount =(df["Item" == i].df.iloc[:,x] == 'Y').sum()    #count number of Y corresponding to the item
        NCount =(df["Item" == i].df.iloc[:,x] == 'N').sum()    #count number of N corresponding to the item
        if YCount > 0:
            df2.iloc[:,x] = "Y"                                #if more than zero Y appears, put Y
        elif YCount + NCount == 0:
            df2.iloc[:,x] = ""                                 #if total Y and N is 0, put NaN
        elif YCount == 0 and NCount > 0:
            df2.iloc[:,x] = "N"                                #if Y=0 and N more than 0, put N

Answer 1

Score: 0

What you want to do is very easy with booleans: replace 'Y' with True and 'N' with False, then perform a groupby.max:

(df
 .replace({'Y': True, 'N': False})
 .groupby(['ID', 'Item', 'cost'], as_index=False)
 .max()
 .replace({True: 'Y', False: 'N'})
)

Note that 'Y' is lexicographically sorted after 'N', so you could also use:

out = df.groupby(['ID', 'Item', 'cost'], as_index=False).max()

but this only works with these particular values. This will also be less robust if you have a mix of 'Y'/'N'/NaN.
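
For data that does mix 'Y', 'N', and NaN, one option is to spell the rule out in an explicit aggregation function. The following is a minimal sketch, not part of the original answer: the helper name collapse_yn is hypothetical, the sample frame is rebuilt from the question, and the result columns are assumed to be labelled with the strings '3' and '4'.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
# Assumption: the result columns are labelled with the strings "3" and "4".
df = pd.DataFrame({
    "ID":   [12, 12, 12, 12, 15, 15, 15, 15, 21, 51, 21],
    "Item": ["Apple"] * 4 + ["Orange"] * 4 + ["Lemon", "Grape", "Lemon"],
    "cost": [5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6],
    "3":    ["Y", "N", "Y", "Y", "N", "N", "N", "N", "Y", "Y", "Y"],
    "4":    ["N", "N", "N", "Y", "Y", "N", "Y", "Y", np.nan, "N", np.nan],
})


def collapse_yn(values):
    """Hypothetical helper: NaN if the group has no values, 'Y' if any 'Y', else 'N'."""
    present = values.dropna()
    if present.empty:
        return np.nan
    return "Y" if (present == "Y").any() else "N"


# One row per (ID, Item, cost); every other column is collapsed by the helper.
out = df.groupby(["ID", "Item", "cost"], as_index=False).agg(collapse_yn)
print(out)
```

Because the rule is written out explicitly, it does not depend on how 'Y', 'N', True, False, or NaN happen to compare with each other.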

Output:

   ID    Item  cost  3    4
0  12   Apple     5  Y    Y
1  15  Orange     6  N    Y
2  21   Lemon     6  Y  NaN
3  51   Grape     6  Y    N
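
As for the KeyError: False in the question's loop: the expression "Item" == i compares two Python strings and evaluates to the plain boolean False before pandas ever sees it, so df["Item" == i] becomes df[False], a lookup of a column named False. The comparison has to be made against the column itself, df["Item"] == i, which yields a boolean mask. A small sketch of the difference, using a cut-down frame whose values are made up for illustration:

```python
import pandas as pd

# Cut-down illustrative frame (values are made up, not the question's full data).
df = pd.DataFrame({"Item": ["Apple", "Apple", "Orange"], "3": ["Y", "N", "N"]})
i = "Apple"

# Comparing two plain strings gives a Python bool, not a mask.
key = ("Item" == i)  # False
try:
    df[key]  # looks up a column labelled False
    error = None
except KeyError as exc:
    error = exc  # this is the KeyError: False from the question

# Comparing the column gives a boolean mask that can select rows.
mask = df["Item"] == i
y_count = (df.loc[mask, "3"] == "Y").sum()
n_count = (df.loc[mask, "3"] == "N").sum()
print(repr(error), y_count, n_count)
```

The loop in the question also assigns to df2.iloc[:, x], which overwrites the whole column on every iteration rather than the single row for item i. The groupby approach avoids that bookkeeping altogether.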

huangapple
  • Published on May 25, 2023, 22:07:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76333214.html