有没有一种方法可以按唯一用户ID对我的输出进行组织?

huangapple go评论82阅读模式
英文:

Is there a way to get my output organized by unique user ID?

问题

user_id = [1, 2, 1, 1, 5, 3, 3, 2, 4, 6]
department = [produce, pets, frozen, pets, pets, meat, other, pets, snacks, snacks]
order_id = [1, 1, 2, 3, 1, 1, 2, 2, 1, 1]
income = [72000, 61888, 72000, 72000, 42867, 85629, 85629, 61888, 57211, 53665]

示例中,我希望通过查找“department”列中指示“宠物”部分的条目来创建宠物主人的配置文件。问题在于,当我计算“department”列中“宠物”条目的数量时,我得到了4。实际情况是只有3个不同的用户购买了宠物用品,而不是4。为了创建准确的客户配置文件,我需要知道有多少个不同的用户在不同的部门购物,而数据集中的计数实例。如果这不清楚,我很抱歉,这是我寻求帮助的第一篇帖子!

我从这里开始:

df.loc[df['department'] == 'pets', 'pet_owner'] = 'Yes'
df.loc[df['department'] != 'pets', 'pet_owner'] = 'No'

然后我意识到问题在于只查看包括宠物用品的订单数量,而不是包括宠物用品的唯一客户订单数量。

然后我尝试了另一个用户推荐的方法:

owns_pets = list(df[df["department"] == 'pets']['user_id'].unique())
no_pets = list(df[df["department"] != 'pets']['user_id'].unique()) 
owns_pets.value_counts(dropna = False)

结果是:AttributeError: 'list' object has no attribute 'value_counts'。

所以我再次改变了策略,并尝试了以下方法:

df.groupby('user_id').agg({'pet_owner' : 'count'})

结果是:user_id pet_owner

1 59
10 143
100 27
1000 103
10000 1092
... ...
99995 50
99996 128
99997 36

显然,计数返回了每个实例,而不仅仅是唯一的user_id。有人能帮助我弄清如何以我需要的方式获取信息吗?TIA

英文:

I have a massive dataset (practice Instacart data) I'm working on in Python pandas. I'm trying to create customer profiles, but have just realized that the set includes over 32 million rows(which are unique orders), but ONLY 206,000 unique customers.

user_id = [1, 2, 1, 1, 5, 3, 3, 2, 4, 6]
department = [produce, pets, frozen, pets, pets, meat, other, pets, snacks, snacks]
order_id = [1, 1, 2, 3, 1, 1, 2, 2, 1, 1]
income = [72000, 61888, 72000, 72000, 42867, 85629, 85629, 61888, 57211, 53665]

     user_id department order_id income
0    1       produce    1        72000
1    2       pets       1        61888
2    1       frozen     2        72000
3    1       pets       3        72000 
4    5       pets       1        42867
5    3       meat       1        85629
6    3       other      2        85629
7    2       pets       2        61888
8    4       snacks     1        57211
9    6       snacks     1        53665

For example, I'd like to create a profile for pet owners by looking for entries that indicate 'pet' in the department column. The issue is that when I count the number of 'pet' entries in the 'department' column, I get 4. The reality is that there are only 3 different users that bought things for pets, not 4.

In order to create an accurate customer profile, I need to be able to know how many unique users shopped in different departments whereas the dataset as it is counts instances.
My apologies if this is not clear, it's my first post for help!

I started with this:

df.loc[df['department'] == 'pets', 'pet_owner'] = 'Yes'
df.loc[df['department'] != 'pets', 'pet_owner'] = 'No'

Then I realized the analysis issue with just looking at how many ORDERS included pet items, rather than how many unique customers made an order including pet items.

Then I tried this, which was recommended by another user:

owns_pets = list(df[df["department"] == 'pets']['user_id'].unique())

no_pets = list(df[df["department"] != 'pets']['user_id'].unique())    

owns_pets.value_counts(dropna = False)

Out: AttributeError: 'list' object has no attribute 'value_counts'

So I changed tactics again and tried the following:

df.groupby('user_id').agg({'pet_owner' : 'count'})

Out: user_id  pet_owner
     	
      1	           59
      10	       143
      100	       27
      1000	       103
      10000	       1092
      ...	       ...
      99995	       50
      99996	       128
      99997	       36  

Clearly the count is returning each instance rather than just the result 'per' unique user_id.

Can anyone please help me figure out how to get the info the way in which I need? TIA

答案1

得分: 1

以下是您要翻译的内容的翻译部分:

"虽然很难确定您确切的需求,因为您既没有提供任何示例数据,也没有提供所需输出的良好示例。以下是一种选择涉及宠物部门的唯一Cust_ID的方法。

给定以下形式的DataFrame:

	Cust_ID	Purch_Date	Dept
0	4817	2022-05-26	Pets
1	3013	2022-01-12	Pets
2	3013	2022-12-22	Hardware
3	4550	2022-04-21	Pets
4	4817	2022-12-26	Hardware

您可以按如下方式获取唯一cust_id的列表:

list(df[df["Dept"] == 'Pets']['Cust_ID'].unique())  

这将生成仅满足Dept包含'Pets'条件的cust_ids列表,如下所示:

[4817, 3013, 4550]

如果您只需要唯一用户的数量,只需将答案视为:

len(list(df[df["Dept"] == 'Pets']['Cust_ID'].unique()))

在我的示例中,这将产生3。"

英文:

While it is difficult to be sure of exactly what you want, since you haven't provided any sample data nor provided a good example of the desired output. Here is an approach to select the unique Cust_ID's that have made purchases involving the Pet department.

Given a dF of the following form:

	Cust_ID	Purch_Date	Dept
0	4817	2022-05-26	Pets
1	3013	2022-01-12	Pets
2	3013	2022-12-22	Hardware
3	4550	2022-04-21	Pets
4	4817	2022-12-26	Hardware

You can get a list of the unique cust_id's as follows:

list(df[df["Dept"] == 'Pets']['Cust_ID'].unique())  

This will produce a list of just the cust_ids satisfying the condition that Dept contains 'Pets' as shown below:]

[4817, 3013, 4550]

If all you need is the number of unique users, just take your answer as:

len(list(df[df["Dept"] == 'Pets']['Cust_ID'].unique()))

Which in my sample case yields 3

答案2

得分: 0

I apologize for being unclear with my question. I did end up getting it answered on another forum though and here is the solution:

df10exuser = df.drop_duplicates(subset=['user_id'])
英文:

I apologize for being unclear with my question. I did end up getting it answered on another forum though and here is the solution:

df10exuser = df.drop_duplicates(subset = ['user_id'])

huangapple
  • 本文由 发表于 2023年3月21日 02:27:58
  • 转载请务必保留本文链接:https://go.coder-hub.com/75794019.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定