March 21, 2023, 02:27:58

Is there a way to get my output organized by unique user ID?

Question

I have a massive dataset (practice Instacart data) that I'm working on in Python with pandas. I'm trying to create customer profiles, but I have just realized that the set includes over 32 million rows (each a unique order) but only about 206,000 unique customers.

user_id = [1, 2, 1, 1, 5, 3, 3, 2, 4, 6]
department = ['produce', 'pets', 'frozen', 'pets', 'pets', 'meat', 'other', 'pets', 'snacks', 'snacks']
order_id = [1, 1, 2, 3, 1, 1, 2, 2, 1, 1]
income = [72000, 61888, 72000, 72000, 42867, 85629, 85629, 61888, 57211, 53665]
     user_id department order_id income
0    1       produce    1        72000
1    2       pets       1        61888
2    1       frozen     2        72000
3    1       pets       3        72000 
4    5       pets       1        42867
5    3       meat       1        85629
6    3       other      2        85629
7    2       pets       2        61888
8    4       snacks     1        57211
9    6       snacks     1        53665

For example, I'd like to create a profile for pet owners by looking for entries with 'pets' in the 'department' column. The issue is that when I count the number of 'pets' entries in the 'department' column, I get 4. In reality, only 3 different users bought things for pets, not 4.

To create an accurate customer profile, I need to know how many unique users shopped in each department, whereas the dataset as it stands counts every instance.
My apologies if this is not clear; it's my first post asking for help!

I started with this:

df.loc[df['department'] == 'pets', 'pet_owner'] = 'Yes'
df.loc[df['department'] != 'pets', 'pet_owner'] = 'No'

Then I realized the problem with this analysis: it looks at how many ORDERS included pet items, rather than how many unique customers placed an order that included pet items.

Then I tried this, which was recommended by another user:

owns_pets = list(df[df["department"] == 'pets']['user_id'].unique())
no_pets = list(df[df["department"] != 'pets']['user_id'].unique())
owns_pets.value_counts(dropna = False)
Out: AttributeError: 'list' object has no attribute 'value_counts'

So I changed tactics again and tried the following:

df.groupby('user_id').agg({'pet_owner' : 'count'})
Out: user_id  pet_owner

     1        59
     10       143
     100      27
     1000     103
     10000    1092
     ...      ...
     99995    50
     99996    128
     99997    36

Clearly the count is returning each instance rather than just the result 'per' unique user_id.

Can anyone please help me figure out how to get the info the way in which I need? TIA

Answer 1

Score: 1

It is difficult to be sure exactly what you want, since you haven't provided sample data or a good example of the desired output, but here is an approach to select the unique Cust_ID values that have made purchases in the Pets department.

Given a DataFrame of the following form:

	Cust_ID	Purch_Date	Dept
0	4817	2022-05-26	Pets
1	3013	2022-01-12	Pets
2	3013	2022-12-22	Hardware
3	4550	2022-04-21	Pets
4	4817	2022-12-26	Hardware

You can get a list of the unique Cust_ID values as follows:

list(df[df["Dept"] == 'Pets']['Cust_ID'].unique())

This produces a list of just the Cust_ID values for rows where Dept equals 'Pets', as shown below:

[4817, 3013, 4550]

If all you need is the number of unique users, just take the length of that list:

len(list(df[df["Dept"] == 'Pets']['Cust_ID'].unique()))

In my sample case this yields 3.
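
The same idea extends to every department at once. The following is a minimal sketch, assuming the column names and sample rows shown in the question (user_id, department, order_id, income); the variable names rows_per_dept, users_per_dept and pet_buyers are made up for illustration. It contrasts counting rows with counting distinct users via nunique().

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "user_id":    [1, 2, 1, 1, 5, 3, 3, 2, 4, 6],
    "department": ["produce", "pets", "frozen", "pets", "pets",
                   "meat", "other", "pets", "snacks", "snacks"],
    "order_id":   [1, 1, 2, 3, 1, 1, 2, 2, 1, 1],
    "income":     [72000, 61888, 72000, 72000, 42867,
                   85629, 85629, 61888, 57211, 53665],
})

# Rows per department: every instance is counted, so 'pets' gives 4.
rows_per_dept = df["department"].value_counts()

# Distinct customers per department: each user_id is counted once,
# so 'pets' gives 3 (users 1, 2 and 5).
users_per_dept = df.groupby("department")["user_id"].nunique()

# Distinct pet buyers only, without building an intermediate list.
pet_buyers = df.loc[df["department"] == "pets", "user_id"].nunique()

print(users_per_dept)
print("pet buyers:", pet_buyers)
```

Calling nunique() directly on the filtered column should give the same number as len(list(...unique())) while avoiding the intermediate list.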

Answer 2

Score: 0

I apologize for being unclear with my question. I did end up getting it answered on another forum though and here is the solution:

df10exuser = df.drop_duplicates(subset=['user_id'])
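
One caveat worth checking: drop_duplicates keeps the first row per user_id by default, so a row-level 'pet_owner' flag like the one in the question could end up as 'No' for a user whose first row happens to be in another department. The sketch below is one possible way around that, assuming the question's sample columns; the names pet_users, customers and owner_counts are hypothetical. It sets the flag per customer before dropping duplicates.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "user_id":    [1, 2, 1, 1, 5, 3, 3, 2, 4, 6],
    "department": ["produce", "pets", "frozen", "pets", "pets",
                   "meat", "other", "pets", "snacks", "snacks"],
    "order_id":   [1, 1, 2, 3, 1, 1, 2, 2, 1, 1],
    "income":     [72000, 61888, 72000, 72000, 42867,
                   85629, 85629, 61888, 57211, 53665],
})

# Users with at least one row in the 'pets' department.
pet_users = df.loc[df["department"] == "pets", "user_id"].unique()

# Customer-level flag: identical on every row belonging to the same user.
df["pet_owner"] = df["user_id"].isin(pet_users).map({True: "Yes", False: "No"})

# One row per customer; since the flag is constant per user,
# it no longer matters which duplicate row is kept.
customers = df.drop_duplicates(subset=["user_id"])[["user_id", "income", "pet_owner"]]

# Now value_counts reflects customers rather than order rows.
owner_counts = customers["pet_owner"].value_counts()
print(customers)
print(owner_counts)
```

On the sample data this leaves six customers, three of them flagged as pet owners (users 1, 2 and 5).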
