英文:
Pandas: combination with highest coverage
问题
在超市里,我选择了30种产品,我们想要进行分析。我想要看到其中的12种产品在特定日期(不涉及时间)内覆盖了最多的客户。
这意味着我有30!/(12!(30-12)!) = 86493225 种产品组合。
我的pandas客户购买数据框:
客户 | 产品 |
---|---|
A | 香蕉 |
B | 苹果 |
B | 香蕉 |
C | 水 |
... |
现在我可以迭代查看具有最多客户数量的组合,首先使用itertools创建它们:
comb = set(itertools.combinations([香蕉,...], 12))
d = {}
for i in comb:
d[i] = df[df.product.isin(i)].Client.nunique()
但这将花费大量的时间。
你们有没有看到更好的方法来计算这个?
请注意,我不想找到12种最常见的产品的组合,而是能够产生最多客户的组合(可能相同,但不一定,因为2种不常见的产品如果两个组不重叠,可能会产生更多客户)。
有什么想法吗?
谢谢
英文:
In a supermarket I selected 30 products on which we want to run an analysis. I want to see which 12 of them give me the widest coverage of clients (on a specific date, no time involved).
This means that I have 30!/(12!(30-12)!) = 86493225 combinations of products
My pandas dataframe of clients purchases:
Client | Product |
---|---|
A | Banana |
B | Apple |
B | Banana |
C | Water |
... |
now I could iterate to see the combination of highest client count, creating them all first with itertools
comb = set(itertools.combinations([banana,...], 12))
d = {}
for i in comb:
d[i] = df[df.product.isin(i)].Client.nunique()
but this will take a spectacular amount of time.
do you guys see any better way to count this out?
please note that I do not want to find the combination of 12 most common products, but the combination that will yield the most clients (possibly the same but not necessarily, as 2 products not individually common may yield more clients if the 2 groups don't overlap much).
any thoughts?
thank you
答案1
得分: 1
这是最大覆盖问题,它是NP难题。该链接展示了一种贪婪算法,它提供了一个近似解决方案:在每一步中,添加具有最多新客户的产品。
英文:
This is the maximum coverage problem and it is NP-hard. That link shows a greedy algorithm which provides an approximate solution: at each step add the product with the most new clients.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论