2023-06-29 03:23:05

Filtering DataFrame based on specific conditions in Python

Question

I have a DataFrame with the following columns: INVOICE_DATE, COUNTRY, CUSTOMER_ID, INVOICE_ID, DESCRIPTION, USIM, and DEMANDQTY. I want to filter the DataFrame based on specific conditions.

The condition is that if the DESCRIPTION column contains the words "kids" or "baby", I want to include all the values from that INVOICE_ID in the filtered DataFrame. In other words, at least one item in the transaction should belong to the kids or baby category for the entire transaction to be included.

I tried using the str.contains() method in combination with a regular expression pattern, but I'm having trouble getting the desired results.

Here's my code:

import pandas as pd

# Assuming the DataFrame is named 'df'

# Filter the DataFrame based on the condition
filtered_df = df[df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)]

# Print the filtered DataFrame
filtered_df

However, this code does not provide the expected results. It filters the data frame based on individual rows rather than considering the entire transaction.

The test data is below:

import pandas as pd
import random
import string
import numpy as np

random.seed(42)
np.random.seed(42)

num_transactions = 100
max_items_per_transaction = 6

# Generate a list of possible items
possible_items = [
    "Kids T-shirt", "Baby Onesie", "Kids Socks",
    "Men's Shirt", "Women's Dress", "Kids Pants",
    "Baby Hat", "Women's Shoes", "Men's Pants",
    "Kids Jacket", "Baby Bib", "Men's Hat",
    "Women's Skirt", "Kids Shoes", "Baby Romper",
    "Men's Sweater", "Kids Gloves", "Baby Blanket"
]

# Create the DataFrame
rows = []

for i in range(num_transactions):
    num_items = random.randint(1, max_items_per_transaction)
    items = random.sample(possible_items, num_items)
    invoice_dates = pd.date_range(start='2022-01-01', periods=num_items, freq='D')
    countries = random.choices(['USA', 'Canada', 'UK'], k=num_items)
    customer_id = i + 1
    invoice_id = 1001 + i

    for j in range(num_items):
        item = items[j]
        usim = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))  # Generate a random 6-character USIM value
        demand_qty = random.randint(1, 10)

        row = {
            'INVOICE_DATE': invoice_dates[j],
            'COUNTRY': countries[j],
            'CUSTOMER_ID': customer_id,
            'INVOICE_ID': invoice_id,
            'DESCRIPTION': item,
            'USIM': usim,
            'DEMANDQTY': demand_qty
        }
        rows.append(row)

df = pd.DataFrame(rows)

# Print the DataFrame
df

Can anyone please guide me on how to properly filter the DataFrame based on the described condition? I would greatly appreciate any help or suggestions. Thank you!

Answer 1

Score: 1

Suppose the following dataframe:

>>> df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
3     another         456
4         one         456

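You want to keep every row with INVOICE_ID=123 because 'kids' appears in the description of row 0. Build a row-level boolean mask, then use groupby(...).transform('max') to broadcast it, so that every row of an invoice is True when at least one of its rows matches:

m = df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)
filtered_df = df[m.groupby(df['INVOICE_ID']).transform('max')]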

Output:

>>> filtered_df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
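
As a rough illustration of the same idea on the question's column names, the sketch below compares the groupby/transform approach with an alternative that first collects the matching invoice IDs and then selects rows with isin. The five-row frame is made up for the example and is not the question's generated test data; transform("any") is used in place of transform('max') here, which should behave the same way on a boolean mask.

```python
import pandas as pd

# Hypothetical sample using the question's column names (not the real data).
df = pd.DataFrame({
    "INVOICE_ID": [1001, 1001, 1002, 1003, 1003],
    "DESCRIPTION": ["Kids Socks", "Men's Hat", "Women's Dress",
                    "Baby Bib", "Men's Pants"],
})

# Row-level mask: True where the description mentions kids or baby.
row_match = df["DESCRIPTION"].str.contains("kids|baby", case=False, regex=True)

# Approach from the answer: broadcast "any match in this invoice" to every row.
invoice_match = row_match.groupby(df["INVOICE_ID"]).transform("any")
by_transform = df[invoice_match]

# Alternative: collect the matching invoice IDs, then keep all of their rows.
matching_ids = df.loc[row_match, "INVOICE_ID"].unique()
by_isin = df[df["INVOICE_ID"].isin(matching_ids)]

print(by_isin)
```

Both selections keep invoices 1001 and 1003 in full and drop 1002. If DESCRIPTION can contain missing values, passing na=False to str.contains is probably needed so the mask stays strictly boolean.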
