Filtering a DataFrame based on specific conditions in Python

Question

I have a DataFrame with the following columns: INVOICE_DATE, COUNTRY, CUSTOMER_ID, INVOICE_ID, DESCRIPTION, USIM, and DEMANDQTY. I want to filter the DataFrame based on specific conditions.

The condition is that if the DESCRIPTION column contains the words "kids" or "baby", I want to include all the values from that INVOICE_ID in the filtered DataFrame. In other words, at least one item in the transaction should belong to the kids or baby category for the entire transaction to be included.

I tried using the str.contains() method in combination with a regular expression pattern, but I'm having trouble getting the desired results.

Here's my code:

import pandas as pd

# Assuming the DataFrame is named 'df'

# Filter the DataFrame based on the condition
filtered_df = df[df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)]

# Print the filtered DataFrame
filtered_df

However, this code does not provide the expected results. It filters the data frame based on individual rows rather than considering the entire transaction.

The test data is below:

import pandas as pd
import random
import string
import numpy as np

random.seed(42)
np.random.seed(42)

num_transactions = 100
max_items_per_transaction = 6

# Generate a list of possible items
possible_items = [
    "Kids T-shirt", "Baby Onesie", "Kids Socks",
    "Men's Shirt", "Women's Dress", "Kids Pants",
    "Baby Hat", "Women's Shoes", "Men's Pants",
    "Kids Jacket", "Baby Bib", "Men's Hat",
    "Women's Skirt", "Kids Shoes", "Baby Romper",
    "Men's Sweater", "Kids Gloves", "Baby Blanket"
]

# Create the DataFrame
rows = []

for i in range(num_transactions):
    num_items = random.randint(1, max_items_per_transaction)
    items = random.sample(possible_items, num_items)
    invoice_dates = pd.date_range(start='2022-01-01', periods=num_items, freq='D')
    countries = random.choices(['USA', 'Canada', 'UK'], k=num_items)
    customer_id = i + 1
    invoice_id = 1001 + i

    for j in range(num_items):
        item = items[j]
        usim = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))  # Generate a random 6-character USIM value
        demand_qty = random.randint(1, 10)

        row = {
            'INVOICE_DATE': invoice_dates[j],
            'COUNTRY': countries[j],
            'CUSTOMER_ID': customer_id,
            'INVOICE_ID': invoice_id,
            'DESCRIPTION': item,
            'USIM': usim,
            'DEMANDQTY': demand_qty
        }
        rows.append(row)

df = pd.DataFrame(rows)

# Print the DataFrame
df

Can anyone please guide me on how to properly filter the DataFrame based on the described condition? I would greatly appreciate any help or suggestions. Thank you!

Answer 1

Score: 1

Suppose the following dataframe:

>>> df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
3     another         456
4         one         456

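You want to keep every row with INVOICE_ID=123, because 'kids' appears in the description of row 0. Build a row-level boolean mask, then broadcast it to each invoice with groupby/transform, so that an invoice is kept when at least one of its rows matches: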

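# True for rows whose description mentions "kids" or "baby" (case-insensitive)
m = df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)
# The max of a boolean group is True if any row matched; transform repeats it on every row of the invoice
filtered_df = df[m.groupby(df['INVOICE_ID']).transform('max')]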

Output:

>>> filtered_df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
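
As a self-contained sketch of the same idea (not part of the original answer), the snippet below rebuilds the toy frame and compares two equivalent ways to keep whole invoices. The names mask, by_transform, matching_ids and by_isin are illustrative, and na=False is an added assumption so that missing descriptions count as non-matching:

```python
import pandas as pd

# Toy data mirroring the example above
df = pd.DataFrame({
    "DESCRIPTION": ["kids", "hello", "world", "another", "one"],
    "INVOICE_ID": [123, 123, 123, 456, 456],
})

# Row-level flag; na=False (an assumption) treats missing descriptions as non-matching
mask = df["DESCRIPTION"].str.contains("kids|baby", case=False, regex=True, na=False)

# Variant 1: broadcast "any row in this invoice matched" to every row of the invoice
by_transform = df[mask.groupby(df["INVOICE_ID"]).transform("any")]

# Variant 2: collect the matching invoice IDs, then select all rows of those invoices
matching_ids = df.loc[mask, "INVOICE_ID"].unique()
by_isin = df[df["INVOICE_ID"].isin(matching_ids)]

print(by_transform)
print(by_transform.equals(by_isin))
```

Both variants should select the same rows. The isin form may be easier to read, while the transform form keeps everything in a single boolean mask that can be combined with other conditions.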

Posted by huangapple on 2023-06-29 03:23:05
When reposting, please keep the link to this article: https://go.coder-hub.com/76576150.html