Filtering DataFrame based on specific conditions in Python
Question
I have a DataFrame with the following columns: INVOICE_DATE, COUNTRY, CUSTOMER_ID, INVOICE_ID, DESCRIPTION, USIM, and DEMANDQTY. I want to filter the DataFrame based on specific conditions.
The condition is that if the DESCRIPTION column contains the words "kids" or "baby", I want to include all the values from that INVOICE_ID in the filtered DataFrame. In other words, at least one item in the transaction should belong to the kids or baby category for the entire transaction to be included.
I tried using the str.contains() method in combination with a regular expression pattern, but I'm having trouble getting the desired results.
Here's my code:
import pandas as pd
# Assuming the DataFrame is named 'df'
# Filter the DataFrame based on the condition
filtered_df = df[df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)]
# Print the filtered DataFrame
filtered_df
However, this code does not provide the expected results. It filters the data frame based on individual rows rather than considering the entire transaction.
Here is the test data:
import pandas as pd
import random
import string
import numpy as np
random.seed(42)
np.random.seed(42)
num_transactions = 100
max_items_per_transaction = 6
# Generate a list of possible items
possible_items = [
    "Kids T-shirt", "Baby Onesie", "Kids Socks",
    "Men's Shirt", "Women's Dress", "Kids Pants",
    "Baby Hat", "Women's Shoes", "Men's Pants",
    "Kids Jacket", "Baby Bib", "Men's Hat",
    "Women's Skirt", "Kids Shoes", "Baby Romper",
    "Men's Sweater", "Kids Gloves", "Baby Blanket"
]
# Create the DataFrame
rows = []
for i in range(num_transactions):
    num_items = random.randint(1, max_items_per_transaction)
    items = random.sample(possible_items, num_items)
    invoice_dates = pd.date_range(start='2022-01-01', periods=num_items, freq='D')
    countries = random.choices(['USA', 'Canada', 'UK'], k=num_items)
    customer_id = i + 1
    invoice_id = 1001 + i
    for j in range(num_items):
        item = items[j]
        usim = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))  # Generate a random 6-character USIM value
        demand_qty = random.randint(1, 10)
        row = {
            'INVOICE_DATE': invoice_dates[j],
            'COUNTRY': countries[j],
            'CUSTOMER_ID': customer_id,
            'INVOICE_ID': invoice_id,
            'DESCRIPTION': item,
            'USIM': usim,
            'DEMANDQTY': demand_qty
        }
        rows.append(row)
df = pd.DataFrame(rows)
# Print the DataFrame
df
Can anyone please guide me on how to properly filter the DataFrame based on the described condition? I would greatly appreciate any help or suggestions. Thank you!
Answer 1
Score: 1
Suppose the following dataframe:
>>> df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
3     another         456
4         one         456
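You want to keep every row with INVOICE_ID=123 because 'kids' appears in the description of row 0. Build a row-level boolean mask, then broadcast its per-invoice maximum back to every row of that invoice: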
m = df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)
filtered_df = df[m.groupby(df['INVOICE_ID']).transform('max')]
Output:
>>> filtered_df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
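To make the mechanics explicit, here is a minimal self-contained sketch on the same five-row frame. It uses transform('any') instead of transform('max'); on a boolean mask the two should select the same rows, and 'any' states the intent more directly. The na=False argument and the isin-based alternative are additions for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "DESCRIPTION": ["kids", "hello", "world", "another", "one"],
    "INVOICE_ID": [123, 123, 123, 456, 456],
})

# Row-level mask: True only where the description itself matches.
# na=False (an assumption here) treats missing descriptions as non-matches.
m = df["DESCRIPTION"].str.contains("kids|baby", case=False, regex=True, na=False)

# Broadcast per invoice: True for every row whose invoice has at least one match.
keep = m.groupby(df["INVOICE_ID"]).transform("any")
filtered_any = df[keep]

# Alternative without groupby: collect the matching invoice IDs, then select them.
matching_ids = df.loc[m, "INVOICE_ID"].unique()
filtered_isin = df[df["INVOICE_ID"].isin(matching_ids)]
```

Both filtered_any and filtered_isin keep the three rows of invoice 123 and drop invoice 456. The isin variant may be easier to read when you also need the list of matching invoice IDs for something else.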