Conditionally joining strings within a pandas groupby based on matching values within a subgroup

Question

As I understand it, you want to identify combinations of 'Bvg' and 'Food' logged on the same day and at the same time in a dataframe, and then build a new dataframe with the time and the corresponding combinations. Here's a possible approach in Python:

import pandas as pd

# Your dataframe
data = [
    ["01/11/2019", "A", "Bvg", 1, 0, 1, "-", "Water"],
    # (... add the rest of your data ...)
]

df = pd.DataFrame(data, columns=["DATE", "USER", "TYPE", "MORNING", "AFTN", "NIGHT", "FOOD", "BVG"])

# Create a new column 'TIME' by joining the names of the MORNING, AFTN, and NIGHT columns set to 1
df['TIME'] = df[['MORNING', 'AFTN', 'NIGHT']].apply(lambda x: ', '.join([col for col in x.index if x[col] == 1]), axis=1)

# Split the joined string so that each time slot gets its own row
df = df.assign(TIME=df['TIME'].str.split(', ')).explode('TIME').reset_index(drop=True)
df = df[df['TIME'] != '']  # drop entries with no time slot set

# Separate the 'Bvg' rows from the 'Food' rows
keys = ['DATE', 'USER', 'TIME']
bvg_df = df.loc[df['TYPE'] == 'Bvg', keys + ['BVG']]
food_df = df.loc[df['TYPE'] == 'Food', keys + ['FOOD']]

# Pair every beverage with every food that shares the same DATE, USER, and TIME
result_df = bvg_df.merge(food_df, on=keys)
result_df['COMBINATION'] = result_df['BVG'] + ', ' + result_df['FOOD']
result_df = result_df[['TIME', 'COMBINATION']]

print(result_df)

This code first creates a 'TIME' column by joining the names of the MORNING, AFTN, and NIGHT columns that are set to 1, then splits that string so each time slot gets its own row. It then separates the 'Bvg' rows from the 'Food' rows and merges them on DATE, USER, and TIME, so each output row pairs exactly one beverage with one food.

Once the rest of the data is filled in, result_df should contain the time and the corresponding combinations.

English:

I have a dataframe in which each user can log multiple entries across a few days. It looks a bit like this:

DATE        USER  TYPE  MORNING  AFTN  NIGHT  FOOD    BVG
01/11/2019  A     Bvg   1        0     1      -       Water
01/11/2019  A     Bvg   1        0     0      -       Juice
01/11/2019  A     Food  0        1     1      Rice    -
01/11/2019  A     Food  1        0     0      Noodle  -
02/11/2019  A     Bvg   1        0     0      -       Coffee
02/11/2019  A     Food  0        0     1      Bread   -
02/11/2019  A     Bvg   0        0     1      -       Tea
01/11/2019  B     Bvg   1        0     0      -       Water
01/11/2019  B     Bvg   0        1     0      -       Tea
01/11/2019  B     Food  1        0     0      Rice    -

I need to identify all the combinations of Bvg/Food within the same day and same time. The problem is that the TIME data come in multiple binary columns: I need to find each bvg/food pair in which both entries have the value 1 in the same time column within the sub-group, and identify which time (MORNING/AFTN/NIGHT) the combination falls on.

The final dataframe should have the food/bvg combination and a single time column. For example, for the dataset above, the expected outcome would be:

TIME     COMBINATION
MORNING  Water, Noodle
MORNING  Juice, Noodle
NIGHT    Water, Rice
NIGHT    Tea, Bread
MORNING  Water, Rice

Each combination needs to consist of exactly 1 food and 1 beverage.
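To make the pairing rule concrete, here is a minimal sketch for a single subgroup (user A, 01/11/2019, MORNING) using the standard library's itertools.product. The two lists are copied by hand from the sample data, and the variable names are illustrative only.

```python
from itertools import product

# One subgroup from the sample data: user A, 01/11/2019, MORNING.
beverages = ["Water", "Juice"]
foods = ["Noodle"]

# Every (beverage, food) pair takes exactly one item from each list.
combinations = [f"{bvg}, {food}" for bvg, food in product(beverages, foods)]

print(combinations)  # ['Water, Noodle', 'Juice, Noodle']
```

The open question is how to build these per-subgroup lists when the time slot is spread over three binary columns.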

I've tried collapsing the time columns into a new TIME column holding a joined string, for example:

DATE        USER  TYPE  MORNING  AFTN  NIGHT  FOOD  BVG    TIME
01/11/2019  A     Bvg   1        0     1      -     Water  MORNING, NIGHT

I then grouped the data by date, user, and time, but I cannot limit each combination to just two items.
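A small sketch of why the joined string is awkward as a grouping key, using two rows copied from the sample data. The exact code used in the attempt is not shown in the question, so the way TIME is built here is an assumption.

```python
import pandas as pd

# Two rows copied from the sample data: both are flagged for NIGHT.
df = pd.DataFrame(
    {
        "DATE": ["01/11/2019", "01/11/2019"],
        "USER": ["A", "A"],
        "TYPE": ["Bvg", "Food"],
        "MORNING": [1, 0],
        "AFTN": [0, 1],
        "NIGHT": [1, 1],
        "FOOD": ["-", "Rice"],
        "BVG": ["Water", "-"],
    }
)

time_cols = ["MORNING", "AFTN", "NIGHT"]

# Assumed reconstruction of the joined TIME string described above.
df["TIME"] = df[time_cols].apply(
    lambda row: ", ".join(col for col in time_cols if row[col] == 1), axis=1
)

# Grouping on the joined string compares whole strings, not individual slots.
groups = df.groupby(["DATE", "USER", "TIME"])["TYPE"].agg(list)

print(groups)
```

"MORNING, NIGHT" and "AFTN, NIGHT" are different keys, so the beverage and the food land in separate groups even though they share the NIGHT slot. Giving each time slot its own row before grouping or merging avoids this.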

I'm quite new to Python and I can't work out whether there's a way to do this. Any help or clue would be much appreciated.

Answer 1

Score: 0


You can find all beverage-food combinations per day, per time of day, and per user (you don't explicitly mention grouping by user, but your expected outcome requires it) by:

  1. Converting the data into long format
  2. Performing a cross join of all food and beverages, joining on day, time of day, and user
  3. Dropping duplicates

The code below returns the desired result. Let me know if this helped!

# PREPARATORY STEP: LOAD DATA USING SOLUTION FROM https://stackoverflow.com/a/53692642/8718701
from io import StringIO
import pandas as pd
d = '''
DATE        USER    TYPE     MORNING   AFTN   NIGHT   FOOD      BVG
01/11/2019        A       Bvg        1       0       1     -       Water
01/11/2019        A       Bvg        1       0       0     -       Juice
01/11/2019        A       Food       0       1       1     Rice     -
01/11/2019        A       Food       1       0       0     Noodle   -
02/11/2019        A       Bvg        1       0       0     -       Coffee
02/11/2019        A       Food       0       0       1     Bread    -
02/11/2019        A       Bvg        0       0       1     -        Tea
01/11/2019        B       Bvg        1       0       0     -        Water     
01/11/2019        B       Bvg        0       1       0     -        Tea
01/11/2019        B       Food       1       0       0     Rice      -
'''

df = pd.read_csv(StringIO(d), sep=r'\s+')  # raw string avoids an invalid-escape warning

# STEP 1 - CONVERT DATA INTO LONG FORMAT
df = pd.melt(
    df,
    id_vars=['DATE', 'USER', 'FOOD', 'BVG'],
    value_vars=['MORNING', 'AFTN', 'NIGHT'],
    var_name='TIME_OF_DAY',
    value_name='VALUE'
)

df = df.loc[df['VALUE'] == 1, :]  # remove empty rows
df = df.drop(columns=['VALUE'])

df = pd.melt(
    df,
    id_vars=['DATE', 'USER', 'TIME_OF_DAY'],
    value_vars=['FOOD', 'BVG'],
    var_name='PRODUCT_TYPE',
    value_name='PRODUCT'
)

df = df.loc[df['PRODUCT'] != '-', :]  # remove empty rows

# STEP 2 - CROSS JOIN OF BVG AND FOOD TO IDENTIFY ALL COMBINATIONS FOR SAME DAY, TIME, AND USER
df = pd.merge(
    df.loc[df['PRODUCT_TYPE'] == 'BVG', :],
    df.loc[df['PRODUCT_TYPE'] == 'FOOD', :],
    on=['DATE', 'TIME_OF_DAY', 'USER'])

df['PRODUCT_COMBINATION'] = df.agg('{0[PRODUCT_x]}, {0[PRODUCT_y]}'.format, axis=1)

# STEP 3 - SELECT COLUMNS REQUIRED FOR FINAL OUTPUT AND DROP DUPLICATES
df = df.loc[:, ['DATE', 'TIME_OF_DAY', 'PRODUCT_COMBINATION']]
df = df.drop_duplicates()

print(df.to_markdown(index=False))  # to_markdown requires the tabulate package

Returns:

| DATE       | TIME_OF_DAY   | PRODUCT_COMBINATION   |
|:-----------|:--------------|:----------------------|
| 01/11/2019 | MORNING       | Water, Noodle         |
| 01/11/2019 | MORNING       | Juice, Noodle         |
| 01/11/2019 | MORNING       | Water, Rice           |
| 01/11/2019 | NIGHT         | Water, Rice           |
| 02/11/2019 | NIGHT         | Tea, Bread            |
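
The same steps can also be wrapped in a small function that keeps USER in the output, which may help because two users could log the same combination in the same slot. This is only a sketch under the column names used above; the helper name pair_food_and_beverages and the TIME, FLAG, and COMBINATION column names are illustrative, not part of the original answer.

```python
from io import StringIO

import pandas as pd

RAW = """
DATE USER TYPE MORNING AFTN NIGHT FOOD BVG
01/11/2019 A Bvg 1 0 1 - Water
01/11/2019 A Bvg 1 0 0 - Juice
01/11/2019 A Food 0 1 1 Rice -
01/11/2019 A Food 1 0 0 Noodle -
02/11/2019 A Bvg 1 0 0 - Coffee
02/11/2019 A Food 0 0 1 Bread -
02/11/2019 A Bvg 0 0 1 - Tea
01/11/2019 B Bvg 1 0 0 - Water
01/11/2019 B Bvg 0 1 0 - Tea
01/11/2019 B Food 1 0 0 Rice -
"""


def pair_food_and_beverages(df):
    """Hypothetical helper: pair beverages with foods per date, user, and time slot."""
    # One row per (entry, time slot); keep only the slots flagged with 1.
    long_df = df.melt(
        id_vars=["DATE", "USER", "TYPE", "FOOD", "BVG"],
        value_vars=["MORNING", "AFTN", "NIGHT"],
        var_name="TIME",
        value_name="FLAG",
    )
    long_df = long_df[long_df["FLAG"] == 1]

    keys = ["DATE", "USER", "TIME"]
    bvg = long_df.loc[long_df["TYPE"] == "Bvg", keys + ["BVG"]]
    food = long_df.loc[long_df["TYPE"] == "Food", keys + ["FOOD"]]

    # The inner merge pairs every beverage with every food sharing the same keys.
    pairs = bvg.merge(food, on=keys)
    pairs["COMBINATION"] = pairs["BVG"] + ", " + pairs["FOOD"]
    return pairs[keys + ["COMBINATION"]]


df = pd.read_csv(StringIO(RAW), sep=r"\s+")
result = pair_food_and_beverages(df)
print(result.to_string(index=False))
```

On the sample data this yields the same five combinations as the table above, with the user attached to each row.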

huangapple
  • Posted on May 15, 2023, 11:31:45
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76250734.html