May 15, 2023 11:31:45

Joining string within pandas' groupby conditionally based on matching values within subgroup

Question

I understand that you want to identify every pairing of a 'Bvg' entry with a 'Food' entry logged by the same user on the same day and in the same time slot, and then build a new dataframe with the time and the corresponding combination. Here's a possible approach in Python:

import pandas as pd
from itertools import product

# Your dataframe
data = [
    ["01/11/2019", "A", "Bvg", 1, 0, 1, "-", "Water"],
    # (... add the rest of your data ...)
]
df = pd.DataFrame(data, columns=["DATE", "USER", "TYPE", "MORNING", "AFTN", "NIGHT", "FOOD", "BVG"])

# Create a new column 'TIME' listing the time slots flagged with 1, then give each slot its own row
time_cols = ['MORNING', 'AFTN', 'NIGHT']
df['TIME'] = df[time_cols].apply(lambda x: [col for col in time_cols if x[col] == 1], axis=1)
long_df = df.explode('TIME').dropna(subset=['TIME'])

# Pair every beverage with every food inside one group
def pair_up(group):
    bvgs = group.loc[group['TYPE'] == 'Bvg', 'BVG']
    foods = group.loc[group['TYPE'] == 'Food', 'FOOD']
    return [f'{b}, {f}' for b, f in product(bvgs, foods)]

# Group by DATE, USER and TIME, then expand each list of pairs into one row per pair
pairs = long_df.groupby(['DATE', 'USER', 'TIME'])[['TYPE', 'BVG', 'FOOD']].apply(pair_up)
result_df = pairs.explode().dropna().rename('COMBINATION').reset_index()[['TIME', 'COMBINATION']]
print(result_df)

This code first builds a 'TIME' list for each row from the MORNING, AFTN, and NIGHT columns and explodes it so that every flagged time slot gets its own row. After grouping by DATE, USER, and TIME, it pairs each beverage with each food in the group, so every combination contains exactly one 'Bvg' and one 'Food'; groups that lack either type contribute nothing.

With the full dataset loaded, result_df should contain the same rows as the expected outcome, with the time and the corresponding combinations.

English:

I have a dataframe in which each user can log multiple entries across a few days. It looks a bit like this:

DATE        USER  TYPE  MORNING  AFTN  NIGHT  FOOD    BVG
01/11/2019  A     Bvg   1        0     1      -       Water
01/11/2019  A     Bvg   1        0     0      -       Juice
01/11/2019  A     Food  0        1     1      Rice    -
01/11/2019  A     Food  1        0     0      Noodle  -
02/11/2019  A     Bvg   1        0     0      -       Coffee
02/11/2019  A     Food  0        0     1      Bread   -
02/11/2019  A     Bvg   0        0     1      -       Tea
01/11/2019  B     Bvg   1        0     0      -       Water
01/11/2019  B     Bvg   0        1     0      -       Tea
01/11/2019  B     Food  1        0     0      Rice    -

I need to identify all the combinations of Bvg/Food within the same day and same time. The problem is that the time data comes in multiple binary columns. I need to find beverage/food pairs where both entries have the value 1 in the same time column within the sub-group, and identify which time (MORNING/AFTN/NIGHT) the combination falls on.

The final dataframe should have the food/bvg combination, and a single time column. For example for the dataset above, the expected outcome would be:

TIME     COMBINATION
MORNING  Water, Noodle
MORNING  Juice, Noodle
NIGHT    Water, Rice
NIGHT    Tea, Bread
MORNING  Water, Rice

Each combination must contain exactly one food and one beverage.

I've tried conflating the TIME column into a new column of a joint string, for example:

DATE        USER  TYPE  MORNING  AFTN  NIGHT  FOOD  BVG    TIME
01/11/2019  A     Bvg   1        0     1      -     Water  MORNING, NIGHT

I then grouped the data by date, user, and time, but I cannot limit each combination to just two items.

I'm quite new to Python and I'm flummoxed as to whether there's any way to do this. Any help or clue would be very much appreciated.
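
The code for the attempt described above was not posted, so the following is only an assumed reconstruction on three of the sample rows. It sketches why grouping on a joined TIME string falls short: a row flagged for both MORNING and NIGHT gets a label that differs from a row flagged for MORNING alone.

```python
import pandas as pd

# Three of the question's rows for user A on 01/11/2019
df = pd.DataFrame(
    [
        ("01/11/2019", "A", "Bvg", 1, 0, 1, "-", "Water"),
        ("01/11/2019", "A", "Bvg", 1, 0, 0, "-", "Juice"),
        ("01/11/2019", "A", "Food", 1, 0, 0, "Noodle", "-"),
    ],
    columns=["DATE", "USER", "TYPE", "MORNING", "AFTN", "NIGHT", "FOOD", "BVG"],
)

time_cols = ["MORNING", "AFTN", "NIGHT"]
# Assumed reconstruction: join the names of the flagged columns into one string per row
df["TIME"] = df[time_cols].apply(
    lambda row: ", ".join(col for col in time_cols if row[col] == 1), axis=1
)

# Count the rows that land in each (DATE, USER, TIME) group
group_sizes = df.groupby(["DATE", "USER", "TIME"]).size()
print(group_sizes)
```

Water is logged for MORNING as well, but its joined label is "MORNING, NIGHT", so it ends up in a different group from Juice and Noodle and the Water/Noodle pairing is missed. Giving each flagged time slot its own row before grouping avoids this.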

Answer 1

Score: 0

You can find all beverage-food combinations per day, time of day, and user (you don't explicitly mention grouping by user, but your expected outcome requires it) by:

Converting the data into long format
Performing a cross join of all food and beverages, joining on day, time of day, and user
Dropping duplicates

The code below returns the desired result. Let me know if this helped!

# PREPARATORY STEP: LOAD DATA USING SOLUTION FROM https://stackoverflow.com/a/53692642/8718701
from io import StringIO
import pandas as pd

d = '''
DATE        USER  TYPE  MORNING  AFTN  NIGHT  FOOD    BVG
01/11/2019  A     Bvg   1        0     1      -       Water
01/11/2019  A     Bvg   1        0     0      -       Juice
01/11/2019  A     Food  0        1     1      Rice    -
01/11/2019  A     Food  1        0     0      Noodle  -
02/11/2019  A     Bvg   1        0     0      -       Coffee
02/11/2019  A     Food  0        0     1      Bread   -
02/11/2019  A     Bvg   0        0     1      -       Tea
01/11/2019  B     Bvg   1        0     0      -       Water
01/11/2019  B     Bvg   0        1     0      -       Tea
01/11/2019  B     Food  1        0     0      Rice    -
'''
df = pd.read_csv(StringIO(d), sep=r'\s+')  # raw string avoids an invalid-escape warning

# STEP 1 - CONVERT DATA INTO LONG FORMAT
df = pd.melt(
    df,
    id_vars=['DATE', 'USER', 'FOOD', 'BVG'],
    value_vars=['MORNING', 'AFTN', 'NIGHT'],
    var_name='TIME_OF_DAY',
    value_name='VALUE'
)
df = df.loc[df['VALUE'] == 1, :]  # remove empty rows
df = df.drop(columns=['VALUE'])   # reassign instead of inplace to avoid SettingWithCopyWarning
df = pd.melt(
    df,
    id_vars=['DATE', 'USER', 'TIME_OF_DAY'],
    value_vars=['FOOD', 'BVG'],
    var_name='PRODUCT_TYPE',
    value_name='PRODUCT'
)
df = df.loc[df['PRODUCT'] != '-', :]  # remove empty rows

# STEP 2 - CROSS JOIN OF BVG AND FOOD TO IDENTIFY ALL COMBINATIONS FOR SAME DAY, TIME, AND USER
df = pd.merge(
    df.loc[df['PRODUCT_TYPE'] == 'BVG', :],
    df.loc[df['PRODUCT_TYPE'] == 'FOOD', :],
    on=['DATE', 'TIME_OF_DAY', 'USER'])
df['PRODUCT_COMBINATION'] = df.agg('{0[PRODUCT_x]}, {0[PRODUCT_y]}'.format, axis=1)

# STEP 3 - SELECT COLUMNS REQUIRED FOR FINAL OUTPUT AND DROP DUPLICATES
df = df.loc[:, ['DATE', 'TIME_OF_DAY', 'PRODUCT_COMBINATION']]
df = df.drop_duplicates()
print(df.to_markdown(index=False))  # to_markdown needs the optional 'tabulate' package

Returns:

| DATE       | TIME_OF_DAY   | PRODUCT_COMBINATION   |
|:-----------|:--------------|:----------------------|
| 01/11/2019 | MORNING       | Water, Noodle         |
| 01/11/2019 | MORNING       | Juice, Noodle         |
| 01/11/2019 | MORNING       | Water, Rice           |
| 01/11/2019 | NIGHT         | Water, Rice           |
| 02/11/2019 | NIGHT         | Tea, Bread            |
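
Because the last step selects only DATE, TIME_OF_DAY, and PRODUCT_COMBINATION before dropping duplicates, two users who log the same pair on the same day and time slot would collapse into a single row. The following is a minimal sketch, not part of the original answer, of a variant that keeps USER in the output. The function name `pair_food_and_beverage` and the helper column `FLAG` are illustrative assumptions, and the sketch relies on the TYPE column to separate beverages from foods.

```python
import pandas as pd


def pair_food_and_beverage(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per beverage/food pair for each DATE, USER and TIME_OF_DAY."""
    keys = ["DATE", "USER", "TIME_OF_DAY"]
    # One row per logged entry and time slot, keeping only the slots flagged with 1
    long_df = df.melt(
        id_vars=["DATE", "USER", "TYPE", "FOOD", "BVG"],
        value_vars=["MORNING", "AFTN", "NIGHT"],
        var_name="TIME_OF_DAY",
        value_name="FLAG",  # hypothetical helper column
    )
    long_df = long_df[long_df["FLAG"] == 1]
    bvg = long_df.loc[long_df["TYPE"] == "Bvg", keys + ["BVG"]]
    food = long_df.loc[long_df["TYPE"] == "Food", keys + ["FOOD"]]
    # An inner join on the keys pairs every beverage with every food in the same slot
    pairs = bvg.merge(food, on=keys)
    pairs["PRODUCT_COMBINATION"] = pairs["BVG"] + ", " + pairs["FOOD"]
    return pairs[keys + ["PRODUCT_COMBINATION"]].reset_index(drop=True)


# Sample data from the question
rows = [
    ("01/11/2019", "A", "Bvg", 1, 0, 1, "-", "Water"),
    ("01/11/2019", "A", "Bvg", 1, 0, 0, "-", "Juice"),
    ("01/11/2019", "A", "Food", 0, 1, 1, "Rice", "-"),
    ("01/11/2019", "A", "Food", 1, 0, 0, "Noodle", "-"),
    ("02/11/2019", "A", "Bvg", 1, 0, 0, "-", "Coffee"),
    ("02/11/2019", "A", "Food", 0, 0, 1, "Bread", "-"),
    ("02/11/2019", "A", "Bvg", 0, 0, 1, "-", "Tea"),
    ("01/11/2019", "B", "Bvg", 1, 0, 0, "-", "Water"),
    ("01/11/2019", "B", "Bvg", 0, 1, 0, "-", "Tea"),
    ("01/11/2019", "B", "Food", 1, 0, 0, "Rice", "-"),
]
columns = ["DATE", "USER", "TYPE", "MORNING", "AFTN", "NIGHT", "FOOD", "BVG"]
result = pair_food_and_beverage(pd.DataFrame(rows, columns=columns))
print(result)
```

Keeping USER as a column makes it clear which user each pair belongs to, and a later `drop_duplicates()` can still be applied if per-user detail is not needed.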
