May 10, 2023, 18:35:28

Merge all data frames into a single file

Question

I have a list of data frames which looks like this:

df1

Seller	Vaule	Date
Hari	2	Jan-02
Om	12	Jan-02
Cat	0	Jan-02
Mat	14	Jan-02
John	0	Jan-02
Messi	1	Jan-02
John	0	Jan-02
Ronaldo	7	Jan-02

Df2

Seller	Vaule	Date
Hari	0	Jan-03
Om	14	Jan-03
Cat	NA	Jan-03
Mat	14	Jan-03
Ron	0	Jan-03
Messi	10	Jan-03
John	0	Jan-03
Ronaldo	17	Jan-03

I have many data frames like these.

I want to merge them all into a single data frame and keep only the sellers whose values are greater than 0 in every frame.

The final DataFrame would look like this:

Seller	Vaule_jan02	Vaule_jan03
Om	12	14
Mat	14	14
Messi	1	10
Ronaldo	7	17

Note: In df1 Hari's value is greater than 0, but in Df2 it is 0, so Hari is not included in the final DataFrame.
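To make the answers below reproducible, the two sample frames could be built roughly like this. This is only a sketch: it assumes the column names exactly as spelled in the question (including `Vaule`), represents the NA as `float("nan")`, and binds the second frame to both `df2` and `Df2` because the answers use both spellings.

```python
import pandas as pd

# Sample data copied from the question; 'Vaule' is the column name used there.
df1 = pd.DataFrame({
    "Seller": ["Hari", "Om", "Cat", "Mat", "John", "Messi", "John", "Ronaldo"],
    "Vaule": [2, 12, 0, 14, 0, 1, 0, 7],
    "Date": "Jan-02",  # a scalar is broadcast to every row
})

df2 = pd.DataFrame({
    "Seller": ["Hari", "Om", "Cat", "Mat", "Ron", "Messi", "John", "Ronaldo"],
    "Vaule": [0, 14, float("nan"), 14, 0, 10, 0, 17],  # NaN stands in for the NA
    "Date": "Jan-03",
})

# The question calls the second frame 'Df2'; keep both names available.
Df2 = df2

print(df1)
print(df2)
```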

Answer 1

Score: 2

Concatenate the DataFrames with `concat` in the desired order, then use `pivot_table` to get the last value per Seller and date, and post-filter with `loc`, `gt` and `all` to keep only the rows whose values are all positive:

import pandas as pd

(pd.concat([df1, df2])
   # sort chronologically; '%b-%y' reads 'Jan-02' as month-year (use '%b-%d' if it is month-day)
   .sort_values(by='Date', key=lambda x: pd.to_datetime(x, format='%b-%y'))
   # reshape to wide form, keeping only the last value per Seller and Date
   .pivot_table(index='Seller', columns='Date', values='Vaule', aggfunc='last')
   # keep only rows with all positive values (NaN compares as False)
   .loc[lambda d: d.gt(0).all(axis=1)]
   .reset_index().rename_axis(columns=None)
)

Output:

    Seller  Jan-02  Jan-03
0      Mat    14.0    14.0
1    Messi     1.0    10.0
2       Om    12.0    14.0
3  Ronaldo     7.0    17.0

Intermediate result before the `loc` filtering step:

Date     Jan-02  Jan-03
Seller                 
Cat         0.0     NaN
Hari        2.0     0.0
John        0.0     0.0
Mat        14.0    14.0  # only these rows
Messi       1.0    10.0  # have all-positive
Om         12.0    14.0  # values: keep them
Ron         NaN     0.0
Ronaldo     7.0    17.0  #
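Since the question mentions many data frames, the same concat, pivot and filter idea could be wrapped in a function that takes a list. The following is a sketch under stated assumptions: `merge_positive` and `make` are hypothetical names, the frames are assumed to arrive in chronological order (so no date parsing is needed), and the third day, Jan-04, is made-up data used only to show more than two frames.

```python
import pandas as pd


def merge_positive(frames, value_col="Vaule"):
    """Combine long frames into one wide frame and keep the sellers whose
    value is > 0 on every date. `frames` must be in chronological order."""
    long = pd.concat(frames, ignore_index=True)
    # Remember the dates in first-seen order so the columns stay chronological.
    dates = list(dict.fromkeys(long["Date"]))
    wide = (long.pivot_table(index="Seller", columns="Date",
                             values=value_col, aggfunc="last")
                .reindex(columns=dates))
    # NaN (seller missing on a date) compares as False, so such rows are dropped too.
    keep = wide.gt(0).all(axis=1)
    return wide.loc[keep].reset_index().rename_axis(columns=None)


def make(date, values):
    """Hypothetical helper: build one daily frame from a {seller: value} dict."""
    return pd.DataFrame({"Seller": list(values),
                         "Vaule": list(values.values()),
                         "Date": date})


frames = [
    make("Jan-02", {"Hari": 2, "Om": 12, "Mat": 14, "Messi": 1}),
    make("Jan-03", {"Hari": 0, "Om": 14, "Mat": 14, "Messi": 10}),
    make("Jan-04", {"Hari": 5, "Om": 3, "Mat": 0, "Ron": 9}),  # made-up third day
]

result = merge_positive(frames)
print(result)
```

With this made-up data only Om survives: Hari and Mat each have a 0 on some day, while Messi and Ron are missing on at least one day.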

Answer 2

Score: 2

Another possible solution:

(pd.concat([df1, df2])
 .drop_duplicates('Seller', keep='last')
 .loc[lambda x: x['Vaule'].ne(0)]
 .dropna())

Output:

    Seller  Vaule    Date
1       Om   14.0  Jan-03
3      Mat   14.0  Jan-03
5    Messi   10.0  Jan-03
7  Ronaldo   17.0  Jan-03

To produce the output format requested in the question:

(pd.concat([df1, df2])
 .drop_duplicates('Seller', keep='last')
 .loc[lambda x: x['Vaule'].ne(0)]
 .dropna()
 .merge(df1, on='Seller', suffixes=('_jan03', '_jan02'))
 .loc[:, ['Seller', 'Vaule_jan02', 'Vaule_jan03']])

Output:

    Seller  Vaule_jan02  Vaule_jan03
0       Om           12         14.0
1      Mat           14         14.0
2    Messi            1         10.0
3  Ronaldo            7         17.0

A more general approach, based on `reduce` combined with `merge`:

from functools import reduce

list_df = [df1, df2]

(reduce(lambda x, y:
    pd.merge(x, y, on='Seller', suffixes=['', f"_{y['Date'][0]}"]), list_df)
 .rename({'Vaule': f"Vaule_{df1['Date'][0]}"}, axis=1)
 .loc[lambda x: x.iloc[:,-2].gt(0), lambda x: ~x.columns.str.startswith('Date')])
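Note that the row filter above only checks the last value column. If a seller must be positive in every frame, one possible variation is to test all value columns at once. The sketch below assumes small made-up frames (Cat's Jan-03 value is changed to 5 to show the difference), and `tag` is a hypothetical helper that renames `Vaule` to `Vaule_<date>` before merging.

```python
from functools import reduce

import pandas as pd

# Small made-up frames; Cat is 0 on the first day but positive on the second.
df1 = pd.DataFrame({"Seller": ["Hari", "Om", "Cat"], "Vaule": [2, 12, 0], "Date": "Jan-02"})
df2 = pd.DataFrame({"Seller": ["Hari", "Om", "Cat"], "Vaule": [0, 14, 5], "Date": "Jan-03"})


def tag(df):
    # Rename 'Vaule' to 'Vaule_<date>' and drop the now redundant Date column.
    date = df["Date"].iloc[0]
    return df.drop(columns="Date").rename(columns={"Vaule": f"Vaule_{date}"})


# Inner-merge all tagged frames on Seller, one after another.
merged = reduce(lambda left, right: left.merge(right, on="Seller"),
                map(tag, [df1, df2]))

# Keep a row only if every 'Vaule_*' column is greater than 0.
value_cols = merged.filter(like="Vaule_")
result = merged[value_cols.gt(0).all(axis=1)].reset_index(drop=True)
print(result)
```

Here a latest-column-only filter would keep Cat, while the all-columns filter keeps only Om. Sellers that appear more than once in a frame (like John in df1) would produce repeated rows in the merge, so de-duplicating each frame first may be needed.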

Answer 3

Score: 1

Use `concat`, remove every seller that has any value less than or equal to 0, drop duplicate rows and aggregate with `sum`:

df = pd.concat([df1, Df2])

df = (df[~df['Seller'].isin(df.loc[df['Vaule'].le(0), 'Seller'])]
        .drop_duplicates()
        .groupby('Seller', as_index=False)['Vaule']
        .sum()
        )
print(df)
    Seller  Vaule
0      Mat   28.0
1    Messi   11.0
2       Om   26.0
3  Ronaldo   24.0
