合并两个或多个表,使用Python检查唯一值。

huangapple go评论87阅读模式
英文:

Merge two or more tables checking the unique value using python

问题

以下是翻译好的部分:

我是Python的初学者。
我必须检查2或3个表格,看看是否有一些数据缺失。
这些数据是`pandas.core.frame.DataFrame`。
我有这些表格:

    start	end	val	accn	fy	fp	form	filed	frame
    0	2016-01-01	2016-12-31	90272000000	0001652044-19-000004	2018	FY	10-K	2019-02-05	CY2016
    6	2017-01-01	2017-12-31	110855000000	0001652044-19-000004	2018	FY	10-K	2019-02-05	NaN
    7	2017-01-01	2017-12-31	110855000000	0001652044-20-000008	2019	FY	10-K	2020-02-04	CY2017
    18	2018-01-01	2018-12-31	136819000000	0001652044-19-000004	2018	FY	10-K	2019-02-05	NaN
    19	2018-01-01	2018-12-31	136819000000	0001652044-20-000008	2019	FY	10-K	2020-02-04	CY2018
    30	2019-01-01	2019-12-31	161857000000	0001652044-20-000008	2019	FY	10-K	2020-02-04	CY2019
    41	2020-01-01	2020-12-31	182527000000	0001652044-23-000016	2022	FY	10-K	2023-02-03	CY2020
    52	2021-01-01	2021-12-31	257637000000	0001652044-23-000016	2022	FY	10-K	2023-02-03	CY2021
    61	2022-01-01	2022-12-31	282836000000	0001652044-23-000016	2022	FY	10-K	2023-02-03	CY2022

和

    start	end	val	accn	fy	fp	form	filed	frame
    0	2006-12-31	2007-12-29	39474000000	0001193125-10-036385	2009	FY	10-K	2010-02-22	CY2007
    4	2007-12-30	2008-12-27	43251000000	0001193125-11-040427	2010	FY	10-K	2011-02-18	NaN
    5	2007-12-30	2008-12-27	43251000000	0001193125-10-036385	2009	FY	10-K	2010-02-22	NaN
    13	2008-12-28	2009-12-26	43232000000	0001193125-12-081822	2011	FY	10-K	2012-02-27	CY2009
    15	2008-12-28	2009-12-26	43232000000	0001193125-11-040427	2010	FY	10-K	2011-02-18	NaN
    16	2008-12-28	2009-12-26	43232000000	0001193125-10-036385	2009	FY	10-K	2010-02-22	NaN
    27	2009-12-27	2010-12-25	57838000000	0001445305-13-000278	2012	FY	10-K	2013-02-21	CY2010
    28	2009-12-27	2010-12-25	57838000000	0001193125-12-081822	2011	FY	10-K	2012-02-27	NaN
    30	2009-12-27	2010-12-25	57838000000	0001193125-11-040427	2010	FY	10-K	2011-02-18	NaN
    41	2010-12-26	2011-12-31	66504000000	0001445305-13-000278	2012	FY	10-K	2013-02-21	NaN
    42	2010-12-26	2011-12-31	66504000000	0001193125-12-081822	2011	FY	10-K	2012-02-27	NaN
    43	2010-12-26	2011-12-31	66504000000	0000077476-14-000007	2013	FY	10-K	2014-02-14	CY2011
    54	2012-01-01	2012-12-29	65492000000	0001445305-13-000278	2012	FY	10-K	2013-02-21	NaN
    56	2012-01-01	2012-12-29	65492000000	0000077476-15-000012	2014	FY	10-K	2015-02-12	NaN
    57	2012-01-01	2012-12-29	65492000000	0000077476-14-000007	2013	FY	10-K	2014-02-14	NaN
    68	2012-12-30	2013-12-28	66415000000	0000077476-16-000066	2015	FY	10-K	2016-02-11	CY2013
    70	2012-12-30	2013-12-28	66415000000	0000077476-15-000012	2014	FY	10-K	2015-02-12	NaN
    71	2012-12-30	2013-12-28	66415000000	0000077476-14-000007	2013	FY	10-K	2014-02-14	NaN
    82	2013-12-29	2014-12-27	66683000000	0000077476-17-000010	2016	FY	10-K	2017-02-15	CY2014
    83	2013-12-29	2014

<details>
<summary>英文:</summary>

I am a beginner with Python.
I have to check 2 or 3 tables and see if some data is missing.
The data are `pandas.core.frame.DataFrame`.
I have these tables:

    start	end	val	accn	fy	fp	form	filed	frame
    0	2016-01-01	2016-12-31	90272000000	0001652044-19-000004	2018	FY	10-K	2019-02-05	CY2016
    6	2017-01-01	2017-12-31	110855000000	0001652044-19-000004	2018	FY	10-K	2019-02-05	NaN
    7	2017-01-01	2017-12-31	110855000000	0001652044-20-000008	2019	FY	10-K	2020-02-04	CY2017
    18	2018-01-01	2018-12-31	136819000000	0001652044-19-000004	2018	FY	10-K	2019-02-05	NaN
    19	2018-01-01	2018-12-31	136819000000	0001652044-20-000008	2019	FY	10-K	2020-02-04	CY2018
    30	2019-01-01	2019-12-31	161857000000	0001652044-20-000008	2019	FY	10-K	2020-02-04	CY2019
    41	2020-01-01	2020-12-31	182527000000	0001652044-23-000016	2022	FY	10-K	2023-02-03	CY2020
    52	2021-01-01	2021-12-31	257637000000	0001652044-23-000016	2022	FY	10-K	2023-02-03	CY2021
    61	2022-01-01	2022-12-31	282836000000	0001652044-23-000016	2022	FY	10-K	2023-02-03	CY2022

and 

    start	end	val	accn	fy	fp	form	filed	frame
    0	2006-12-31	2007-12-29	39474000000	0001193125-10-036385	2009	FY	10-K	2010-02-22	CY2007
    4	2007-12-30	2008-12-27	43251000000	0001193125-11-040427	2010	FY	10-K	2011-02-18	NaN
    5	2007-12-30	2008-12-27	43251000000	0001193125-10-036385	2009	FY	10-K	2010-02-22	NaN
    13	2008-12-28	2009-12-26	43232000000	0001193125-12-081822	2011	FY	10-K	2012-02-27	CY2009
    15	2008-12-28	2009-12-26	43232000000	0001193125-11-040427	2010	FY	10-K	2011-02-18	NaN
    16	2008-12-28	2009-12-26	43232000000	0001193125-10-036385	2009	FY	10-K	2010-02-22	NaN
    27	2009-12-27	2010-12-25	57838000000	0001445305-13-000278	2012	FY	10-K	2013-02-21	CY2010
    28	2009-12-27	2010-12-25	57838000000	0001193125-12-081822	2011	FY	10-K	2012-02-27	NaN
    30	2009-12-27	2010-12-25	57838000000	0001193125-11-040427	2010	FY	10-K	2011-02-18	NaN
    41	2010-12-26	2011-12-31	66504000000	0001445305-13-000278	2012	FY	10-K	2013-02-21	NaN
    42	2010-12-26	2011-12-31	66504000000	0001193125-12-081822	2011	FY	10-K	2012-02-27	NaN
    43	2010-12-26	2011-12-31	66504000000	0000077476-14-000007	2013	FY	10-K	2014-02-14	CY2011
    54	2012-01-01	2012-12-29	65492000000	0001445305-13-000278	2012	FY	10-K	2013-02-21	NaN
    56	2012-01-01	2012-12-29	65492000000	0000077476-15-000012	2014	FY	10-K	2015-02-12	NaN
    57	2012-01-01	2012-12-29	65492000000	0000077476-14-000007	2013	FY	10-K	2014-02-14	NaN
    68	2012-12-30	2013-12-28	66415000000	0000077476-16-000066	2015	FY	10-K	2016-02-11	CY2013
    70	2012-12-30	2013-12-28	66415000000	0000077476-15-000012	2014	FY	10-K	2015-02-12	NaN
    71	2012-12-30	2013-12-28	66415000000	0000077476-14-000007	2013	FY	10-K	2014-02-14	NaN
    82	2013-12-29	2014-12-27	66683000000	0000077476-17-000010	2016	FY	10-K	2017-02-15	CY2014
    83	2013-12-29	2014-12-27	66683000000	0000077476-16-000066	2015	FY	10-K	2016-02-11	NaN
    85	2013-12-29	2014-12-27	66683000000	0000077476-15-000012	2014	FY	10-K	2015-02-12	NaN
    96	2014-12-28	2015-12-26	63056000000	0000077476-18-000012	2017	FY	10-K	2018-02-13	CY2015
    97	2014-12-28	2015-12-26	63056000000	0000077476-17-000010	2016	FY	10-K	2017-02-15	NaN
    98	2014-12-28	2015-12-26	63056000000	0000077476-16-000066	2015	FY	10-K	2016-02-11	NaN
    109	2015-12-27	2016-12-31	62799000000	0000077476-18-000012	2017	FY	10-K	2018-02-13	CY2016
    110	2015-12-27	2016-12-31	62799000000	0000077476-17-000010	2016	FY	10-K	2017-02-15	NaN
    117	2017-01-01	2017-12-30	63525000000	0000077476-18-000012	2017	FY	10-K	2018-02-13	CY2017

So the target is to check and merge the data from all years.
So basically I have to check the 2 tables and create a new table with all CY Years.
For example the first table start from 2016, but the second from 2007. So I need to create a table with `val` and `frame` merged and unique values. The table basically should have all years found from an API.

Example result:

    39474000000 2007
    43232000000  2009
    57838000000 2010
    ...
    110855000000     2017
    etc etc
There is a thing I have to check if both tables contain data for `the same year`(example `2017`), then I have to check which `val` is bigger and insert this in the new table.
How can  I do an iter for loo year by year and check the val value between the two tables?


</details>


# 答案1
**得分**: 1

如果您想忽略`frame`列中包含`NaN`的行,您可以在应用`max`之前,简单地连接您过滤后的数据框并按`frame`进行分组:

```python
df1 = df1.dropna(subset='frame')[['val', 'frame']]
df2 = df2.dropna(subset='frame')[['val', 'frame']]

pd.concat([
    df1[~df1['frame'].str.contains('Q')],
    df2[~df2['frame'].str.contains('Q')]
]).groupby('frame').max()

输出:

                 val
frame
CY2007   39474000000
CY2009   43232000000
CY2010   57838000000
CY2011   66504000000
CY2013   66415000000
CY2014   66683000000
CY2015   63056000000
CY2016   90272000000
CY2017  110855000000
CY2018  136819000000
CY2019  161857000000
CY2020  182527000000
CY2021  257637000000
CY2022  282836000000

编辑:不包括frame单元格中包含'Q'的行。

英文:

If you want to ignore rows where frame is NaN, you can simply concat your filtered dfs and group by frame before applying max:

df1 = df1.dropna(subset=&#39;frame&#39;)[[&#39;val&#39;, &#39;frame&#39;]]
df2 = df2.dropna(subset=&#39;frame&#39;)[[&#39;val&#39;, &#39;frame&#39;]]

pd.concat([
    df1[~df1[&#39;frame&#39;].str.contains(&#39;Q&#39;)], 
    df2[~df2[&#39;frame&#39;].str.contains(&#39;Q&#39;)]
]).groupby(&#39;frame&#39;).max()

Output:

                 val
frame
CY2007   39474000000
CY2009   43232000000
CY2010   57838000000
CY2011   66504000000
CY2013   66415000000
CY2014   66683000000
CY2015   63056000000
CY2016   90272000000
CY2017  110855000000
CY2018  136819000000
CY2019  161857000000
CY2020  182527000000
CY2021  257637000000
CY2022  282836000000

Edit: excluding frame cells containing &#39;Q&#39;.

huangapple
  • 本文由 发表于 2023年8月10日 20:22:33
  • 转载请务必保留本文链接:https://go.coder-hub.com/76875706.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定