合并两个或多个表,使用Python检查唯一值。

huangapple go评论126阅读模式
英文:

Merge two or more tables checking the unique value using python

问题

以下是翻译好的部分:

  1. 我是Python的初学者。
  2. 我必须检查23个表格,看看是否有一些数据缺失。
  3. 这些数据是`pandas.core.frame.DataFrame`
  4. 我有这些表格:
  5. start end val accn fy fp form filed frame
  6. 0 2016-01-01 2016-12-31 90272000000 0001652044-19-000004 2018 FY 10-K 2019-02-05 CY2016
  7. 6 2017-01-01 2017-12-31 110855000000 0001652044-19-000004 2018 FY 10-K 2019-02-05 NaN
  8. 7 2017-01-01 2017-12-31 110855000000 0001652044-20-000008 2019 FY 10-K 2020-02-04 CY2017
  9. 18 2018-01-01 2018-12-31 136819000000 0001652044-19-000004 2018 FY 10-K 2019-02-05 NaN
  10. 19 2018-01-01 2018-12-31 136819000000 0001652044-20-000008 2019 FY 10-K 2020-02-04 CY2018
  11. 30 2019-01-01 2019-12-31 161857000000 0001652044-20-000008 2019 FY 10-K 2020-02-04 CY2019
  12. 41 2020-01-01 2020-12-31 182527000000 0001652044-23-000016 2022 FY 10-K 2023-02-03 CY2020
  13. 52 2021-01-01 2021-12-31 257637000000 0001652044-23-000016 2022 FY 10-K 2023-02-03 CY2021
  14. 61 2022-01-01 2022-12-31 282836000000 0001652044-23-000016 2022 FY 10-K 2023-02-03 CY2022
  15. start end val accn fy fp form filed frame
  16. 0 2006-12-31 2007-12-29 39474000000 0001193125-10-036385 2009 FY 10-K 2010-02-22 CY2007
  17. 4 2007-12-30 2008-12-27 43251000000 0001193125-11-040427 2010 FY 10-K 2011-02-18 NaN
  18. 5 2007-12-30 2008-12-27 43251000000 0001193125-10-036385 2009 FY 10-K 2010-02-22 NaN
  19. 13 2008-12-28 2009-12-26 43232000000 0001193125-12-081822 2011 FY 10-K 2012-02-27 CY2009
  20. 15 2008-12-28 2009-12-26 43232000000 0001193125-11-040427 2010 FY 10-K 2011-02-18 NaN
  21. 16 2008-12-28 2009-12-26 43232000000 0001193125-10-036385 2009 FY 10-K 2010-02-22 NaN
  22. 27 2009-12-27 2010-12-25 57838000000 0001445305-13-000278 2012 FY 10-K 2013-02-21 CY2010
  23. 28 2009-12-27 2010-12-25 57838000000 0001193125-12-081822 2011 FY 10-K 2012-02-27 NaN
  24. 30 2009-12-27 2010-12-25 57838000000 0001193125-11-040427 2010 FY 10-K 2011-02-18 NaN
  25. 41 2010-12-26 2011-12-31 66504000000 0001445305-13-000278 2012 FY 10-K 2013-02-21 NaN
  26. 42 2010-12-26 2011-12-31 66504000000 0001193125-12-081822 2011 FY 10-K 2012-02-27 NaN
  27. 43 2010-12-26 2011-12-31 66504000000 0000077476-14-000007 2013 FY 10-K 2014-02-14 CY2011
  28. 54 2012-01-01 2012-12-29 65492000000 0001445305-13-000278 2012 FY 10-K 2013-02-21 NaN
  29. 56 2012-01-01 2012-12-29 65492000000 0000077476-15-000012 2014 FY 10-K 2015-02-12 NaN
  30. 57 2012-01-01 2012-12-29 65492000000 0000077476-14-000007 2013 FY 10-K 2014-02-14 NaN
  31. 68 2012-12-30 2013-12-28 66415000000 0000077476-16-000066 2015 FY 10-K 2016-02-11 CY2013
  32. 70 2012-12-30 2013-12-28 66415000000 0000077476-15-000012 2014 FY 10-K 2015-02-12 NaN
  33. 71 2012-12-30 2013-12-28 66415000000 0000077476-14-000007 2013 FY 10-K 2014-02-14 NaN
  34. 82 2013-12-29 2014-12-27 66683000000 0000077476-17-000010 2016 FY 10-K 2017-02-15 CY2014
  35. 83 2013-12-29 2014
  36. <details>
  37. <summary>英文:</summary>
  38. I am a beginner with Python.
  39. I have to check 2 or 3 tables and see if some data is missing.
  40. The data are `pandas.core.frame.DataFrame`.
  41. I have these tables:
  42. start end val accn fy fp form filed frame
  43. 0 2016-01-01 2016-12-31 90272000000 0001652044-19-000004 2018 FY 10-K 2019-02-05 CY2016
  44. 6 2017-01-01 2017-12-31 110855000000 0001652044-19-000004 2018 FY 10-K 2019-02-05 NaN
  45. 7 2017-01-01 2017-12-31 110855000000 0001652044-20-000008 2019 FY 10-K 2020-02-04 CY2017
  46. 18 2018-01-01 2018-12-31 136819000000 0001652044-19-000004 2018 FY 10-K 2019-02-05 NaN
  47. 19 2018-01-01 2018-12-31 136819000000 0001652044-20-000008 2019 FY 10-K 2020-02-04 CY2018
  48. 30 2019-01-01 2019-12-31 161857000000 0001652044-20-000008 2019 FY 10-K 2020-02-04 CY2019
  49. 41 2020-01-01 2020-12-31 182527000000 0001652044-23-000016 2022 FY 10-K 2023-02-03 CY2020
  50. 52 2021-01-01 2021-12-31 257637000000 0001652044-23-000016 2022 FY 10-K 2023-02-03 CY2021
  51. 61 2022-01-01 2022-12-31 282836000000 0001652044-23-000016 2022 FY 10-K 2023-02-03 CY2022
  52. and
  53. start end val accn fy fp form filed frame
  54. 0 2006-12-31 2007-12-29 39474000000 0001193125-10-036385 2009 FY 10-K 2010-02-22 CY2007
  55. 4 2007-12-30 2008-12-27 43251000000 0001193125-11-040427 2010 FY 10-K 2011-02-18 NaN
  56. 5 2007-12-30 2008-12-27 43251000000 0001193125-10-036385 2009 FY 10-K 2010-02-22 NaN
  57. 13 2008-12-28 2009-12-26 43232000000 0001193125-12-081822 2011 FY 10-K 2012-02-27 CY2009
  58. 15 2008-12-28 2009-12-26 43232000000 0001193125-11-040427 2010 FY 10-K 2011-02-18 NaN
  59. 16 2008-12-28 2009-12-26 43232000000 0001193125-10-036385 2009 FY 10-K 2010-02-22 NaN
  60. 27 2009-12-27 2010-12-25 57838000000 0001445305-13-000278 2012 FY 10-K 2013-02-21 CY2010
  61. 28 2009-12-27 2010-12-25 57838000000 0001193125-12-081822 2011 FY 10-K 2012-02-27 NaN
  62. 30 2009-12-27 2010-12-25 57838000000 0001193125-11-040427 2010 FY 10-K 2011-02-18 NaN
  63. 41 2010-12-26 2011-12-31 66504000000 0001445305-13-000278 2012 FY 10-K 2013-02-21 NaN
  64. 42 2010-12-26 2011-12-31 66504000000 0001193125-12-081822 2011 FY 10-K 2012-02-27 NaN
  65. 43 2010-12-26 2011-12-31 66504000000 0000077476-14-000007 2013 FY 10-K 2014-02-14 CY2011
  66. 54 2012-01-01 2012-12-29 65492000000 0001445305-13-000278 2012 FY 10-K 2013-02-21 NaN
  67. 56 2012-01-01 2012-12-29 65492000000 0000077476-15-000012 2014 FY 10-K 2015-02-12 NaN
  68. 57 2012-01-01 2012-12-29 65492000000 0000077476-14-000007 2013 FY 10-K 2014-02-14 NaN
  69. 68 2012-12-30 2013-12-28 66415000000 0000077476-16-000066 2015 FY 10-K 2016-02-11 CY2013
  70. 70 2012-12-30 2013-12-28 66415000000 0000077476-15-000012 2014 FY 10-K 2015-02-12 NaN
  71. 71 2012-12-30 2013-12-28 66415000000 0000077476-14-000007 2013 FY 10-K 2014-02-14 NaN
  72. 82 2013-12-29 2014-12-27 66683000000 0000077476-17-000010 2016 FY 10-K 2017-02-15 CY2014
  73. 83 2013-12-29 2014-12-27 66683000000 0000077476-16-000066 2015 FY 10-K 2016-02-11 NaN
  74. 85 2013-12-29 2014-12-27 66683000000 0000077476-15-000012 2014 FY 10-K 2015-02-12 NaN
  75. 96 2014-12-28 2015-12-26 63056000000 0000077476-18-000012 2017 FY 10-K 2018-02-13 CY2015
  76. 97 2014-12-28 2015-12-26 63056000000 0000077476-17-000010 2016 FY 10-K 2017-02-15 NaN
  77. 98 2014-12-28 2015-12-26 63056000000 0000077476-16-000066 2015 FY 10-K 2016-02-11 NaN
  78. 109 2015-12-27 2016-12-31 62799000000 0000077476-18-000012 2017 FY 10-K 2018-02-13 CY2016
  79. 110 2015-12-27 2016-12-31 62799000000 0000077476-17-000010 2016 FY 10-K 2017-02-15 NaN
  80. 117 2017-01-01 2017-12-30 63525000000 0000077476-18-000012 2017 FY 10-K 2018-02-13 CY2017
  81. So the target is to check and merge the data from all years.
  82. So basically I have to check the 2 tables and create a new table with all CY Years.
  83. For example the first table start from 2016, but the second from 2007. So I need to create a table with `val` and `frame` merged and unique values. The table basically should have all years found from an API.
  84. Example result:
  85. 39474000000 2007
  86. 43232000000 2009
  87. 57838000000 2010
  88. ...
  89. 110855000000 2017
  90. etc etc
  91. There is a thing I have to check if both tables contain data for `the same year`(example `2017`), then I have to check which `val` is bigger and insert this in the new table.
  92. How can I do an iter for loo year by year and check the val value between the two tables?
  93. </details>
  94. # 答案1
  95. **得分**: 1
  96. 如果您想忽略`frame`列中包含`NaN`的行,您可以在应用`max`之前,简单地连接您过滤后的数据框并按`frame`进行分组:
  97. ```python
  98. df1 = df1.dropna(subset='frame')[['val', 'frame']]
  99. df2 = df2.dropna(subset='frame')[['val', 'frame']]
  100. pd.concat([
  101. df1[~df1['frame'].str.contains('Q')],
  102. df2[~df2['frame'].str.contains('Q')]
  103. ]).groupby('frame').max()

输出:

  1. val
  2. frame
  3. CY2007 39474000000
  4. CY2009 43232000000
  5. CY2010 57838000000
  6. CY2011 66504000000
  7. CY2013 66415000000
  8. CY2014 66683000000
  9. CY2015 63056000000
  10. CY2016 90272000000
  11. CY2017 110855000000
  12. CY2018 136819000000
  13. CY2019 161857000000
  14. CY2020 182527000000
  15. CY2021 257637000000
  16. CY2022 282836000000

编辑:不包括frame单元格中包含'Q'的行。

英文:

If you want to ignore rows where frame is NaN, you can simply concat your filtered dfs and group by frame before applying max:

  1. df1 = df1.dropna(subset=&#39;frame&#39;)[[&#39;val&#39;, &#39;frame&#39;]]
  2. df2 = df2.dropna(subset=&#39;frame&#39;)[[&#39;val&#39;, &#39;frame&#39;]]
  3. pd.concat([
  4. df1[~df1[&#39;frame&#39;].str.contains(&#39;Q&#39;)],
  5. df2[~df2[&#39;frame&#39;].str.contains(&#39;Q&#39;)]
  6. ]).groupby(&#39;frame&#39;).max()

Output:

  1. val
  2. frame
  3. CY2007 39474000000
  4. CY2009 43232000000
  5. CY2010 57838000000
  6. CY2011 66504000000
  7. CY2013 66415000000
  8. CY2014 66683000000
  9. CY2015 63056000000
  10. CY2016 90272000000
  11. CY2017 110855000000
  12. CY2018 136819000000
  13. CY2019 161857000000
  14. CY2020 182527000000
  15. CY2021 257637000000
  16. CY2022 282836000000

Edit: excluding frame cells containing &#39;Q&#39;.

huangapple
  • 本文由 发表于 2023年8月10日 20:22:33
  • 转载请务必保留本文链接:https://go.coder-hub.com/76875706.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定