如何使用特定条件连接不同的数据框?

huangapple go评论65阅读模式
英文:

How to join different dataframe with specific criteria?

问题

在我的MySQL数据库stocks中,我有3个不同的表。我想要将所有这些表连接起来以显示我想要看到的精确格式。我应该首先在MySQL中进行连接,还是应该首先将每个表提取为数据框,然后使用pandas进行连接?应该如何完成?我不知道代码。

这是我想要显示的方式:https://www.dropbox.com/s/fc3mll0q3vefm3q/expected%20output%20sample.csv?dl=0

因为这只是一个示例格式,它只显示了两个股票代码。预期的格式应该包括我的数据中的所有股票代码。

因此,每个股票代码都是一个包含来自我的表的所有特定列的行。

附加信息:

  • 我只需要显示最近的8个季度的季度数据和最近的5年的年度数据。

  • 季度数据不同股票的确切日期可能不同。如果手工完成,最近的八个季度可以轻松复制并粘贴到相应的列中,但是我不知道如何使用计算机来确定它属于哪个季度,并在与我的示例输出相同的列中显示它。(我仅使用q1到q8作为列名来显示。因此,如果我的最新数据是5月30日,则q8不一定是第二年的最后一个季度。

  • 如果某个股票代码的最新季度或年度数据不可用(如示例中的"ADUS"),但其他股票代码(如示例中的"BA")可用,只需将该股票代码的数据保留为空白。

第一张表company_info:https://www.dropbox.com/s/g95tkczviu84pnz/company_info.csv?dl=0 包含公司信息数据。

第二张表income_statement_q:https://www.dropbox.com/s/znf3ljlz4y24x7u/income_statement_q.csv?dl=0 包含季度数据。

第三张表income_statement_y:https://www.dropbox.com/s/zpq79p8lbayqrzn/income_statement_y.csv?dl=0 包含年度数据。

英文:

In my MySQL database stocks, I have 3 different tables. I want to join all of those tables to display the EXACT format that I want to see. Should I join in mysql first, or should I first extract each table as a dataframe and then join with pandas? How should it be done? I don't know the code also.

This is how I want to display: https://www.dropbox.com/s/fc3mll0q3vefm3q/expected%20output%20sample.csv?dl=0

如何使用特定条件连接不同的数据框?

Because this is just an example format, it only shows two tickers. The expected one should include all of the tickers from my data.

So each ticker is a row that contains all of the specific columns from my tables.

Additional info:

  • I only need the most recent 8 quarters for quarterly and 5 years for yearly to be displayed

  • The exact date for different tickers for quarterly data may differ. If done by hand, the most recent eight quarters can be easily copied and pasted into the respective columns, but I have no idea how to do it with a computer to determine which quarter it belongs to and show it in the same column as my example output. (I use the terms q1 through q8 simply as column names to display. So, if my most recent data is May 30, q8 is not necessarily the final quarter of the second year.

  • If the most recent quarter or year for one ticker is not available (as in "ADUS" in the example), but it is available for other tickers such as "BA" in the example, simply leave that one blank.

1st table company_info: https://www.dropbox.com/s/g95tkczviu84pnz/company_info.csv?dl=0 contains company info data:

如何使用特定条件连接不同的数据框?

2nd table income_statement_q: https://www.dropbox.com/s/znf3ljlz4y24x7u/income_statement_q.csv?dl=0 contains quarterly data:

如何使用特定条件连接不同的数据框?

3rd table income_statement_y: https://www.dropbox.com/s/zpq79p8lbayqrzn/income_statement_y.csv?dl=0 contains yearly data:

如何使用特定条件连接不同的数据框?

答案1

得分: 3

以下是翻译好的代码部分:

# 如果需要,将日期列转换为 datetime64
df2['date'] = pd.to_datetime(df2['date'])  # 季度
df3['date'] = pd.to_datetime(df3['date'])  # 年度

# 根据周期重新调整日期:2022-06-30 -> 2022-12-31(年度)
df2['date'] += pd.offsets.QuarterEnd(0)
df3['date'] += pd.offsets.YearEnd(0)

# 获取结束日期
qmax = df2['date'].max()
ymax = df3['date'].max()

# 创建日期范围(Q为8个周期,Y为5个周期)
qdti = pd.date_range(qmax - pd.offsets.QuarterEnd(7), qmax, freq='Q')
ydti = pd.date_range(ymax - pd.offsets.YearEnd(4), ymax, freq='Y')

# 过滤和重塑数据框
qdf = (df2[df2['date'].isin(qdti)]
                      .assign(date=lambda x: x['date'].dt.to_period('Q').astype(str))
                      .pivot(index='ticker', columns='date', values='netIncome'))

ydf = (df3[df3['date'].isin(ydti)]
                      .assign(date=lambda x: x['date'].dt.to_period('Y').astype(str))
                      .pivot(index='ticker', columns='date', values='netIncome'))

# 创建期望的数据框
out = pd.concat([df1.set_index('ticker'), qdf, ydf], axis=1).reset_index()

输出部分:

>>> out
  ticker                                  industry                  sector     pe    roe     shares  ...       2022Q4          2018          2019          2020          2021          2022
0   ADUS          Health Care Providers & Services             Health Care  38.06   7.56   16110400  ...          NaN  1.737700e+07  2.581100e+07  3.313300e+07  4.512600e+07           NaN
1     BA                       Aerospace & Defense             Industrials    NaN   0.00  598240000  ... -663000000.0  1.046000e+10 -6.360000e+08 -1.194100e+10 -4.290000e+09 -5.053000e+09
2    CAH          Health Care Providers & Services             Health Care    NaN   0.00  257639000  ... -130000000.0  2.590000e+08  1.365000e+09 -3.691000e+09  6.120000e+08 -9.320000e+08
3   CVRX          Health Care Equipment & Supplies             Health Care   0.26 -32.50   20633700  ...  -10536000.0           NaN           NaN           NaN -4.307800e+07 -4.142800e+07
4   IMCR                             Biotechnology             Health Care    NaN -22.30   47905000  ...          NaN -7.163000e+07 -1.039310e+08 -7.409300e+07 -1.315230e+08           NaN
5   NVEC  Semiconductors & Semiconductor Equipment  Information Technology  20.09  28.10    4830800  ...    4231324.0  1.391267e+07  1.450794e+07  1.452664e+07  1.169438e+07  1.450750e+07
6   PEPG                             Biotechnology             Health Care    NaN -36.80   23631900  ...          NaN           NaN           NaN -1.889000e+06 -2.728100e+07           NaN
7   VRDN                             Biotechnology             Health Care    NaN -36.80   40248200  ...          NaN -2.210300e+07 -2.877300e+07 -1.279150e+08 -5.501300e+07           NaN

[8 rows x 20 columns]
英文:

You can use:

# Convert as datetime64 if necessary
df2['date'] = pd.to_datetime(df2['date'])  # quarterly
df3['date'] = pd.to_datetime(df3['date'])  # yearly

# Realign date according period: 2022-06-30 -> 2022-12-31 for yearly
df2['date'] += pd.offsets.QuarterEnd(0)
df3['date'] += pd.offsets.YearEnd(0)

# Get end dates
qmax = df2['date'].max()
ymax = df3['date'].max()

# Create date range (8 periods for Q, 5 periods for Y)
qdti = pd.date_range( qmax - pd.offsets.QuarterEnd(7), qmax, freq='Q')
ydti = pd.date_range( ymax - pd.offsets.YearEnd(4), ymax, freq='Y')

# Filter and reshape dataframes
qdf = (df2[df2['date'].isin(qdti)]
                      .assign(date=lambda x: x['date'].dt.to_period('Q').astype(str))
                      .pivot(index='ticker', columns='date', values='netIncome'))

ydf = (df3[df3['date'].isin(ydti)]
                      .assign(date=lambda x: x['date'].dt.to_period('Y').astype(str))
                      .pivot(index='ticker', columns='date', values='netIncome'))

# Create the expected dataframe
out = pd.concat([df1.set_index('ticker'), qdf, ydf], axis=1).reset_index()

Output:

>>> out
  ticker                                  industry                  sector     pe    roe     shares  ...       2022Q4          2018          2019          2020          2021          2022
0   ADUS          Health Care Providers & Services             Health Care  38.06   7.56   16110400  ...          NaN  1.737700e+07  2.581100e+07  3.313300e+07  4.512600e+07           NaN
1     BA                       Aerospace & Defense             Industrials    NaN   0.00  598240000  ... -663000000.0  1.046000e+10 -6.360000e+08 -1.194100e+10 -4.290000e+09 -5.053000e+09
2    CAH          Health Care Providers & Services             Health Care    NaN   0.00  257639000  ... -130000000.0  2.590000e+08  1.365000e+09 -3.691000e+09  6.120000e+08 -9.320000e+08
3   CVRX          Health Care Equipment & Supplies             Health Care   0.26 -32.50   20633700  ...  -10536000.0           NaN           NaN           NaN -4.307800e+07 -4.142800e+07
4   IMCR                             Biotechnology             Health Care    NaN -22.30   47905000  ...          NaN -7.163000e+07 -1.039310e+08 -7.409300e+07 -1.315230e+08           NaN
5   NVEC  Semiconductors & Semiconductor Equipment  Information Technology  20.09  28.10    4830800  ...    4231324.0  1.391267e+07  1.450794e+07  1.452664e+07  1.169438e+07  1.450750e+07
6   PEPG                             Biotechnology             Health Care    NaN -36.80   23631900  ...          NaN           NaN           NaN -1.889000e+06 -2.728100e+07           NaN
7   VRDN                             Biotechnology             Health Care    NaN -36.80   40248200  ...          NaN -2.210300e+07 -2.877300e+07 -1.279150e+08 -5.501300e+07           NaN

[8 rows x 20 columns]

huangapple
  • 本文由 发表于 2023年2月19日 07:27:42
  • 转载请务必保留本文链接:https://go.coder-hub.com/75497033.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定