英文:
How to join different dataframe with specific criteria?
问题
在我的MySQL数据库stocks
中,我有3个不同的表。我想要将所有这些表连接起来以显示我想要看到的精确格式。我应该首先在MySQL中进行连接,还是应该首先将每个表提取为数据框,然后使用pandas进行连接?应该如何完成?我不知道代码。
这是我想要显示的方式:https://www.dropbox.com/s/fc3mll0q3vefm3q/expected%20output%20sample.csv?dl=0
因为这只是一个示例格式,它只显示了两个股票代码。预期的格式应该包括我的数据中的所有股票代码。
因此,每个股票代码都是一个包含来自我的表的所有特定列的行。
附加信息:
-
我只需要显示最近的8个季度的季度数据和最近的5年的年度数据。
-
季度数据不同股票的确切日期可能不同。如果手工完成,最近的八个季度可以轻松复制并粘贴到相应的列中,但是我不知道如何使用计算机来确定它属于哪个季度,并在与我的示例输出相同的列中显示它。(我仅使用q1到q8作为列名来显示。因此,如果我的最新数据是5月30日,则q8不一定是第二年的最后一个季度。
-
如果某个股票代码的最新季度或年度数据不可用(如示例中的"ADUS"),但其他股票代码(如示例中的"BA")可用,只需将该股票代码的数据保留为空白。
第一张表company_info
:https://www.dropbox.com/s/g95tkczviu84pnz/company_info.csv?dl=0 包含公司信息数据。
第二张表income_statement_q
:https://www.dropbox.com/s/znf3ljlz4y24x7u/income_statement_q.csv?dl=0 包含季度数据。
第三张表income_statement_y
:https://www.dropbox.com/s/zpq79p8lbayqrzn/income_statement_y.csv?dl=0 包含年度数据。
英文:
In my MySQL database stocks
, I have 3 different tables. I want to join all of those tables to display the EXACT format that I want to see. Should I join in mysql first, or should I first extract each table as a dataframe and then join with pandas? How should it be done? I don't know the code also.
This is how I want to display: https://www.dropbox.com/s/fc3mll0q3vefm3q/expected%20output%20sample.csv?dl=0
Because this is just an example format, it only shows two tickers. The expected one should include all of the tickers from my data.
So each ticker is a row that contains all of the specific columns from my tables.
Additional info:
-
I only need the most recent 8 quarters for quarterly and 5 years for yearly to be displayed
-
The exact date for different tickers for quarterly data may differ. If done by hand, the most recent eight quarters can be easily copied and pasted into the respective columns, but I have no idea how to do it with a computer to determine which quarter it belongs to and show it in the same column as my example output. (I use the terms q1 through q8 simply as column names to display. So, if my most recent data is May 30, q8 is not necessarily the final quarter of the second year.
-
If the most recent quarter or year for one ticker is not available (as in "ADUS" in the example), but it is available for other tickers such as "BA" in the example, simply leave that one blank.
1st table company_info
: https://www.dropbox.com/s/g95tkczviu84pnz/company_info.csv?dl=0 contains company info data:
2nd table income_statement_q
: https://www.dropbox.com/s/znf3ljlz4y24x7u/income_statement_q.csv?dl=0 contains quarterly data:
3rd table income_statement_y
: https://www.dropbox.com/s/zpq79p8lbayqrzn/income_statement_y.csv?dl=0 contains yearly data:
答案1
得分: 3
以下是翻译好的代码部分:
# 如果需要,将日期列转换为 datetime64
df2['date'] = pd.to_datetime(df2['date']) # 季度
df3['date'] = pd.to_datetime(df3['date']) # 年度
# 根据周期重新调整日期:2022-06-30 -> 2022-12-31(年度)
df2['date'] += pd.offsets.QuarterEnd(0)
df3['date'] += pd.offsets.YearEnd(0)
# 获取结束日期
qmax = df2['date'].max()
ymax = df3['date'].max()
# 创建日期范围(Q为8个周期,Y为5个周期)
qdti = pd.date_range(qmax - pd.offsets.QuarterEnd(7), qmax, freq='Q')
ydti = pd.date_range(ymax - pd.offsets.YearEnd(4), ymax, freq='Y')
# 过滤和重塑数据框
qdf = (df2[df2['date'].isin(qdti)]
.assign(date=lambda x: x['date'].dt.to_period('Q').astype(str))
.pivot(index='ticker', columns='date', values='netIncome'))
ydf = (df3[df3['date'].isin(ydti)]
.assign(date=lambda x: x['date'].dt.to_period('Y').astype(str))
.pivot(index='ticker', columns='date', values='netIncome'))
# 创建期望的数据框
out = pd.concat([df1.set_index('ticker'), qdf, ydf], axis=1).reset_index()
输出部分:
>>> out
ticker industry sector pe roe shares ... 2022Q4 2018 2019 2020 2021 2022
0 ADUS Health Care Providers & Services Health Care 38.06 7.56 16110400 ... NaN 1.737700e+07 2.581100e+07 3.313300e+07 4.512600e+07 NaN
1 BA Aerospace & Defense Industrials NaN 0.00 598240000 ... -663000000.0 1.046000e+10 -6.360000e+08 -1.194100e+10 -4.290000e+09 -5.053000e+09
2 CAH Health Care Providers & Services Health Care NaN 0.00 257639000 ... -130000000.0 2.590000e+08 1.365000e+09 -3.691000e+09 6.120000e+08 -9.320000e+08
3 CVRX Health Care Equipment & Supplies Health Care 0.26 -32.50 20633700 ... -10536000.0 NaN NaN NaN -4.307800e+07 -4.142800e+07
4 IMCR Biotechnology Health Care NaN -22.30 47905000 ... NaN -7.163000e+07 -1.039310e+08 -7.409300e+07 -1.315230e+08 NaN
5 NVEC Semiconductors & Semiconductor Equipment Information Technology 20.09 28.10 4830800 ... 4231324.0 1.391267e+07 1.450794e+07 1.452664e+07 1.169438e+07 1.450750e+07
6 PEPG Biotechnology Health Care NaN -36.80 23631900 ... NaN NaN NaN -1.889000e+06 -2.728100e+07 NaN
7 VRDN Biotechnology Health Care NaN -36.80 40248200 ... NaN -2.210300e+07 -2.877300e+07 -1.279150e+08 -5.501300e+07 NaN
[8 rows x 20 columns]
英文:
You can use:
# Convert as datetime64 if necessary
df2['date'] = pd.to_datetime(df2['date']) # quarterly
df3['date'] = pd.to_datetime(df3['date']) # yearly
# Realign date according period: 2022-06-30 -> 2022-12-31 for yearly
df2['date'] += pd.offsets.QuarterEnd(0)
df3['date'] += pd.offsets.YearEnd(0)
# Get end dates
qmax = df2['date'].max()
ymax = df3['date'].max()
# Create date range (8 periods for Q, 5 periods for Y)
qdti = pd.date_range( qmax - pd.offsets.QuarterEnd(7), qmax, freq='Q')
ydti = pd.date_range( ymax - pd.offsets.YearEnd(4), ymax, freq='Y')
# Filter and reshape dataframes
qdf = (df2[df2['date'].isin(qdti)]
.assign(date=lambda x: x['date'].dt.to_period('Q').astype(str))
.pivot(index='ticker', columns='date', values='netIncome'))
ydf = (df3[df3['date'].isin(ydti)]
.assign(date=lambda x: x['date'].dt.to_period('Y').astype(str))
.pivot(index='ticker', columns='date', values='netIncome'))
# Create the expected dataframe
out = pd.concat([df1.set_index('ticker'), qdf, ydf], axis=1).reset_index()
Output:
>>> out
ticker industry sector pe roe shares ... 2022Q4 2018 2019 2020 2021 2022
0 ADUS Health Care Providers & Services Health Care 38.06 7.56 16110400 ... NaN 1.737700e+07 2.581100e+07 3.313300e+07 4.512600e+07 NaN
1 BA Aerospace & Defense Industrials NaN 0.00 598240000 ... -663000000.0 1.046000e+10 -6.360000e+08 -1.194100e+10 -4.290000e+09 -5.053000e+09
2 CAH Health Care Providers & Services Health Care NaN 0.00 257639000 ... -130000000.0 2.590000e+08 1.365000e+09 -3.691000e+09 6.120000e+08 -9.320000e+08
3 CVRX Health Care Equipment & Supplies Health Care 0.26 -32.50 20633700 ... -10536000.0 NaN NaN NaN -4.307800e+07 -4.142800e+07
4 IMCR Biotechnology Health Care NaN -22.30 47905000 ... NaN -7.163000e+07 -1.039310e+08 -7.409300e+07 -1.315230e+08 NaN
5 NVEC Semiconductors & Semiconductor Equipment Information Technology 20.09 28.10 4830800 ... 4231324.0 1.391267e+07 1.450794e+07 1.452664e+07 1.169438e+07 1.450750e+07
6 PEPG Biotechnology Health Care NaN -36.80 23631900 ... NaN NaN NaN -1.889000e+06 -2.728100e+07 NaN
7 VRDN Biotechnology Health Care NaN -36.80 40248200 ... NaN -2.210300e+07 -2.877300e+07 -1.279150e+08 -5.501300e+07 NaN
[8 rows x 20 columns]
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论