2023年2月19日 07:27:42go评论106阅读模式

英文:

How to join different dataframe with specific criteria?

问题

在我的MySQL数据库stocks中，我有3个不同的表。我想要将所有这些表连接起来以显示我想要看到的精确格式。我应该首先在MySQL中进行连接，还是应该首先将每个表提取为数据框，然后使用pandas进行连接？应该如何完成？我不知道代码。

这是我想要显示的方式：https://www.dropbox.com/s/fc3mll0q3vefm3q/expected%20output%20sample.csv?dl=0

因为这只是一个示例格式，它只显示了两个股票代码。预期的格式应该包括我的数据中的所有股票代码。

因此，每个股票代码都是一个包含来自我的表的所有特定列的行。

附加信息：

我只需要显示最近的8个季度的季度数据和最近的5年的年度数据。
季度数据不同股票的确切日期可能不同。如果手工完成，最近的八个季度可以轻松复制并粘贴到相应的列中，但是我不知道如何使用计算机来确定它属于哪个季度，并在与我的示例输出相同的列中显示它。（我仅使用q1到q8作为列名来显示。因此，如果我的最新数据是5月30日，则q8不一定是第二年的最后一个季度。
如果某个股票代码的最新季度或年度数据不可用（如示例中的"ADUS"），但其他股票代码（如示例中的"BA"）可用，只需将该股票代码的数据保留为空白。

第一张表company_info：https://www.dropbox.com/s/g95tkczviu84pnz/company_info.csv?dl=0 包含公司信息数据。

第二张表income_statement_q：https://www.dropbox.com/s/znf3ljlz4y24x7u/income_statement_q.csv?dl=0 包含季度数据。

第三张表income_statement_y：https://www.dropbox.com/s/zpq79p8lbayqrzn/income_statement_y.csv?dl=0 包含年度数据。

英文:

In my MySQL database stocks, I have 3 different tables. I want to join all of those tables to display the EXACT format that I want to see. Should I join in mysql first, or should I first extract each table as a dataframe and then join with pandas? How should it be done? I don't know the code also.

This is how I want to display: https://www.dropbox.com/s/fc3mll0q3vefm3q/expected%20output%20sample.csv?dl=0

Because this is just an example format, it only shows two tickers. The expected one should include all of the tickers from my data.

So each ticker is a row that contains all of the specific columns from my tables.

Additional info:

I only need the most recent 8 quarters for quarterly and 5 years for yearly to be displayed
The exact date for different tickers for quarterly data may differ. If done by hand, the most recent eight quarters can be easily copied and pasted into the respective columns, but I have no idea how to do it with a computer to determine which quarter it belongs to and show it in the same column as my example output. (I use the terms q1 through q8 simply as column names to display. So, if my most recent data is May 30, q8 is not necessarily the final quarter of the second year.
If the most recent quarter or year for one ticker is not available (as in "ADUS" in the example), but it is available for other tickers such as "BA" in the example, simply leave that one blank.

1st table company_info: https://www.dropbox.com/s/g95tkczviu84pnz/company_info.csv?dl=0 contains company info data:

2nd table income_statement_q: https://www.dropbox.com/s/znf3ljlz4y24x7u/income_statement_q.csv?dl=0 contains quarterly data:

3rd table income_statement_y: https://www.dropbox.com/s/zpq79p8lbayqrzn/income_statement_y.csv?dl=0 contains yearly data:

答案1

得分: 3

以下是翻译好的代码部分：

# 如果需要，将日期列转换为 datetime64
df2['date'] = pd.to_datetime(df2['date'])  # 季度
df3['date'] = pd.to_datetime(df3['date'])  # 年度
# 根据周期重新调整日期：2022-06-30 -> 2022-12-31（年度）
df2['date'] += pd.offsets.QuarterEnd(0)
df3['date'] += pd.offsets.YearEnd(0)
# 获取结束日期
qmax = df2['date'].max()
ymax = df3['date'].max()
# 创建日期范围（Q为8个周期，Y为5个周期）
qdti = pd.date_range(qmax - pd.offsets.QuarterEnd(7), qmax, freq='Q')
ydti = pd.date_range(ymax - pd.offsets.YearEnd(4), ymax, freq='Y')
# 过滤和重塑数据框
qdf = (df2[df2['date'].isin(qdti)]
                      .assign(date=lambda x: x['date'].dt.to_period('Q').astype(str))
                      .pivot(index='ticker', columns='date', values='netIncome'))
ydf = (df3[df3['date'].isin(ydti)]
                      .assign(date=lambda x: x['date'].dt.to_period('Y').astype(str))
                      .pivot(index='ticker', columns='date', values='netIncome'))
# 创建期望的数据框
out = pd.concat([df1.set_index('ticker'), qdf, ydf], axis=1).reset_index()

输出部分：

>>> out
  ticker                                  industry                  sector     pe    roe     shares  ...       2022Q4          2018          2019          2020          2021          2022
0   ADUS          Health Care Providers & Services             Health Care  38.06   7.56   16110400  ...          NaN  1.737700e+07  2.581100e+07  3.313300e+07  4.512600e+07           NaN
1     BA                       Aerospace & Defense             Industrials    NaN   0.00  598240000  ... -663000000.0  1.046000e+10 -6.360000e+08 -1.194100e+10 -4.290000e+09 -5.053000e+09
2    CAH          Health Care Providers & Services             Health Care    NaN   0.00  257639000  ... -130000000.0  2.590000e+08  1.365000e+09 -3.691000e+09  6.120000e+08 -9.320000e+08
3   CVRX          Health Care Equipment & Supplies             Health Care   0.26 -32.50   20633700  ...  -10536000.0           NaN           NaN           NaN -4.307800e+07 -4.142800e+07
4   IMCR                             Biotechnology             Health Care    NaN -22.30   47905000  ...          NaN -7.163000e+07 -1.039310e+08 -7.409300e+07 -1.315230e+08           NaN
5   NVEC  Semiconductors & Semiconductor Equipment  Information Technology  20.09  28.10    4830800  ...    4231324.0  1.391267e+07  1.450794e+07  1.452664e+07  1.169438e+07  1.450750e+07
6   PEPG                             Biotechnology             Health Care    NaN -36.80   23631900  ...          NaN           NaN           NaN -1.889000e+06 -2.728100e+07           NaN
7   VRDN                             Biotechnology             Health Care    NaN -36.80   40248200  ...          NaN -2.210300e+07 -2.877300e+07 -1.279150e+08 -5.501300e+07           NaN
[8 rows x 20 columns]

英文:

You can use:

# Convert as datetime64 if necessary
df2[&#39;date&#39;] = pd.to_datetime(df2[&#39;date&#39;])  # quarterly
df3[&#39;date&#39;] = pd.to_datetime(df3[&#39;date&#39;])  # yearly
# Realign date according period: 2022-06-30 -&gt; 2022-12-31 for yearly
df2[&#39;date&#39;] += pd.offsets.QuarterEnd(0)
df3[&#39;date&#39;] += pd.offsets.YearEnd(0)
# Get end dates
qmax = df2[&#39;date&#39;].max()
ymax = df3[&#39;date&#39;].max()
# Create date range (8 periods for Q, 5 periods for Y)
qdti = pd.date_range( qmax - pd.offsets.QuarterEnd(7), qmax, freq=&#39;Q&#39;)
ydti = pd.date_range( ymax - pd.offsets.YearEnd(4), ymax, freq=&#39;Y&#39;)
# Filter and reshape dataframes
qdf = (df2[df2[&#39;date&#39;].isin(qdti)]
                      .assign(date=lambda x: x[&#39;date&#39;].dt.to_period(&#39;Q&#39;).astype(str))
                      .pivot(index=&#39;ticker&#39;, columns=&#39;date&#39;, values=&#39;netIncome&#39;))
ydf = (df3[df3[&#39;date&#39;].isin(ydti)]
                      .assign(date=lambda x: x[&#39;date&#39;].dt.to_period(&#39;Y&#39;).astype(str))
                      .pivot(index=&#39;ticker&#39;, columns=&#39;date&#39;, values=&#39;netIncome&#39;))
# Create the expected dataframe
out = pd.concat([df1.set_index(&#39;ticker&#39;), qdf, ydf], axis=1).reset_index()

Output:

&gt;&gt;&gt; out
  ticker                                  industry                  sector     pe    roe     shares  ...       2022Q4          2018          2019          2020          2021          2022
0   ADUS          Health Care Providers &amp; Services             Health Care  38.06   7.56   16110400  ...          NaN  1.737700e+07  2.581100e+07  3.313300e+07  4.512600e+07           NaN
1     BA                       Aerospace &amp; Defense             Industrials    NaN   0.00  598240000  ... -663000000.0  1.046000e+10 -6.360000e+08 -1.194100e+10 -4.290000e+09 -5.053000e+09
2    CAH          Health Care Providers &amp; Services             Health Care    NaN   0.00  257639000  ... -130000000.0  2.590000e+08  1.365000e+09 -3.691000e+09  6.120000e+08 -9.320000e+08
3   CVRX          Health Care Equipment &amp; Supplies             Health Care   0.26 -32.50   20633700  ...  -10536000.0           NaN           NaN           NaN -4.307800e+07 -4.142800e+07
4   IMCR                             Biotechnology             Health Care    NaN -22.30   47905000  ...          NaN -7.163000e+07 -1.039310e+08 -7.409300e+07 -1.315230e+08           NaN
5   NVEC  Semiconductors &amp; Semiconductor Equipment  Information Technology  20.09  28.10    4830800  ...    4231324.0  1.391267e+07  1.450794e+07  1.452664e+07  1.169438e+07  1.450750e+07
6   PEPG                             Biotechnology             Health Care    NaN -36.80   23631900  ...          NaN           NaN           NaN -1.889000e+06 -2.728100e+07           NaN
7   VRDN                             Biotechnology             Health Care    NaN -36.80   40248200  ...          NaN -2.210300e+07 -2.877300e+07 -1.279150e+08 -5.501300e+07           NaN
[8 rows x 20 columns]

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何使用特定条件连接不同的数据框？

问题

答案1

从唯一的“ID”中减去“INT”列的“LMP”列，但仅从索引行中减去。

小数精度超过最大精度，尽管小数具有正确的大小和精度。

高效且可扩展的方式处理Python中大型数据集的时间序列分析

将整数转换为给定间隔的二进制的Python函数

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。