June 12, 2023, 14:33:18

Nasdaq IPO data scraping

Question

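The code appears to fetch data successfully for each month in the specified date range. However, the resulting DataFrame contains NaN values. If you want IPO data to build a model, here are some suggestions:

Data cleaning: First, inspect the scraped data carefully to find out why the NaN values appear. This may be due to a change in the page structure or data format. Try adjusting the code so that it parses and populates the data correctly.
Data preprocessing: For missing data, consider filling the NaN values with the mean, the median, or the preceding or following value to make the data more complete.
Feature engineering: Once the data is complete, you can begin feature engineering, that is, creating the features used to train the model. This may involve converting dates to timestamps, one-hot encoding categorical data, and similar steps.
Model training: Choose a suitable machine learning or deep learning model. The choice depends on whether the task is regression, classification, or another type of problem.
Model evaluation: Evaluate model performance with an appropriate metric, such as mean squared error (MSE) or accuracy.
Model tuning: Based on the evaluation results, adjust the model's hyperparameters or try different algorithms to improve performance.
Prediction and analysis: Once the model is trained, you can use it to predict and analyze IPO data.

Note that the DataFrame in the code must be populated and cleaned correctly before any further analysis or modeling. If you need help resolving the NaN values in the DataFrame, please provide more information about what causes them, and I will try to offer more specific help.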
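
The preprocessing suggestion can be sketched roughly as follows. This is only an illustration: the `sample` DataFrame, its column names, and its values are made up, and filling only helps when some real values are present. A table that is entirely NaN, as in the question, points to a parsing problem instead.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names and values are illustrative only.
sample = pd.DataFrame({
    "price": [5.0, np.nan, 6.0, 12.5],
    "shares": [1_000_000, 2_000_000, np.nan, 4_000_000],
})

# Count missing values per column before deciding how to handle them.
missing_counts = sample.isna().sum()

# Option 1: fill each numeric column with its own median.
filled_median = sample.fillna(sample.median(numeric_only=True))

# Option 2: carry the previous row's value forward.
filled_ffill = sample.ffill()

print(missing_counts)
print(filled_median)
print(filled_ffill)
```

Which option is appropriate depends on the data: a median is a reasonable default for skewed numeric columns, while forward filling only makes sense when rows are ordered meaningfully, for example by date.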

Original question:

I'm trying to use this code to scrape IPO data from the Nasdaq website.

The code runs, but the resulting DataFrame contains only NaN values:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
from time import sleep
from datetime import datetime
# Define dates
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 5, 31)
dates = pd.period_range(start_date, end_date, freq='M')
# Create an empty DataFrame
df = pd.DataFrame(columns=['Company Name', 'Symbol', 'Market', 'Price', 'Shares'])
# Set the URL and headers
url = 'https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=%s'
headers = {'User-Agent': 'non-profit learning project'}
# Scrape IPO data for each date
for idx in dates:
    print(f'Fetching data for {idx}')
    result = requests.get(url % idx, headers=headers)
    sleep(30)
    content = result.content

    if 'There is no data for this month' not in str(content):
        table = pd.read_html(content)[0]
        print(table)
        df = pd.concat([df, table], ignore_index=True)

        soup = BeautifulSoup(content, features="lxml")

        links = soup.find_all('a', id=re.compile('two_column_main_content_rptPricing_company_\d'))
        print(f"Length of table vs length of links: {table.shape[0] - len(links)}")

        for link in links:
            df['Link'].append(link['href'])
# Print the resulting DataFrame
print(df)

This is the result:

Fetching data for 2023-01
   Unnamed: 0  Unnamed: 1
0         NaN         NaN
Length of table vs length of links: 1
Fetching data for 2023-02
   Unnamed: 0  Unnamed: 1
0         NaN         NaN
Length of table vs length of links: 1
Fetching data for 2023-03
   Unnamed: 0  Unnamed: 1
0         NaN         NaN
Length of table vs length of links: 1
Fetching data for 2023-04
   Unnamed: 0  Unnamed: 1
0         NaN         NaN
Length of table vs length of links: 1
Fetching data for 2023-05
   Unnamed: 0  Unnamed: 1
0         NaN         NaN
Length of table vs length of links: 1
  Company Name Symbol Market Price Shares  Unnamed: 0  Unnamed: 1
0          NaN    NaN    NaN   NaN    NaN         NaN         NaN
1          NaN    NaN    NaN   NaN    NaN         NaN         NaN
2          NaN    NaN    NaN   NaN    NaN         NaN         NaN
3          NaN    NaN    NaN   NaN    NaN         NaN         NaN
4          NaN    NaN    NaN   NaN    NaN         NaN         NaN

It seems that the code successfully fetched data for each month within the specified date range. However, there are some issues with the resulting DataFrame, as indicated by the presence of NaN values in the columns.

I want the IPO data to build a model. Any ideas on how this could be achieved? Thanks.

Answer 1

Score: 1

Don't parse the HTML content; use the public API instead:

import pandas as pd
import requests
url = 'https://api.nasdaq.com/api/ipo/calendar'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0'}
start_date = '2023-1-1'
end_date = '2023-5-31'
periods = pd.period_range(start_date, end_date, freq='M')
dfs = []
for period in periods:
    data = requests.get(url, headers=headers, params={'date': period}).json()
    df = pd.json_normalize(data['data']['priced'], 'rows')
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)

Output:

>>> df
            dealID proposedTickerSymbol                             companyName      proposedExchange proposedSharePrice sharesOffered pricedDate dollarValueOfSharesOffered dealStatus
0   1225815-104715                 BREA                      Brera Holdings PLC        NASDAQ Capital               5.00     1,705,000  1/27/2023                 $8,525,000     Priced
1    890697-104848                  TXO               TXO Energy Partners, L.P.                  NYSE              20.00     5,000,000  1/27/2023               $100,000,000     Priced
2    405880-103426                 GNLX                            GENELUX CORP        NASDAQ Capital               6.00     2,500,000  1/26/2023                $15,000,000     Priced
3   1241592-105143                  QSG                    QuantaSing Group Ltd         NASDAQ Global              12.50     3,250,000  1/25/2023                $40,625,000     Priced
4   1225290-104329                 CVKD             Cadrenal Therapeutics, Inc.        NASDAQ Capital               5.00     1,400,000  1/20/2023                 $7,000,000     Priced
..             ...                  ...                                     ...                   ...                ...           ...        ...                        ...        ...
64  1210259-102635                  SGE       Strong Global Entertainment, Inc.              NYSE MKT               4.00     1,000,000  5/16/2023                 $4,000,000     Priced
65  1254469-106197                 SLRN                          ACELYRIN, Inc.  NASDAQ Global Select              18.00    30,000,000  5/05/2023               $540,000,000     Priced
66  1239799-104989                ALCYU  Alchemy Investments Acquisition Corp 1         NASDAQ Global              10.00    10,000,000  5/05/2023               $100,000,000     Priced
67  1243360-105271                 KVUE                             Kenvue Inc.                  NYSE              22.00   172,812,560  5/04/2023             $3,801,876,320     Priced
68  1190851-101486                GODNU            Golden Star Acquisition Corp         NASDAQ Global              10.00     6,000,000  5/02/2023                $60,000,000     Priced
[69 rows x 9 columns]
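
Since the goal is to build a model, the columns in the output above probably need to be converted to numeric and date types first. The sketch below assumes the API returns these fields as strings, which the thousands separators and `$` signs suggest but the output does not confirm. It uses two rows copied from the output in place of the full DataFrame, and `to_number` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Two rows copied from the output above, standing in for the full DataFrame.
df = pd.DataFrame({
    "proposedTickerSymbol": ["BREA", "TXO"],
    "proposedSharePrice": ["5.00", "20.00"],
    "sharesOffered": ["1,705,000", "5,000,000"],
    "pricedDate": ["1/27/2023", "1/27/2023"],
    "dollarValueOfSharesOffered": ["$8,525,000", "$100,000,000"],
})


def to_number(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' and convert to numbers; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


# Build a typed copy; the original string columns in df are left untouched.
typed = df.assign(
    proposedSharePrice=to_number(df["proposedSharePrice"]),
    sharesOffered=to_number(df["sharesOffered"]),
    dollarValueOfSharesOffered=to_number(df["dollarValueOfSharesOffered"]),
    pricedDate=pd.to_datetime(df["pricedDate"], format="%m/%d/%Y"),
)

print(typed.dtypes)
```

Using `errors="coerce"` turns any value that cannot be parsed into NaN rather than raising, so it is worth checking `typed.isna().sum()` afterwards to see whether anything was lost in the conversion.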

