Nasdaq IPO data scraping

Question

I'm trying to use this code to scrape IPO data from the Nasdaq website.

The code runs and scrapes the pages, but the resulting DataFrame contains only NaN values:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    import re
    from time import sleep
    from datetime import datetime

    # Define dates
    start_date = datetime(2023, 1, 1)
    end_date = datetime(2023, 5, 31)
    dates = pd.period_range(start_date, end_date, freq='M')

    # Create an empty DataFrame
    df = pd.DataFrame(columns=['Company Name', 'Symbol', 'Market', 'Price', 'Shares'])

    # Set the URL and headers
    url = 'https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=%s'
    headers = {'User-Agent': 'non-profit learning project'}

    # Scrape IPO data for each date
    for idx in dates:
        print(f'Fetching data for {idx}')
        result = requests.get(url % idx, headers=headers)
        sleep(30)
        content = result.content
        if 'There is no data for this month' not in str(content):
            table = pd.read_html(content)[0]
            print(table)
            df = pd.concat([df, table], ignore_index=True)
            soup = BeautifulSoup(content, features="lxml")
            links = soup.find_all('a', id=re.compile('two_column_main_content_rptPricing_company_\d'))
            print(f"Length of table vs length of links: {table.shape[0] - len(links)}")
            for link in links:
                df['Link'].append(link['href'])

    # Print the resulting DataFrame
    print(df)

This is the result:

    Fetching data for 2023-01
       Unnamed: 0  Unnamed: 1
    0         NaN         NaN
    Length of table vs length of links: 1
    Fetching data for 2023-02
       Unnamed: 0  Unnamed: 1
    0         NaN         NaN
    Length of table vs length of links: 1
    Fetching data for 2023-03
       Unnamed: 0  Unnamed: 1
    0         NaN         NaN
    Length of table vs length of links: 1
    Fetching data for 2023-04
       Unnamed: 0  Unnamed: 1
    0         NaN         NaN
    Length of table vs length of links: 1
    Fetching data for 2023-05
       Unnamed: 0  Unnamed: 1
    0         NaN         NaN
    Length of table vs length of links: 1
      Company Name Symbol Market Price Shares  Unnamed: 0  Unnamed: 1
    0          NaN    NaN    NaN   NaN    NaN         NaN         NaN
    1          NaN    NaN    NaN   NaN    NaN         NaN         NaN
    2          NaN    NaN    NaN   NaN    NaN         NaN         NaN
    3          NaN    NaN    NaN   NaN    NaN         NaN         NaN
    4          NaN    NaN    NaN   NaN    NaN         NaN         NaN

The code fetches a page for every month in the date range, but each parsed table contains a single row of NaN under unnamed columns, so the final DataFrame is all NaN as well.

I want the IPO data to build a model. Any ideas on how to get it? Thanks.
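The all-NaN frame in the question can be reproduced without any network access, which helps separate the pandas behaviour from the scraping problem. The sketch below is not the asker's code: `placeholder` is a hypothetical stand-in for the table that `pd.read_html` printed each month (two unnamed columns, one NaN row). Why the page yields only that placeholder is not stated in the thread; a plausible but unconfirmed explanation is that the real table is filled in by JavaScript after the page loads.

```python
import pandas as pd

# Empty frame with the column names the question expects.
df = pd.DataFrame(columns=["Company Name", "Symbol", "Market", "Price", "Shares"])

# Hypothetical stand-in for the table parsed each month in the question:
# two unnamed columns and one all-NaN row (copied from the printed output).
placeholder = pd.DataFrame({"Unnamed: 0": [float("nan")], "Unnamed: 1": [float("nan")]})

for _ in range(5):  # five months, January to May 2023
    # concat aligns on column names: names that do not match are appended
    # as new columns, and every cell without a value becomes NaN.
    df = pd.concat([df, placeholder], ignore_index=True)

print(df.shape)
print(df.isna().all().all())
```

The result has five rows and seven columns, all NaN, which matches the frame printed in the question. In other words, `pd.concat` did what it was asked to do; the parsed tables simply never contained IPO rows.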

Answer 1

Score: 1

Don't parse the HTML content; use the public API instead:

    import pandas as pd
    import requests

    url = 'https://api.nasdaq.com/api/ipo/calendar'
    # Send a browser-like User-Agent with each request
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0'}

    start_date = '2023-1-1'
    end_date = '2023-5-31'
    # One period per month: 2023-01 through 2023-05
    periods = pd.period_range(start_date, end_date, freq='M')

    dfs = []
    for period in periods:
        # Request one month of the IPO calendar, e.g. ?date=2023-01
        data = requests.get(url, headers=headers, params={'date': str(period)}).json()
        # Flatten the rows of the "priced" section into a DataFrame
        df = pd.json_normalize(data['data']['priced'], 'rows')
        dfs.append(df)

    # Combine all months into a single DataFrame
    df = pd.concat(dfs, ignore_index=True)

Output:

    >>> df
                dealID  proposedTickerSymbol                             companyName      proposedExchange  proposedSharePrice  sharesOffered  pricedDate  dollarValueOfSharesOffered  dealStatus
    0   1225815-104715                  BREA                      Brera Holdings PLC        NASDAQ Capital                5.00      1,705,000   1/27/2023                  $8,525,000      Priced
    1    890697-104848                   TXO               TXO Energy Partners, L.P.                  NYSE               20.00      5,000,000   1/27/2023                $100,000,000      Priced
    2    405880-103426                  GNLX                            GENELUX CORP        NASDAQ Capital                6.00      2,500,000   1/26/2023                 $15,000,000      Priced
    3   1241592-105143                   QSG                    QuantaSing Group Ltd         NASDAQ Global               12.50      3,250,000   1/25/2023                 $40,625,000      Priced
    4   1225290-104329                  CVKD             Cadrenal Therapeutics, Inc.        NASDAQ Capital                5.00      1,400,000   1/20/2023                  $7,000,000      Priced
    ..             ...                   ...                                     ...                   ...                 ...            ...         ...                         ...         ...
    64  1210259-102635                   SGE       Strong Global Entertainment, Inc.              NYSE MKT                4.00      1,000,000   5/16/2023                  $4,000,000      Priced
    65  1254469-106197                  SLRN                          ACELYRIN, Inc.  NASDAQ Global Select               18.00     30,000,000   5/05/2023                $540,000,000      Priced
    66  1239799-104989                 ALCYU  Alchemy Investments Acquisition Corp 1         NASDAQ Global               10.00     10,000,000   5/05/2023                $100,000,000      Priced
    67  1243360-105271                  KVUE                             Kenvue Inc.                  NYSE               22.00    172,812,560   5/04/2023              $3,801,876,320      Priced
    68  1190851-101486                 GODNU            Golden Star Acquisition Corp         NASDAQ Global               10.00      6,000,000   5/02/2023                 $60,000,000      Priced

    [69 rows x 9 columns]
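Since the goal is to build a model, the columns above will probably need type conversion first. The values are printed with thousands separators and dollar signs, which suggests they arrive as text, though the thread does not confirm the dtypes. The following is a minimal sketch under that assumption; `to_number` and `clean_ipo_frame` are hypothetical helper names, and `sample` holds two rows copied from the output above so the example runs offline.

```python
import pandas as pd

# Two rows copied from the output above, kept as strings on the
# assumption that the API returns formatted text rather than numbers.
sample = pd.DataFrame({
    "proposedTickerSymbol": ["BREA", "KVUE"],
    "proposedSharePrice": ["5.00", "22.00"],
    "sharesOffered": ["1,705,000", "172,812,560"],
    "pricedDate": ["1/27/2023", "5/04/2023"],
    "dollarValueOfSharesOffered": ["$8,525,000", "$3,801,876,320"],
})


def to_number(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' then convert to numbers; unparseable cells become NaN."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


def clean_ipo_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric and datetime columns (hypothetical helper)."""
    out = df.copy()
    for col in ("proposedSharePrice", "sharesOffered", "dollarValueOfSharesOffered"):
        out[col] = to_number(out[col])
    # Dates are printed as month/day/year in the output above.
    out["pricedDate"] = pd.to_datetime(out["pricedDate"], format="%m/%d/%Y")
    return out


clean = clean_ipo_frame(sample)
print(clean.dtypes)
```

Applied to the full frame from the answer, this would be `clean_ipo_frame(df)`. Using `errors="coerce"` turns any cell that does not parse into NaN instead of raising, so it is worth checking `clean.isna().sum()` afterwards to see how many values were lost.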

huangapple
  • Posted on June 12, 2023, 14:33:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76454096.html