英文:
Nasdaq IPO data scraping
问题
这段代码似乎成功地获取了指定日期范围内每个月的数据。然而,DataFrame 中存在NaN值的问题。您想要获取IPO的数据以构建一个模型,以下是一些建议:
-
数据清洗:首先,您需要仔细检查爬取的数据,查看为什么会出现NaN值。这可能是因为网页结构或数据格式的变化。您可以尝试调整代码以正确解析和填充数据。
-
数据预处理:对于缺失的数据,您可以考虑使用填充方法,如均值、中位数或前后值来填充NaN值,以使数据更完整。
-
特征工程:一旦您的数据完整,您可以开始进行特征工程,即创建用于模型训练的特征。这可能涉及将日期转换为时间戳、对类别数据进行独热编码等操作。
-
模型训练:选择合适的机器学习或深度学习模型来训练。这将取决于您的任务,是回归、分类还是其他类型的问题。
-
模型评估:使用适当的评估指标来评估模型性能,例如均方误差(MSE)或准确度。
-
模型优化:根据评估结果,可以调整模型的超参数或尝试不同的算法以提高性能。
-
预测和分析:一旦训练好模型,您可以用它来进行IPO数据的预测和分析。
请注意,代码中的DataFrame需要正确填充和清理,以便进行后续的分析和建模。如果您需要帮助来解决DataFrame中的NaN值问题,请提供更多有关NaN值产生原因的信息,我将尽力提供更具体的帮助。
英文:
I'm trying to use this code to scrape IPOs data from nasdaq webpage.
The code can scrap, but the result in my DataFrame is NaN
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
from time import sleep
from datetime import datetime
# Define dates
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 5, 31)
dates = pd.period_range(start_date, end_date, freq='M')
# Create an empty DataFrame
df = pd.DataFrame(columns=['Company Name', 'Symbol', 'Market', 'Price', 'Shares'])
# Set the URL and headers
url = 'https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=%s'
headers = {'User-Agent': 'non-profit learning project'}
# Scrape IPO data for each date
for idx in dates:
print(f'Fetching data for {idx}')
result = requests.get(url % idx, headers=headers)
sleep(30)
content = result.content
if 'There is no data for this month' not in str(content):
table = pd.read_html(content)[0]
print(table)
df = pd.concat([df, table], ignore_index=True)
soup = BeautifulSoup(content, features="lxml")
links = soup.find_all('a', id=re.compile('two_column_main_content_rptPricing_company_\d'))
print(f"Length of table vs length of links: {table.shape[0] - len(links)}")
for link in links:
df['Link'].append(link['href'])
# Print the resulting DataFrame
print(df)
##This is the result:
Fetching data for 2023-01
Unnamed: 0 Unnamed: 1
0 NaN NaN
Length of table vs length of links: 1
Fetching data for 2023-02
Unnamed: 0 Unnamed: 1
0 NaN NaN
Length of table vs length of links: 1
Fetching data for 2023-03
Unnamed: 0 Unnamed: 1
0 NaN NaN
Length of table vs length of links: 1
Fetching data for 2023-04
Unnamed: 0 Unnamed: 1
0 NaN NaN
Length of table vs length of links: 1
Fetching data for 2023-05
Unnamed: 0 Unnamed: 1
0 NaN NaN
Length of table vs length of links: 1
Company Name Symbol Market Price Shares Unnamed: 0 Unnamed: 1
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
It seems that the code successfully fetched data for each month within the specified date range. However, there are some issues with the resulting DataFrame, as indicated by the presence of NaN values in the columns.
I want the data of the IPOs to make a model, any ideas how could that be achieved? Thanks
答案1
得分: 1
不解析HTML内容,而使用公共API的部分代码:
import pandas as pd
import requests
url = 'https://api.nasdaq.com/api/ipo/calendar'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0'}
start_date = '2023-1-1'
end_date = '2023-5-31'
periods = pd.period_range(start_date, end_date, freq='M')
dfs = []
for period in periods:
data = requests.get(url, headers=headers, params={'date': period}).json()
df = pd.json_normalize(data['data']['priced'], 'rows')
dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
输出:
>>> df
dealID proposedTickerSymbol companyName proposedExchange proposedSharePrice sharesOffered pricedDate dollarValueOfSharesOffered dealStatus
0 1225815-104715 BREA Brera Holdings PLC NASDAQ Capital 5.00 1,705,000 1/27/2023 $8,525,000 Priced
1 890697-104848 TXO TXO Energy Partners, L.P. NYSE 20.00 5,000,000 1/27/2023 $100,000,000 Priced
2 405880-103426 GNLX GENELUX CORP NASDAQ Capital 6.00 2,500,000 1/26/2023 $15,000,000 Priced
3 1241592-105143 QSG QuantaSing Group Ltd NASDAQ Global 12.50 3,250,000 1/25/2023 $40,625,000 Priced
4 1225290-104329 CVKD Cadrenal Therapeutics, Inc. NASDAQ Capital 5.00 1,400,000 1/20/2023 $7,000,000 Priced
.. ... ... ... ... ... ... ... ... ...
64 1210259-102635 SGE Strong Global Entertainment, Inc. NYSE MKT 4.00 1,000,000 5/16/2023 $4,000,000 Priced
65 1254469-106197 SLRN ACELYRIN, Inc. NASDAQ Global Select 18.00 30,000,000 5/05/2023 $540,000,000 Priced
66 1239799-104989 ALCYU Alchemy Investments Acquisition Corp 1 NASDAQ Global 10.00 10,000,000 5/05/2023 $100,000,000 Priced
67 1243360-105271 KVUE Kenvue Inc. NYSE 22.00 172,812,560 5/04/2023 $3,801,876,320 Priced
68 1190851-101486 GODNU Golden Star Acquisition Corp NASDAQ Global 10.00 6,000,000 5/02/2023 $60,000,000 Priced
[69 rows x 9 columns]
英文:
Don't parse HTML content but use the public API:
import pandas as pd
import requests
url = 'https://api.nasdaq.com/api/ipo/calendar'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0'}
start_date = '2023-1-1'
end_date = '2023-5-31'
periods = pd.period_range(start_date, end_date, freq='M')
dfs = []
for period in periods:
data = requests.get(url, headers=headers, params={'date': period}).json()
df = pd.json_normalize(data['data']['priced'], 'rows')
dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
Output:
>>> df
dealID proposedTickerSymbol companyName proposedExchange proposedSharePrice sharesOffered pricedDate dollarValueOfSharesOffered dealStatus
0 1225815-104715 BREA Brera Holdings PLC NASDAQ Capital 5.00 1,705,000 1/27/2023 $8,525,000 Priced
1 890697-104848 TXO TXO Energy Partners, L.P. NYSE 20.00 5,000,000 1/27/2023 $100,000,000 Priced
2 405880-103426 GNLX GENELUX CORP NASDAQ Capital 6.00 2,500,000 1/26/2023 $15,000,000 Priced
3 1241592-105143 QSG QuantaSing Group Ltd NASDAQ Global 12.50 3,250,000 1/25/2023 $40,625,000 Priced
4 1225290-104329 CVKD Cadrenal Therapeutics, Inc. NASDAQ Capital 5.00 1,400,000 1/20/2023 $7,000,000 Priced
.. ... ... ... ... ... ... ... ... ...
64 1210259-102635 SGE Strong Global Entertainment, Inc. NYSE MKT 4.00 1,000,000 5/16/2023 $4,000,000 Priced
65 1254469-106197 SLRN ACELYRIN, Inc. NASDAQ Global Select 18.00 30,000,000 5/05/2023 $540,000,000 Priced
66 1239799-104989 ALCYU Alchemy Investments Acquisition Corp 1 NASDAQ Global 10.00 10,000,000 5/05/2023 $100,000,000 Priced
67 1243360-105271 KVUE Kenvue Inc. NYSE 22.00 172,812,560 5/04/2023 $3,801,876,320 Priced
68 1190851-101486 GODNU Golden Star Acquisition Corp NASDAQ Global 10.00 6,000,000 5/02/2023 $60,000,000 Priced
[69 rows x 9 columns]
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论