Web scraping SEC filings


Question

I am working on web scraping 10Q documents from SEC edgar.

This is the url link: https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm

If you inspect the page, you can find the element: [screenshot of the inspected element]

I need to extract "1600 Amphitheatre Parkway" without using the id. Below is a code snippet that extracts the text using the id attribute. However, I need to use the name attribute.

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
page = session.get('https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm')
soup = BeautifulSoup(page.content, 'html.parser')

content = soup.find(id="d92517213e644-wk-Fact-0B11263160365DBABCF89969352EE602")
print(content.text)

Instead of the id attribute, I would like to use the name attribute. However, I am not able to extract the information using the name attribute. Please help.

See the HTML information: [screenshot of the element's HTML]

How can I use the name attribute instead of the id attribute to extract the contents?

Thanks

Answer 1

Score: 1

You can find elements based on attribute values like this:

soup.find('html_tag', {"attribute": "value"})

In your case, the name attribute is on the ix:nonnumeric tag:

content = soup.find('ix:nonnumeric', {"name": "dei:EntityAddressAddressLine1"})
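
To try this without hitting the SEC site, here is a minimal sketch that runs the same lookup against a small inline fragment. The fragment, its id values, and the contextRef attribute are made up for illustration. Only the tag name and the name value come from the answer above. It also shows one likely reason an attempt such as soup.find(name=...) returns nothing: in BeautifulSoup, the name argument of find() filters on the tag name, not on the name attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on an inline XBRL filing; the ids and the
# surrounding markup are invented for this example.
SAMPLE_HTML = """
<div>
  <ix:nonNumeric id="fact-1" name="dei:EntityAddressAddressLine1"
                 contextRef="c1">1600 Amphitheatre Parkway</ix:nonNumeric>
  <ix:nonNumeric id="fact-2" name="dei:EntityAddressCityOrTown"
                 contextRef="c1">Mountain View</ix:nonNumeric>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# html.parser lowercases tag names, so the tag is matched as 'ix:nonnumeric'.
by_tag_and_attr = soup.find("ix:nonnumeric", {"name": "dei:EntityAddressAddressLine1"})

# Match on the attribute alone, without naming the tag.
by_attr_only = soup.find(attrs={"name": "dei:EntityAddressAddressLine1"})

# find(name=...) filters on the TAG name, so no element matches here.
wrong = soup.find(name="dei:EntityAddressAddressLine1")

address = by_tag_and_attr.get_text(strip=True)
print(address)
```

Passing the filter through attrs is the safer habit for an attribute called name, since the keyword name is already taken by the tag-name parameter.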

Posted by huangapple on February 16, 2023 at 18:42:05
Link: https://go.coder-hub.com/75471094.html