February 16, 2023, 18:42:05

Web scraping SEC filings

Question

I am scraping 10-Q filings from SEC EDGAR.

This is the URL: https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm

If you inspect the page, you can find the element that contains the address.

I need to extract "1600 Amphitheatre Parkway" without using the id attribute. Below is a code snippet that extracts the text using the id attribute. However, I need to use the name attribute instead.

from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
page = session.get('https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm')
soup = BeautifulSoup(page.content, 'html.parser')
content = soup.find(id="d92517213e644-wk-Fact-0B11263160365DBABCF89969352EE602")
print(content.text)

Instead of the id attribute, I would like to use the name attribute. However, I am not able to extract the information using the name attribute. Please help.

How can I use the name attribute instead of the id attribute to extract the contents?

Thanks

Answer 1

Score: 1

You can find elements based on attribute values like this:

soup.find('html_tag', {"attribute": "value"})

In your case, the name attribute is on the ix:nonnumeric tag:

content = soup.find('ix:nonnumeric', {"name": "dei:EntityAddressAddressLine1"})
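The lookup above can be tried offline with a minimal sketch. The HTML string here is a simplified, made-up stand-in for the filing's markup (the contextRef and id values are placeholders), and find_by_name_attribute is a hypothetical helper name. One likely reason a call such as soup.find(name="...") does not work is that name is the keyword BeautifulSoup uses for the tag name, so an attribute called name has to be passed through the attrs dictionary.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the filing's Inline XBRL markup (an assumption,
# not copied from the real document; contextRef and id are placeholders).
SAMPLE_HTML = """
<html><body>
  <div>
    <ix:nonNumeric name="dei:EntityAddressAddressLine1"
                   contextRef="ctx-placeholder"
                   id="fact-placeholder">1600 Amphitheatre Parkway</ix:nonNumeric>
  </div>
</body></html>
"""


def find_by_name_attribute(html, name_value):
    """Return the text of the first element whose name attribute equals name_value."""
    soup = BeautifulSoup(html, "html.parser")
    # `name=` in find() means the tag name, so the attribute goes in `attrs`.
    element = soup.find(attrs={"name": name_value})
    if element is None:
        return None
    return element.get_text(strip=True)


address = find_by_name_attribute(SAMPLE_HTML, "dei:EntityAddressAddressLine1")
print(address)
```

Returning None when nothing matches avoids the AttributeError that content.text would raise if the element were missing, which can happen if the request to the real page is blocked or returns different markup.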
