February 8, 2023, 23:39:03

Scrape an HTML page that has text embedded in a stylesheet and woff file

Question

I want to scrape a webpage, but some of the data is embedded in the stylesheet and woff files.

Here is the first link: https://777codes.com/newtestament/mat1.html
I want the Greek text here, which does not show at all in Chrome's inspector.

From https://777codes.com/newtestament/gen1.html I want to get the Hebrew text, but if you look in Chrome's inspector you will see some "???", which also comes out in the scrape.

Basically, Chrome's element inspector shows blanks or question marks, but the text displays correctly in the browser, so I know the data is there.

The missing data is the Greek and Hebrew text.

I tried some basic scrapes with Beautiful Soup and very simple Selenium. They return the data shown in the element inspector, which is incorrect. I want to get what I see in the browser.

I understand that sometimes JavaScript renders content, but I think this is a bit different.

Answer 1

Score: 0

Actually, you don't need the transliterate library. I was able to extract the Hebrew characters from the site using Beautiful Soup.

import copy

import requests
from bs4 import BeautifulSoup

page = requests.get("https://777codes.com/newtestament/gen1.html")
soup = BeautifulSoup(page.content, "html.parser")

# first div whose class attribute is exactly "stl_01 stl_21"
first_hebrew_word = soup.find("div", class_="stl_01 stl_21")

# outputs 1:1 יתꢀרא (including hebrew chars)
print(first_hebrew_word.text)

# if you want to clean the output

# work on a copy so the original parse tree stays intact
word = copy.copy(first_hebrew_word)
for garbage in word.find_all("span", class_="stl_22"):
    # remove garbage
    garbage.decompose()

# outputs יתꢀראꢁ (including hebrew chars)
print(word.text.strip())

# write as UTF-8 so the Hebrew characters survive on any platform
with open("output.txt", "w", encoding="utf-8") as file:
    file.write(word.text.strip() + "\n")
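
The output printed by the code above mixes Hebrew letters with a few characters that do not look Hebrew. As a rough way to check which characters those are, the sketch below lists every non-space character that falls outside the Unicode Hebrew block. The helper name `non_hebrew_chars` and the sample string are assumptions made for illustration; in practice you would pass `word.text` instead.

```python
import unicodedata


def non_hebrew_chars(text):
    """Return (char, code point, Unicode name) for each non-space
    character that lies outside the Hebrew block."""
    found = []
    for ch in text:
        if ch.isspace():
            continue
        # The Unicode Hebrew block spans U+0590..U+05FF.
        if not 0x0590 <= ord(ch) <= 0x05FF:
            name = unicodedata.name(ch, "UNKNOWN")
            found.append((ch, f"U+{ord(ch):04X}", name))
    return found


# Assumed sample standing in for word.text; "\ua880" is an
# illustrative stand-in for one of the odd characters.
sample = "\u05d9\u05ea\ua880\u05e8\u05d0"
for ch, code_point, name in non_hebrew_chars(sample):
    print(code_point, name)
```

Any code point this reports is a candidate for a character that only looks right when rendered with the page's own font.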

Screenshot: output text in gedit (Ubuntu Linux)

Screenshot: zoomed output in Firefox (Ubuntu Linux)
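
If the leftover non-Hebrew characters turn out to be placeholders that the site's woff font draws as Hebrew glyphs (an assumption suggested by the question, not confirmed by this answer), one possible follow-up is to replace them with a lookup table. The sketch below is only an illustration: `GLYPH_MAP` and `remap_glyphs` are hypothetical names, and the two mappings are invented placeholders. The real table would have to be worked out from the font file itself.

```python
# HYPOTHETICAL mapping from placeholder code points to Hebrew letters.
# These pairs are invented for illustration and are NOT taken from the
# site's font; replace them with values you have verified yourself.
GLYPH_MAP = {
    "\ua880": "\u05d1",  # placeholder -> HEBREW LETTER BET (assumed)
    "\ua881": "\u05e9",  # placeholder -> HEBREW LETTER SHIN (assumed)
}


def remap_glyphs(text, glyph_map=GLYPH_MAP):
    """Replace each placeholder character with its mapped letter,
    leaving every other character untouched."""
    return text.translate(str.maketrans(glyph_map))


# Assumed sample standing in for word.text.strip()
sample = "\u05d9\u05ea\ua880\u05e8\u05d0\ua881"
print(remap_glyphs(sample))
```

Using `str.translate` keeps the replacement to a single pass over the string, and characters without an entry in the table pass through unchanged.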
