Scrape an HTML page that has text embedded in a stylesheet and WOFF file
Question
I want to scrape a webpage, but some of the data is embedded in the stylesheet and WOFF font files.
Here is the link: https://777codes.com/newtestament/mat1.html
I want the Greek text there, which does not show at all in Chrome's inspector.
And from https://777codes.com/newtestament/gen1.html I want to get the Hebrew text, but if you look in Chrome's inspector you will see some "???", which is also what comes out in the scrape.
Basically, Chrome's element inspector shows blanks or question marks, but the text displays correctly in the browser, so I know the data is there.
The missing data is the Greek and Hebrew text.
I tried some basic scrapes with Beautiful Soup and very simple Selenium. They return the same incorrect data shown in the element inspector. I want to get what I see in the browser.
I understand that JavaScript sometimes renders content, but I think this is a bit different.
Answer 1
Score: 0
Actually, you don't need the transliterate library. I was able to extract the Hebrew characters from the site using Beautiful Soup.

    import copy

    import requests
    from bs4 import BeautifulSoup

    page = requests.get("https://777codes.com/newtestament/gen1.html")
    soup = BeautifulSoup(page.content, "html.parser")

    first_hebrew_word = soup.find("div", class_="stl_01 stl_21")
    # outputs 1:1 יתꢀרא (including Hebrew chars)
    print(first_hebrew_word.text)

    # If you want to clean the output, work on a copy so the
    # original tag in the soup is left untouched.
    word = copy.copy(first_hebrew_word)
    for garbage in word.find_all("span", class_="stl_22"):
        # remove the unwanted span
        garbage.decompose()

    # outputs יתꢀראꢁ (including Hebrew chars)
    print(word.text.strip())

    # Write as UTF-8 explicitly so the Hebrew survives on any platform
    with open("output.txt", "w", encoding="utf-8") as file:
        file.write(word.text.strip() + "\n")
[Screenshot: output text in gedit (Ubuntu Linux)]
[Screenshot: zoomed output in Firefox (Ubuntu Linux)]
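
The printed output above mixes Hebrew letters with characters such as ꢀ, so it can help to check which code points were actually scraped. Below is a minimal sketch using only the standard library. The helper names `describe_chars` and `outside_hebrew` and the sample string are illustrative assumptions, not part of the code above. Characters that fall outside the Unicode Hebrew block may be glyphs that the page's custom font remaps, but that is a guess about this site, not something verified here.

```python
import unicodedata

# Unicode "Hebrew" block: U+0590 through U+05FF
HEBREW_BLOCK = range(0x0590, 0x0600)


def describe_chars(text):
    """Return (char, 'U+XXXX', Unicode name) for each non-space character."""
    rows = []
    for ch in text:
        if ch.isspace():
            continue
        # Fall back to a marker for unassigned or private-use code points
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((ch, f"U+{ord(ch):04X}", name))
    return rows


def outside_hebrew(text):
    """Return the non-space characters that are not in the Hebrew block."""
    return [
        ch for ch in text
        if not ch.isspace() and ord(ch) not in HEBREW_BLOCK
    ]


if __name__ == "__main__":
    # Hypothetical sample; in practice pass word.text.strip() from above
    sample = "\u05d1\u05e8\ua880"
    for _, code_point, name in describe_chars(sample):
        print(code_point, name)
```

If `outside_hebrew` returns anything for the scraped text, those characters would need a separate mapping to real letters before the output can be read as plain Hebrew.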