Scraping data from a particular pdf hosted online
Question
I am trying to scrape data from a series of PDFs hosted online.
The code I am using is:
import fitz
import requests
import io
import re

url_pdf = ["https://wcsecure.weblink.com.au/pdf/ASN/02528656.pdf"]

for url in url_pdf:
    # Download the PDF file
    print(url)
    try:
        response = requests.get(url)
        pdf_file = io.BytesIO(response.content)
        # Extract the text content of the PDF file
        pdf_reader = fitz.open(stream=pdf_file.read(), filetype="pdf")
        text_content = ''
        for page in range(pdf_reader.page_count):
            text_content += pdf_reader.load_page(page).get_text()
    except:
        print("Fail")
    print(text_content)
However, it fails for several PDFs, such as:
https://livent.com/wp-content/uploads/2022/07/Livent_2021SustainabilityReport-English.pdf
https://www.minviro.com/wp-content/uploads/2021/10/Shifting-the-lens.pdf
What could be the reason, and how can I fix this?
Answer 1
Score: 0
It would be useful to see information on the error by printing out the exceptions, e.g. with:
except Exception:
    import traceback
    traceback.print_exc()
    continue
Alternatively, simply remove the try: and except ...: statements from your code, and Python will show the exception information as it terminates. This information might help in figuring out what is going wrong.
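As a rough sketch of that suggestion, the loop from the question could record the full traceback for each URL instead of printing "Fail". The helper names fetch_pdf, extract_text, and scrape_all are hypothetical, and the timeout and raise_for_status() call are additions that the original code does not have. The sketch does not assume any particular cause for the failures; it only makes the error visible. Without a status check, for example, an HTTP error page would be handed to fitz as if it were a PDF.

```python
import traceback


def fetch_pdf(url):
    """Download a PDF, raising on HTTP errors instead of continuing silently."""
    import requests  # third-party; imported lazily so scrape_all works without it

    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()  # a 403 or 404 becomes a visible exception
    return response.content


def extract_text(pdf_bytes):
    """Concatenate the text of every page using PyMuPDF (fitz)."""
    import fitz  # third-party (PyMuPDF)

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return "".join(page.get_text() for page in doc)


def scrape_all(urls, fetch=fetch_pdf, extract=extract_text):
    """Return (texts, errors); errors maps each failing URL to its traceback."""
    texts, errors = {}, {}
    for url in urls:
        try:
            texts[url] = extract(fetch(url))
        except Exception:
            # Keep the complete traceback so the real cause can be inspected.
            errors[url] = traceback.format_exc()
    return texts, errors
```

Calling scrape_all(url_pdf) and printing each entry of the returned errors dictionary should then show which step failed for each PDF (the download or the text extraction) and with what exception.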