Question

I am trying to scrape data from a series of PDFs hosted online.
The code I am using is:

import fitz
import requests
import io
import re

url_pdf = ["https://wcsecure.weblink.com.au/pdf/ASN/02528656.pdf"]
for url in url_pdf:
    # Download the PDF file
    print(url)
    try:
        response = requests.get(url)
        pdf_file = io.BytesIO(response.content)

        # Extract the text content of the PDF file
        pdf_reader = fitz.open(stream=pdf_file.read(), filetype="pdf")
        text_content = ''
        for page in range(pdf_reader.page_count):
            text_content += pdf_reader.load_page(page).get_text()

    except:
        print("Fail")


print(text_content)

However, it fails for several PDFs, such as:
https://livent.com/wp-content/uploads/2022/07/Livent_2021SustainabilityReport-English.pdf

https://www.minviro.com/wp-content/uploads/2021/10/Shifting-the-lens.pdf

What could be the reason, and how can I fix this?

Answer 1

Score: 0

It would help to see what is actually failing by printing the exception details, e.g. with:

    except Exception:
        import traceback
        traceback.print_exc()
        continue

Alternatively, remove the try: and except ...: statements from your code, and Python will print the exception information when it terminates.

The traceback should help in figuring out what is going wrong.
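As a rough sketch of that idea, the loop below reports a reason for every URL that fails instead of a bare "Fail". It also checks two things that are cheap to rule out before the bytes reach fitz: the HTTP status code, and whether the body carries the %PDF- marker near its start. The names diagnose_download, extract_all, fetch and extract_text are made up for this illustration, and nothing here assumes a particular cause for the failing URLs in the question.

```python
import traceback


def diagnose_download(status_code, content_type, body):
    """Return a short description of an obvious download problem, or None.

    These are heuristics only; passing them does not prove the PDF is valid.
    """
    if status_code != 200:
        return f"HTTP status {status_code}"
    # PDF files normally carry the marker %PDF- near the start of the file.
    if b"%PDF-" not in body[:1024]:
        return f"no %PDF- marker found (Content-Type: {content_type})"
    return None


def extract_all(urls, fetch, extract_text):
    """Try every URL and return {url: text} for the ones that succeed.

    fetch(url) must return (status_code, content_type, body) and
    extract_text(body) must return a string; both are passed in by the caller.
    """
    results = {}
    for url in urls:
        try:
            status_code, content_type, body = fetch(url)
            problem = diagnose_download(status_code, content_type, body)
            if problem is not None:
                print(f"{url}: {problem}")
                continue
            results[url] = extract_text(body)
        except Exception:
            # Show the real error instead of hiding it behind "Fail".
            print(f"{url}: exception during download or extraction")
            traceback.print_exc()
            continue
    return results


def fetch(url):
    """Download a URL with requests (imported lazily; third-party package)."""
    import requests
    response = requests.get(url, timeout=30)
    return response.status_code, response.headers.get("Content-Type"), response.content


def extract_text(body):
    """Extract the text of every page with PyMuPDF (imported lazily)."""
    import fitz
    with fitz.open(stream=body, filetype="pdf") as doc:
        return "".join(page.get_text() for page in doc)
```

Calling extract_all(url_pdf, fetch, extract_text) with the list from the question would then print either a status code, a note that the body does not look like a PDF, or the full traceback for each URL that does not work.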
