May 13, 2023 01:04:29 · go

How to extract text from pdf using golang?

Question

I am trying to extract text from a PDF file in Go. See the code below. For some reason, it prints complete garbage (what looks like random numbers). Here is the PDF. I believe it is possible to extract the text, since I am able to copy and paste text from this file.

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"strings"

	pdf "github.com/unidoc/unipdf/v3/model"
)

func main() {
	fmt.Println("Enter URL of PDF file:")
	reader := bufio.NewReader(os.Stdin)
	url, err := reader.ReadString('\n')
	if err != nil {
		log.Fatal(err)
	}
	url = strings.TrimSpace(url)

	// Fetch PDF from URL.
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	buf, _ := ioutil.ReadAll(resp.Body)
	pdfReader, err := pdf.NewPdfReader(bytes.NewReader(buf))
	if err != nil {
		log.Fatal(err)
	}

	// Parse PDF file.
	isEncrypted, err := pdfReader.IsEncrypted()
	if err != nil {
		log.Fatal(err)
	}

	// If PDF is encrypted, exit with message.
	if isEncrypted {
		fmt.Println("Error: PDF is encrypted.")
		os.Exit(1)
	}

	// Get number of pages.
	numPages, err := pdfReader.GetNumPages()
	if err != nil {
		log.Fatal(err)
	}

	// Iterate through pages and print text.
	for i := 1; i <= numPages; i++ {
		page, err := pdfReader.GetPage(i)
		if err != nil {
			log.Fatal(err)
		}
		text, err := page.GetAllContentStreams()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(text)
	}
}

Answer 1

Score: 3

GetAllContentStreams probably returns the page's raw content streams, which contain formatting, graphics, images, and other objects alongside the text. That is likely why the output looks like complete garbage (random numbers).

> GetAllContentStreams gets all the content streams for a page as one
> string
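To make the "garbage" more concrete, here is a minimal sketch of what a raw content stream can look like and why printing it is not the same as extracting text. The `sampleStream` fragment is hand-written for illustration (it is not taken from the PDF in the question), and `literalStrings` is a hypothetical helper that only pulls out `(...)` literal strings. Real PDFs also use hex strings, font encodings, and positioning operators, which is why a proper extractor is needed.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleStream is a hand-written, hypothetical content stream fragment.
// The numbers are font sizes and coordinates, not document text.
const sampleStream = `BT
/F1 12 Tf
72 712 Td
(RANK) Tj
0 -14 Td
(NAME OF PLAYER) Tj
ET`

// literalStrings returns the contents of every "(...)" literal string in
// the stream. It tracks nested parentheses and keeps the character after a
// backslash as-is; it does not decode escapes such as \n or octal codes.
func literalStrings(stream string) []string {
	var out []string
	var cur strings.Builder
	depth := 0
	escaped := false
	for _, r := range stream {
		switch {
		case depth == 0:
			// Outside a string: only an opening parenthesis matters.
			if r == '(' {
				depth = 1
				cur.Reset()
			}
		case escaped:
			cur.WriteRune(r)
			escaped = false
		case r == '\\':
			escaped = true
		case r == '(':
			depth++
			cur.WriteRune(r)
		case r == ')':
			depth--
			if depth == 0 {
				out = append(out, cur.String())
			} else {
				cur.WriteRune(r)
			}
		default:
			cur.WriteRune(r)
		}
	}
	return out
}

func main() {
	fmt.Println("Raw stream:")
	fmt.Println(sampleStream)
	fmt.Println("Literal strings only:")
	for _, s := range literalStrings(sampleStream) {
		fmt.Println(s)
	}
}
```

Printing `sampleStream` directly shows operators and coordinates mixed with the text, which is roughly the situation described in the question.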

Instead of GetAllContentStreams, we can use the ExtractText method to extract the text.

> ExtractText processes and extracts all text data in content streams
> and returns as a string.

Note that the package needs a license API key to be used.

https://github.com/unidoc/unipdf

> This software package (unipdf) is a commercial product and requires a
> license code to operate.
>
> To Get a Metered License API Key in for free in the Free Tier, sign up
> on https://cloud.unidoc.io

The unipdf example code can be found here.

Here is the updated code:

// Add these imports to the ones in the question:
//   "github.com/unidoc/unipdf/v3/common/license"
//   "github.com/unidoc/unipdf/v3/extractor"

func init() {
	// Make sure to load your metered License API key prior to using the library.
	// If you need a key, you can sign up and create a free one at https://cloud.unidoc.io
	err := license.SetMeteredKey("your-metered-api-key")
	if err != nil {
		panic(err)
	}
}

func main() {
	//
	// The other blocks in your code
	//

	// Iterate through pages and print text.
	// GetPage is 1-based, so i is already the page number.
	for i := 1; i <= numPages; i++ {
		page, err := pdfReader.GetPage(i)
		if err != nil {
			log.Fatal(err)
		}
		ex, err := extractor.New(page)
		if err != nil {
			log.Fatal(err)
		}
		text, err := ex.ExtractText()
		if err != nil {
			log.Fatal(err)
		}

		fmt.Println("------------------------------")
		fmt.Printf("Page %d:\n", i)
		// Println rather than Printf, so a "%" in the text is not treated as a format verb.
		fmt.Println(text)
		fmt.Println("------------------------------")
	}
}

Answer 2

Score: 2

I could not find a free, capable Go package to extract text from PDFs. Luckily, there are some free CLI tools that can do this.

pdftotext is a promising choice. It originated in Xpdf; the poppler-utils package installed below ships Poppler's version of it. See its output:

$ pdftotext -layout -nopgbrk 2023-04-24_BU-12.pdf - | head
                           ALL INDIA TENNIS ASSOCIATION
                                        As on 24TH April , 2023
       BOY'S UNDER-12                                 2011                BEST    BEST    25% BEST POINTS
       24TH April , 2023                                                  Eight   Eight     Eight  CUT FOR     TTL.
                                                                          SING.   DBLS.     DBLS. NO SHOW      PTS.
RANK   NAME OF PLAYER                     REG NO.      DOB       STATE     PTS.   PTS.       PTS.  LATE WL    Final
  1    VIVAAN MIRDHA                      432735    08-Apr-11      (RJ)    485     565     141.25     0        797
  2    SMIT SACHIN UNDRE                  437763    07-Feb-11    (MH)      435     480       120      0      664.25
  3    RISHIKESH MANE                     436806    15-Jan-11    (MH)      420     380        95      0        619
  4    VIRAJ CHOUDHARY                    436648    03-Feb-11      (DL)    415     420       105      0      598.75

On Ubuntu, this tool can be installed with this command:

$ sudo apt install poppler-utils

It is easy to run it from a Go application with the os/exec package:

package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
)

func main() {
	// See "man pdftotext" for more options.
	args := []string{
		"-layout",              // Maintain (as best as possible) the original physical layout of the text.
		"-nopgbrk",             // Don't insert page breaks (form feed characters) between pages.
		"2023-04-24_BU-12.pdf", // The input file.
		"-",                    // Send the output to stdout.
	}
	cmd := exec.CommandContext(context.Background(), "pdftotext", args...)

	var buf bytes.Buffer
	cmd.Stdout = &buf

	if err := cmd.Run(); err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println(buf.String())
}
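A slightly more defensive variant of the program above might look like the sketch below. It uses the same pdftotext flags, but adds a timeout, captures stderr so a failure shows the tool's own message, and checks that the binary is installed first. The helper names `pdftotextArgs` and `extractText` and the 30-second timeout are choices made for this example, not part of the original answer.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// pdftotextArgs builds the same argument list as the answer above:
// keep the layout, skip form feeds, read inputPath, write to stdout.
func pdftotextArgs(inputPath string) []string {
	return []string{"-layout", "-nopgbrk", inputPath, "-"}
}

// extractText runs pdftotext on inputPath and returns its standard output.
// stderr is captured so that an error includes the tool's own message.
func extractText(ctx context.Context, inputPath string) (string, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.CommandContext(ctx, "pdftotext", pdftotextArgs(inputPath)...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("pdftotext failed: %w: %s", err, bytes.TrimSpace(stderr.Bytes()))
	}
	return stdout.String(), nil
}

func main() {
	// Give a clear message when the tool is missing.
	if _, err := exec.LookPath("pdftotext"); err != nil {
		fmt.Fprintln(os.Stderr, "pdftotext not found in PATH; install poppler-utils first")
		return
	}

	// Stop the external process if it runs longer than 30 seconds.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	text, err := extractText(ctx, "2023-04-24_BU-12.pdf")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Print(text)
}
```

Since the question fetches the PDF from a URL, the downloaded bytes would have to be written to a file (for example a temporary one) before calling `extractText`.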
