How to extract text from pdf using golang?

Question

I am trying to extract text from a PDF file in Go. See the code below. For some reason, it prints complete garbage (some random numbers). Here is the PDF. I believe it is possible to extract the text, since I can copy and paste text from this file.

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"strings"

	pdf "github.com/unidoc/unipdf/v3/model"
)

func main() {
	fmt.Println("Enter URL of PDF file:")
	reader := bufio.NewReader(os.Stdin)
	url, err := reader.ReadString('\n')
	if err != nil {
		log.Fatal(err)
	}
	url = strings.TrimSpace(url)

	// Fetch PDF from URL.
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	buf, _ := ioutil.ReadAll(resp.Body)
	pdfReader, err := pdf.NewPdfReader(bytes.NewReader(buf))
	if err != nil {
		log.Fatal(err)
	}

	// Parse PDF file.
	isEncrypted, err := pdfReader.IsEncrypted()
	if err != nil {
		log.Fatal(err)
	}

	// If PDF is encrypted, exit with message.
	if isEncrypted {
		fmt.Println("Error: PDF is encrypted.")
		os.Exit(1)
	}

	// Get number of pages.
	numPages, err := pdfReader.GetNumPages()
	if err != nil {
		log.Fatal(err)
	}

	// Iterate through pages and print text.
	for i := 1; i <= numPages; i++ {
		page, err := pdfReader.GetPage(i)
		if err != nil {
			log.Fatal(err)
		}
		text, err := page.GetAllContentStreams()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(text)
	}
}

Answer 1

Score: 3

GetAllContentStreams returns the page's raw content streams, which may include formatting, graphics, images, and other objects on that page. That is probably why the output looks like complete garbage (some random numbers).

> GetAllContentStreams gets all the content streams for a page as one
> string
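
To see why a raw content stream reads like noise, here is a minimal sketch using only the standard library. The content stream is hand-written for illustration, and `literalStrings` is a hypothetical helper, not part of unipdf. It only handles the simplest case, a literal string followed by the `Tj` operator; real PDFs also use hex strings, `TJ` arrays, escapes, and font encodings, which is why a proper extractor is needed.

```go
package main

import (
	"fmt"
	"regexp"
)

// showText matches a literal string operand followed by the Tj (show text)
// operator, e.g. "(Hello) Tj". It deliberately ignores hex strings, TJ
// arrays, escape sequences, and font encodings.
var showText = regexp.MustCompile(`\(([^()\\]*)\)\s*Tj`)

// literalStrings returns the operands of simple Tj operators found in a raw
// content stream, in order of appearance. Hypothetical helper for illustration.
func literalStrings(stream string) []string {
	var out []string
	for _, m := range showText.FindAllStringSubmatch(stream, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	// A hand-written content stream: most tokens are fonts, sizes, and
	// positions; only the two parenthesized operands are text.
	stream := "BT /F1 12 Tf 72 712 Td (Hello) Tj 0 -14 Td (World) Tj ET"
	fmt.Println(literalStrings(stream))
}
```

Everything outside the parentheses is operators and numeric operands, which is roughly what printing the whole stream shows.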

Instead of GetAllContentStreams, we can use the ExtractText method to extract the text.

> ExtractText processes and extracts all text data in content streams
> and returns as a string.

Note that a license API key is needed to use the package.

https://github.com/unidoc/unipdf

> This software package (unipdf) is a commercial product and requires a
> license code to operate.
>
> To Get a Metered License API Key in for free in the Free Tier, sign up
> on https://cloud.unidoc.io

The unipdf example code can be found here.

Here is the updated code:

// In addition to the imports in your code, this version needs:
//   "github.com/unidoc/unipdf/v3/common/license"
//   "github.com/unidoc/unipdf/v3/extractor"

func init() {
	// Make sure to load your metered License API key prior to using the library.
	// If you need a key, you can sign up and create a free one at https://cloud.unidoc.io
	err := license.SetMeteredKey("your-metered-api-key")
	if err != nil {
		panic(err)
	}
}

func main() {
	//
	// The other blocks in your code
	//

	// Iterate through pages and print text.
	// Page numbers are 1-based, so i is already the page number.
	for i := 1; i <= numPages; i++ {
		page, err := pdfReader.GetPage(i)
		if err != nil {
			log.Fatal(err)
		}
		ex, err := extractor.New(page)
		if err != nil {
			log.Fatal(err)
		}
		text, err := ex.ExtractText()
		if err != nil {
			log.Fatal(err)
		}

		fmt.Println("------------------------------")
		fmt.Printf("Page %d:\n", i)
		// Println rather than Printf, so that any '%' in the text is not
		// treated as a format verb.
		fmt.Println(text)
		fmt.Println("------------------------------")
	}
}

Answer 2

Score: 2

I cannot find a free, capable Go package to extract text from PDF. Luckily, there are some free CLI tools that can do this.

pdftotext, originally from Xpdf, is a promising choice (the poppler-utils package installed below ships Poppler's version of it). See its output:

$ pdftotext -layout -nopgbrk 2023-04-24_BU-12.pdf - | head
                           ALL INDIA TENNIS ASSOCIATION
                                        As on 24TH April , 2023
       BOY'S UNDER-12                                 2011                BEST    BEST    25% BEST POINTS
       24TH April , 2023                                                  Eight   Eight     Eight  CUT FOR     TTL.
                                                                          SING.   DBLS.     DBLS. NO SHOW      PTS.
RANK   NAME OF PLAYER                     REG NO.      DOB       STATE     PTS.   PTS.       PTS.  LATE WL    Final
  1    VIVAAN MIRDHA                      432735    08-Apr-11      (RJ)    485     565     141.25     0        797
  2    SMIT SACHIN UNDRE                  437763    07-Feb-11    (MH)      435     480       120      0      664.25
  3    RISHIKESH MANE                     436806    15-Jan-11    (MH)      420     380        95      0        619
  4    VIRAJ CHOUDHARY                    436648    03-Feb-11      (DL)    415     420       105      0      598.75
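
If the goal is to get at individual columns, one possible approach is to split each `-layout` row on runs of two or more spaces. The sketch below assumes that columns are always separated by at least two spaces and that values contain only single spaces, which holds for the data rows shown above but not necessarily for the header lines or for other PDFs. `splitColumns` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// columnGap matches the run of two or more whitespace characters that
// pdftotext -layout leaves between columns in the sample output.
var columnGap = regexp.MustCompile(`\s{2,}`)

// splitColumns splits one row of -layout output into its fields.
// Single spaces (as in a player's name) are kept inside a field.
func splitColumns(line string) []string {
	return columnGap.Split(strings.TrimSpace(line), -1)
}

func main() {
	row := "  1    VIVAAN MIRDHA       432735    08-Apr-11      (RJ)    485     565"
	for i, field := range splitColumns(row) {
		fmt.Printf("%d: %q\n", i, field)
	}
}
```

This is fragile by nature: two columns that happen to be one space apart are merged into one field, so it is worth validating the field count per row.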

On Ubuntu, this tool can be installed with this command:

$ sudo apt install poppler-utils

It is easy to run it from a Go application with the os/exec package:

package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
)

func main() {
	// See "man pdftotext" for more options.
	args := []string{
		"-layout",              // Maintain (as best as possible) the original physical layout of the text.
		"-nopgbrk",             // Don't insert page breaks (form feed characters) between pages.
		"2023-04-24_BU-12.pdf", // The input file.
		"-",                    // Send the output to stdout.
	}
	cmd := exec.CommandContext(context.Background(), "pdftotext", args...)

	var buf bytes.Buffer
	cmd.Stdout = &buf

	if err := cmd.Run(); err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println(buf.String())
}
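
The original question printed the text page by page. A minimal sketch of how that could be recovered from pdftotext output: if the `-nopgbrk` option is left out, pdftotext ends each page with a form feed character (`\f`), so the captured output can be split on it. `splitPages` is a hypothetical helper, and the sample string in `main` stands in for `buf.String()` from the program above.

```go
package main

import (
	"fmt"
	"strings"
)

// splitPages splits pdftotext output into pages on form feed characters.
// It assumes pdftotext was run WITHOUT -nopgbrk.
func splitPages(out string) []string {
	pages := strings.Split(out, "\f")
	// The last page is normally terminated by a form feed as well, which
	// leaves one empty trailing element; drop it.
	if n := len(pages); n > 0 && pages[n-1] == "" {
		pages = pages[:n-1]
	}
	return pages
}

func main() {
	// Stand-in for the captured stdout of pdftotext.
	out := "first page\n\fsecond page\n\f"
	for i, page := range splitPages(out) {
		fmt.Println("------------------------------")
		fmt.Printf("Page %d:\n%s", i+1, page)
	}
}
```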

huangapple
  • Posted on May 13, 2023 at 01:04:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76238545.html