How to output results to CSV of a concurrent web scraper in Go?

Question

I'm new to Go and am trying to take advantage of Go's concurrency to build a basic scraper that extracts the title, meta description, and meta keywords from a list of URLs.

I'm able to print the results to the terminal concurrently, but I can't figure out how to write the output to a CSV file. I've tried every variation I could think of with my limited knowledge of Go, and many of them break the concurrency, so I'm a bit stuck.

My code and URL input file are below. Thanks in advance for any tips!

// file name: metascraper.go
package main

import (
	// import standard libraries
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
	"time"
	// import third party libraries
	"github.com/PuerkitoBio/goquery"
)

func csvParsing() {
	file, err := os.Open("data/sample.csv")
	checkError("无法打开文件", err)

	if err != nil {
		// err is printable
		// elements passed are separated by a space automatically
		fmt.Println("Error:", err)
		return
	}

	// automatically call Close() when the current function returns
	defer file.Close()
	//
	reader := csv.NewReader(file)
	// options are documented at:
	// http://golang.org/src/pkg/encoding/csv/reader.go?s=3213:3671#L94
	reader.Comma = ';'
	lineCount := 0

	fileWrite, err := os.Create("data/result.csv")
	checkError("无法创建文件", err)
	defer fileWrite.Close()

	writer := csv.NewWriter(fileWrite)
	defer writer.Flush()

	for {
		// read just one record
		record, err := reader.Read()
		// end-of-file is reported via err
		if err == io.EOF {
			break
		} else if err != nil {
			fmt.Println("错误:", err)
			return
		}

		go func(url string) {
			// fmt.Println(msg)
			doc, err := goquery.NewDocument(url)
			if err != nil {
				checkError("没有URL", err)
			}

			metaDescription := make(chan string, 1)
			pageTitle := make(chan string, 1)

			go func() {
				// time.Sleep(time.Second * 2)
				// use the CSS selectors found with the browser inspector
				// for each meta tag, use index and item
				pageTitle <- doc.Find("title").Contents().Text()

				doc.Find("meta").Each(func(index int, item *goquery.Selection) {
					if item.AttrOr("name", "") == "description" {
						metaDescription <- item.AttrOr("content", "")
					}
				})
			}()
			select {
			case res := <-metaDescription:
				resTitle := <-pageTitle
				fmt.Println(res)
				fmt.Println(resTitle)

				// Have been trying to output to CSV here but it's not working

				// writer.Write([]string{url, resTitle, res})
				// err := writer.WriteString(`res`)
				// checkError("无法写入文件", err)

			case <-time.After(time.Second * 2):
				fmt.Println("超时2秒")
			}

		}(record[0])

		fmt.Println()

		lineCount++
	}
}

func main() {

	csvParsing()

	// pause before the program finishes so we can see the output
	var input string
	fmt.Scanln(&input)
}

func checkError(message string, err error) {
	if err != nil {
		log.Fatal(message, err)
	}
}

The data/sample.csv input file with URLs:

http://jonathanmh.com
http://keshavmalani.com
http://google.com
http://bing.com
http://facebook.com

Answer 1

Score: 0

In the code you supplied, you had commented out the following lines:

// Have been trying to output to CSV here but it's not working
err = writer.Write([]string{url, resTitle, res})
checkError("Cannot write to file", err)

This code is correct, except for one issue.
Earlier in the function, you have the following code:

fileWrite, err := os.Create("data/result.csv")
checkError("Cannot create file", err)
defer fileWrite.Close()

This causes fileWrite to be closed (and the deferred writer.Flush() to run) as soon as csvParsing() returns. Because that deferred close runs while your goroutines are still scraping, the file is already closed by the time they try to write to it.

Solution:
Close fileWrite and flush the writer only after your concurrent funcs have finished writing, for example by moving the close into the concurrent func, or by waiting for all goroutines to finish before csvParsing() returns.
