2017年2月8日 20:28:21go评论83阅读模式

英文:

Objects in go getting replaced

问题

我正在通过编写一个网络爬虫来学习Go语言。我试图从allpages.com获取所有的商业类别列表。

以下是我的整个程序。不幸的是，我无法确定问题所在，所以我将整个程序都粘贴了出来。

如果你运行这个程序，你会发现首先它正确地下载了第一个页面，并将所有提取到的类别添加到类别列表中。

然而，当它下载后续页面时，似乎搞乱了对父类别的引用。例如，它错误地计算了URL http://www.allpages.com/travel-tourism/political-ideological-organizations/，而实际上political-ideological-organizations/并不是travel-tourism/的子类别。通过查看日志，我发现它覆盖了parent对象中的数据。错误在有更多工作线程时更加明显。

在我开始将数据通过引用传递给goroutine之前，这个问题的情况要好一些，但本质上问题是一样的。

我有几个问题：

在不必查看日志行的情况下，我该如何调试这个问题？
出了什么问题/为什么它不工作，如何修复？

package main

import (
	"fmt"
	"github.com/PuerkitoBio/goquery"
	"log"
	"strconv"
	"strings"
	"regexp"
)

const domain = "http://www.allpages.com/"
const categoryPage = "category.html"

type Category struct {
	url     string
	level   uint
	name    string
	entries int
	parent  *Category
}

type DownloadResult struct {
	doc      *goquery.Document
	category *Category
}

const WORKERS = 2
const SEPARATOR = "§§§"

func main() {

	allCategories := make([]Category, 0)

	downloadChannel := make(chan *Category)
	resultsChannel := make(chan *DownloadResult, 100)

	for w := 1; w <= WORKERS; w++ {
		go worker(downloadChannel, resultsChannel)
	}

	numRequests := 1
	downloadChannel <- &Category{domain + categoryPage, 0, "root", 0, nil}

	for result := range resultsChannel {
		var extractor func(doc *goquery.Document) []string

		if result.category.level == 0 {
			extractor = topLevelExtractor
		} else if result.category.level == 1 {
			extractor = secondLevelExtractor
		} else {
			extractor = thirdLevelExtractor
		}

		categories := extractCategories(result.doc, result.category, extractor)
		allCategories = append(allCategories, *categories...)

		//fmt.Printf("Appending categories: %v", *categories)

		fmt.Printf("total categories = %d, total requests = %d\n", len(allCategories), numRequests)

		for _, category := range *categories {
			numRequests += 1
			downloadChannel <- &category
		}

		// close the channels when there are no more jobs
		if len(allCategories) > numRequests {
			close(downloadChannel)
			close(resultsChannel)
		}
	}

	fmt.Println("Done")
}

func worker(downloadChannel <-chan *Category, results chan<- *DownloadResult) {
	for target := range downloadChannel {
		fmt.Printf("Downloading %v (addr %p) ...", target, &target)

		doc, err := goquery.NewDocument(target.url)
		if err != nil {
			log.Fatal(err)
			panic(err)
		}

		fmt.Print("done \n")

		results <- &DownloadResult{doc, target}
	}
}

func extractCategories(doc *goquery.Document, parent *Category, extractor func(doc *goquery.Document) []string) *[]Category {

	numberRegex, _ := regexp.Compile("[0-9,]+")

	log.Printf("Extracting subcategories for page %s\n", parent)

	subCategories := extractor(doc)

	categories := make([]Category, 0)

	for _, subCategory := range subCategories {
		log.Printf("Got subcategory=%s from parent=%s", subCategory, parent)
		extracted := strings.Split(subCategory, SEPARATOR)

		numberWithComma := numberRegex.FindString(extracted[2])
		number := strings.Replace(numberWithComma, ",", "", -1)

		numRecords, err := strconv.Atoi(number)
		if err != nil {
			log.Fatal(err)
			panic(err)
		}

		var category Category

		level := parent.level + 1

		if parent.level == 0 {
			category = Category{domain + extracted[1], level, extracted[0], numRecords, parent}
		} else {
			log.Printf("category URL=%s, parent=%s, parent=%v", extracted[1], parent.url, parent)
			category = Category{parent.url + extracted[1], level, extracted[0], numRecords, parent}
		}

		log.Printf("Appending category=%v (pointer=%p)", category, &category)

		categories = append(categories, category)
	}

	return &categories
}

func topLevelExtractor(doc *goquery.Document) []string {
	return doc.Find(".cat-listings-td .c-1s-2m-1-td1").Map(func(i int, s *goquery.Selection) string {
		title := s.Find("a").Text()
		url := s.Find("a").Map(func(x int, a *goquery.Selection) string {
			v, _ := a.Attr("href")
			return v
		})
		records := s.Clone().Children().Remove().End().Text()

		//log.Printf("Item %d: %s, %s - %s\n", i, title, records, url)

		res := []string{title, url[0], records}
		return strings.Join(res, SEPARATOR)
	})
}

func secondLevelExtractor(doc *goquery.Document) []string {
	return doc.Find(".c-2m-3c-1-table .c-2m-3c-1-td1").Map(func(i int, s *goquery.Selection) string {
		title := s.Find("a").Text()
		url := s.Find("a").Map(func(x int, a *goquery.Selection) string {
			v, _ := a.Attr("href")
			return v
		})
		records := s.Clone().Children().Remove().End().Text()

		//log.Printf("Item %d: %s, %s - %s\n", i, title, records, url)

		res := []string{title, url[0], records}
		return strings.Join(res, SEPARATOR)
	})
}

func thirdLevelExtractor(doc *goquery.Document) []string {
	return doc.Find(".c-2m-3c-1-table .c-2m-3c-1-td1").Map(func(i int, s *goquery.Selection) string {
		title := s.Find("a").Text()
		url := s.Find("a").Map(func(x int, a *goquery.Selection) string {
			v, _ := a.Attr("href")
			return v
		})
		records := s.Clone().Children().Remove().End().Text()

		//log.Printf("Item %d: %s, %s - %s\n", i, title, records, url)

		res := []string{title, url[0], records}
		return strings.Join(res, SEPARATOR)
	})
}

更新
问题已解决-请参阅下面的评论。

英文:

I'm learning go by writing a web spider. I'm trying to get a list of all the business categories from allpages.com.

Below is my entire program. Unfortunately I can't isolate the issue so I've pasted it all.

If you run this program, you'll see that first of all it correctly downloads the first page, and adds all the extracted categories to the list of categories.

However, when it then downloads subsequent pages, it seems to mess up the reference to the parent category. E.g. it incorrectly calculates the URL http://www.allpages.com/travel-tourism/political-ideological-organizations/, when in fact political-ideological-organizations/ is not a subcategory of travel-tourism/. Digging through the logs it seems to overwrite the data in the parent object. The error is more pronounced the more workers there are.

This was working a bit better before I started passing data by reference to the goroutine, but I had essentially the same issue.

I've got several questions:

How can I debug this without resorting to picking through log lines?

What's wrong/why isn't it working and how can it be fixed?

 package main
import (
&quot;fmt&quot;
&quot;github.com/PuerkitoBio/goquery&quot;
&quot;log&quot;
&quot;strconv&quot;
&quot;strings&quot;
&quot;regexp&quot;
)
const domain = &quot;http://www.allpages.com/&quot;
const categoryPage = &quot;category.html&quot;
type Category struct {
url string
level uint
name string
entries int
parent *Category
}
type DownloadResult struct {
doc *goquery.Document
category *Category
}
const WORKERS = 2
const SEPARATOR = &quot;&#167;&#167;&#167;&quot;
func main() {
allCategories := make([]Category, 0)
downloadChannel := make(chan *Category)
resultsChannel := make(chan *DownloadResult, 100)
for w := 1; w &lt;= WORKERS; w++ {
go worker(downloadChannel, resultsChannel)
}
numRequests := 1
downloadChannel &lt;- &amp;Category{ domain + categoryPage, 0, &quot;root&quot;, 0, nil }
for result := range resultsChannel {
var extractor func(doc *goquery.Document) []string
if result.category.level == 0 {
extractor = topLevelExtractor
} else if result.category.level == 1 {
extractor = secondLevelExtractor
} else {
extractor = thirdLevelExtractor
}
categories := extractCategories(result.doc, result.category, extractor)
allCategories = append(allCategories, *categories...)
//fmt.Printf(&quot;Appending categories: %v&quot;, *categories)
fmt.Printf(&quot;total categories = %d, total requests = %d\n&quot;, len(allCategories), numRequests)
for _, category := range *categories {
numRequests += 1
downloadChannel &lt;- &amp;category
}
// close the channels when there are no more jobs
if len(allCategories) &gt; numRequests {
close(downloadChannel)
close(resultsChannel)
}
}
fmt.Println(&quot;Done&quot;)
}
func worker(downloadChannel &lt;-chan *Category, results chan&lt;- *DownloadResult) {
for target := range downloadChannel {
fmt.Printf(&quot;Downloading %v (addr %p) ...&quot;, target, &amp;target)
doc, err := goquery.NewDocument(target.url)
if err != nil {
log.Fatal(err)
panic(err)
}
fmt.Print(&quot;done \n&quot;)
results &lt;- &amp;DownloadResult{doc, target}
}
}
func extractCategories(doc *goquery.Document, parent *Category, extractor func(doc *goquery.Document) []string) *[]Category {
numberRegex, _ := regexp.Compile(&quot;[0-9,]+&quot;)
log.Printf(&quot;Extracting subcategories for page %s\n&quot;, parent)
subCategories := extractor(doc)
categories := make([]Category, 0)
for _, subCategory := range subCategories {
log.Printf(&quot;Got subcategory=%s from parent=%s&quot;, subCategory, parent)
extracted := strings.Split(subCategory, SEPARATOR)
numberWithComma := numberRegex.FindString(extracted[2])
number := strings.Replace(numberWithComma, &quot;,&quot;, &quot;&quot;, -1)
numRecords, err := strconv.Atoi(number)
if err != nil {
log.Fatal(err)
panic(err)
}
var category Category
level := parent.level + 1
if parent.level == 0 {
category = Category{ domain + extracted[1], level, extracted[0], numRecords, parent }
} else {
log.Printf(&quot;category URL=%s, parent=%s, parent=%v&quot;, extracted[1], parent.url, parent)
category = Category{ parent.url + extracted[1], level, extracted[0], numRecords, parent }
}
log.Printf(&quot;Appending category=%v (pointer=%p)&quot;, category, &amp;category)
categories = append(categories, category)
}
return &amp;categories
}
func topLevelExtractor(doc *goquery.Document) []string {
return doc.Find(&quot;.cat-listings-td .c-1s-2m-1-td1&quot;).Map(func(i int, s *goquery.Selection) string {
title := s.Find(&quot;a&quot;).Text()
url := s.Find(&quot;a&quot;).Map(func(x int, a *goquery.Selection) string {
v, _ := a.Attr(&quot;href&quot;)
return v
})
records := s.Clone().Children().Remove().End().Text()
//log.Printf(&quot;Item %d: %s, %s - %s\n&quot;, i, title, records, url)
res := []string{title, url[0], records}
return strings.Join(res, SEPARATOR)
})
}
func secondLevelExtractor(doc *goquery.Document) []string {
return doc.Find(&quot;.c-2m-3c-1-table .c-2m-3c-1-td1&quot;).Map(func(i int, s *goquery.Selection) string {
title := s.Find(&quot;a&quot;).Text()
url := s.Find(&quot;a&quot;).Map(func(x int, a *goquery.Selection) string {
v, _ := a.Attr(&quot;href&quot;)
return v
})
records := s.Clone().Children().Remove().End().Text()
//log.Printf(&quot;Item %d: %s, %s - %s\n&quot;, i, title, records, url)
res := []string{title, url[0], records}
return strings.Join(res, SEPARATOR)
})
}
func thirdLevelExtractor(doc *goquery.Document) []string {
return doc.Find(&quot;.c-2m-3c-1-table .c-2m-3c-1-td1&quot;).Map(func(i int, s *goquery.Selection) string {
title := s.Find(&quot;a&quot;).Text()
url := s.Find(&quot;a&quot;).Map(func(x int, a *goquery.Selection) string {
v, _ := a.Attr(&quot;href&quot;)
return v
})
records := s.Clone().Children().Remove().End().Text()
//log.Printf(&quot;Item %d: %s, %s - %s\n&quot;, i, title, records, url)
res := []string{title, url[0], records}
return strings.Join(res, SEPARATOR)
})
}

Update
Fixed - see comment below.

答案1

得分: 0

循环遍历：

for _, category := range *categories {
numRequests += 1
downloadChannel <- &category
}

意味着我将对临时变量 category 的引用发送到通道中，而不是该值的实际内存地址。

我通过使用不同的循环来修复了这个问题：

for i := 0; i < len(*categories); i++ {
fmt.Printf("Queuing category: %v (%p)", categoriesValues[i], &categoriesValues[i])
downloadChannel <- &categoriesValues[i]
}

英文:

Looping over:

            for _, category := range *categories {
numRequests += 1
downloadChannel &lt;- &amp;category
}

meant I was sending a reference to the temporary variable category to the channel, instead of the actual memory address of that value.

I've fixed this by using a different loop:

	for i := 0; i &lt; len(*categories); i++ {
fmt.Printf(&quot;Queuing category: %v (%p)&quot;, categoriesValues[i], categoriesValues[i])
downloadChannel &lt;- &amp;categoriesValues[i]
}

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Go中的对象正在被替换。

问题

答案1

Golang断点在IntelliJ IDEA 2016.1.1中无法工作。

Os/exec elegant, loop compatible stdin and stdout input/output

Proper way to decode asn1.RawValue in Golang

检查 JSON 数组的长度而不进行解组。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论