Golang table webscraping

Question

I have the code below to scrape a specific cell value from an HTML table. You can go to https://www.haremaltin.com/altin-fiyatlari and search for "satis__ATA_ESKI" in inspect mode to see that value. I am a beginner in Go and did my best, but unfortunately I couldn't get that value. Can anybody help me? By the way, they don't have a community API. One more thing: add time.Sleep to wait for the page to load. If it returns "-", that is because the page hasn't loaded yet.

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	url := "https://www.haremaltin.com/altin-fiyatlari"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != 200 {
		log.Fatalf("failed to fetch data: %d %s", resp.StatusCode, resp.Status)
	}

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	doc.Find("tr__ATA_ESKI tr").Each(func(j int, tr *goquery.Selection) {
		data := []string{}
		tr.Find("td").Each(func(ix int, td *goquery.Selection) {
			e := td.Text()
			data = append(data, e)
			fmt.Println(data)
		})
	})
}

SOLUTION:

You can see the answer below, and if you want, you can check "Golang table webscraping" to see why this kind of solution is used.

By the way, we can iterate over the map to fetch the specific value. I have code for this too, but if you know an easier method, please leave a comment.

for _, v := range data { // we only need the values of the top-level map
	m, ok := v.(map[string]interface{}) // type-assert the value to a nested JSON object
	if !ok {
		fmt.Printf("Error %T\n", v)
		continue // not an object, nothing to search here
	}
	for k, l := range m {
		if k == "ATA_ESKI" { // the value we want is inside this map
			a, ok := l.(map[string]interface{}) // type assertion again
			if !ok {
				fmt.Printf("Error %T\n", l)
				continue
			}
			for b, c := range a {
				if b == "satis" { // the value we want
					fmt.Println("Price is", c)
				}
			}
		}
	}
}

Full solution with iteration below:

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	if _, err := fetchData(); err != nil {
		log.Fatal(err)
	}
}

// fetchData posts to the AJAX endpoint, decodes the JSON reply, and prints
// the "satis" value of the "ATA_ESKI" entry.
func fetchData() (map[string]interface{}, error) {
	body := strings.NewReader("dil_kodu=tr")
	req, err := http.NewRequest("POST", "https://www.haremaltin.com/dashboard/ajax/doviz", body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Requested-With", "XMLHttpRequest")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+).
	jsonData, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	var data map[string]interface{}
	if err := json.Unmarshal(jsonData, &data); err != nil {
		return nil, err
	}

	for _, v := range data {
		m, ok := v.(map[string]interface{})
		if !ok {
			fmt.Printf("Error %T\n", v)
			continue
		}
		for k, l := range m {
			if k == "ATA_ESKI" {
				a, ok := l.(map[string]interface{})
				if !ok {
					fmt.Printf("Error %T\n", l)
					continue
				}
				for b, c := range a {
					if b == "satis" {
						fmt.Println("Price", c)
					}
				}
			}
		}
	}
	return data, nil
}

Answer 1

Score: 3

You can fetch the data via an HTTP POST request. Do not forget to add the X-Requested-With header to the request.

// Requires the imports "encoding/json", "io", "net/http" and "strings".
func fetchData() (map[string]interface{}, error) {
	body := strings.NewReader("dil_kodu=tr")
	req, err := http.NewRequest("POST", "https://www.haremaltin.com/dashboard/ajax/doviz", body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Requested-With", "XMLHttpRequest")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+).
	jsonData, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	var data map[string]interface{}
	if err := json.Unmarshal(jsonData, &data); err != nil {
		return nil, err
	}
	return data, nil
}

Answer 2

Score: 1

Since the table is powered by JavaScript, I would suggest you use a different approach. Here's why.

What you're really scraping is this web page:

curl https://www.haremaltin.com/altin-fiyatlari > out.html

You can run this curl command in a terminal and get the same reply as Go's HTTP request ("exact" is a strong word, but it holds most of the time, and certainly in this case).

As you can see, no values are present in the out.html file you created; that's why your Go script isn't returning any values.

You need to have JavaScript running to populate the page so that you can then scrape it.

I've used https://github.com/chromedp/chromedp in a couple of projects with great success. With this tool, your workflow will look something like this:

  1. Open a headless browser
  2. Go to the URL
  3. Dump the page's HTML
  4. Parse it with goquery
  5. Print your response

huangapple
  • Posted on 2022-06-17 01:21:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/72649649.html