How can I scrape values from embedded Javascript in HTML?


Question


I need to parse some values out of JavaScript embedded in a web page.
I tried tokenizing the HTML with something like the following, but it doesn't tokenize the JavaScript part:

func CheckSitegroup(httpBody io.Reader) []string {
	sitegroups := make([]string, 0)
	page := html.NewTokenizer(httpBody)
	for {
		tokenType := page.Next()
		fmt.Println("TokenType:", tokenType)
		// check if HTML file has ended
		if tokenType == html.ErrorToken {
			return sitegroups
		}
		token := page.Token()
		fmt.Println("Token:", token)
		if tokenType == html.StartTagToken && token.DataAtom.String() == "script" {
			for _, attr := range token.Attr {
				fmt.Println("ATTR.KEY:", attr.Key)
				sitegroups = append(sitegroups, attr.Val)
			}
		}
	}
}

The script in the HTML body looks like the one below. I need the campaign number (nil / "" if there is no number, or if there is no test.campaign = at all); the same goes for the sitegroup.
Is there an easy way to get this information? I thought about regular expressions, but maybe there is something else? I have never worked with regex.

<script type="text/javascript" >
	var test = {};
	test.campaign = "8d26113ba";
	test.isTest = "false";
	test.sitegroup = "Homepage";
</script>

Answer 1

Score: 2


First, you need to get the JS code safely. The easiest way would be with the goquery library: https://github.com/PuerkitoBio/goquery

After that, you need to get the variables safely. Depending on how complicated it gets, you could parse the real JS abstract syntax tree and look for the right variables, for example with the parser from otto, the excellent JS interpreter written in Go: http://godoc.org/github.com/robertkrimen/otto/parser

Or, as you mentioned, for the case above a regex would be really easy. There is a really nice tutorial on regexes in Go: https://github.com/StefanSchroeder/Golang-Regex-Tutorial
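
The regex route could look roughly like this. It is only a sketch using the standard regexp package, and it assumes every value of interest is a plain double-quoted string assigned to a property of test, as in the question; escaped quotes inside values are not handled. The names assignRe and testVars are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// assignRe matches statements such as: test.campaign = "8d26113ba";
// Group 1 is the property name, group 2 the quoted value.
var assignRe = regexp.MustCompile(`test\.(\w+)\s*=\s*"([^"]*)"`)

// testVars (hypothetical helper) collects every test.<name> = "<value>"
// assignment found in js into a map.
func testVars(js string) map[string]string {
	vars := make(map[string]string)
	for _, m := range assignRe.FindAllStringSubmatch(js, -1) {
		vars[m[1]] = m[2]
	}
	return vars
}

func main() {
	js := `
	var test = {};
	test.campaign = "8d26113ba";
	test.isTest = "false";
	test.sitegroup = "Homepage";`

	vars := testVars(js)
	fmt.Println(vars["campaign"])      // 8d26113ba
	fmt.Println(vars["sitegroup"])     // Homepage
	fmt.Println(vars["missing"] == "") // true
}
```

Looking up a key that is absent from the map yields "", which gives the empty result the question asks for when test.campaign or test.sitegroup is missing.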

Answer 2

Score: 0


The strings package in the Go standard library comes with a lot of useful functions that you can use to parse the JavaScript code and get the campaign number you need.

The following code extracts the campaign number from the JS code provided in your question (it can be run on the Go Playground):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

const js = `
<script type="text/javascript" >
    var test = {};
    test.campaign = "8d26113ba";
    test.isTest = "false";
    test.sitegroup = "Homepage";
</script>
`

// StringToLines splits s into its individual lines.
func StringToLines(s string) []string {
	var lines []string

	scanner := bufio.NewScanner(strings.NewReader(s))
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}

	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "reading input:", err)
	}

	return lines
}

// getCampaignNumber extracts the quoted value from a line such as
//     test.campaign = "8d26113ba";
// It returns "" if the line contains no "=".
func getCampaignNumber(line string) string {
	parts := strings.SplitN(line, "=", 2)
	if len(parts) < 2 {
		return ""
	}
	value := strings.TrimSpace(parts[1])
	value = strings.TrimSuffix(value, ";")
	return strings.Trim(value, `"`)
}

func main() {
	for _, line := range StringToLines(js) {
		if strings.Contains(line, "test.campaign") {
			fmt.Println(getCampaignNumber(line))
		}
	}
}
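
The same string-based idea can be generalized so that one function serves both the campaign and the sitegroup, returning "" when the assignment is absent, as the question requires. The sketch below assumes each assignment sits on its own line in the form test.<name> = "<value>"; and the helper name lookupVar is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// lookupVar (hypothetical helper) scans js line by line for an assignment
// of the form test.<name> = "<value>"; and returns the value without
// quotes. It returns "" if no such assignment exists.
func lookupVar(js, name string) string {
	prefix := "test." + name
	for _, line := range strings.Split(js, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, prefix) {
			continue
		}
		rest := strings.TrimSpace(strings.TrimPrefix(line, prefix))
		if !strings.HasPrefix(rest, "=") {
			continue // a longer property name, e.g. test.campaignId
		}
		value := strings.TrimSpace(strings.TrimPrefix(rest, "="))
		value = strings.TrimSuffix(value, ";")
		return strings.Trim(value, `"'`)
	}
	return ""
}

func main() {
	js := `
    var test = {};
    test.campaign = "8d26113ba";
    test.isTest = "false";
    test.sitegroup = "Homepage";
`
	fmt.Println(lookupVar(js, "campaign"))      // 8d26113ba
	fmt.Println(lookupVar(js, "sitegroup"))     // Homepage
	fmt.Println(lookupVar(js, "missing") == "") // true
}
```

This stays fragile by design: minified scripts or several statements on one line would need a regex or a real JS parser instead.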

huangapple
  • Published on August 7, 2015 at 04:01:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/31864758.html