Golang Regex to replace domain to proxy URL

Question

I want to rewrite every link in a web page so that it points at a reverse proxy domain.

The rules are

https://test.com/xxx --> https_test_com.proxy.com/xxx
http://sub.test.com/xxx --> http_sub_test_com.proxy.com/xxx

How can I do this with a regex in Go?

The response body is a []byte, encoded as UTF-8.
This is what I have tried, but it does not replace the dots in the original domain with underscores. The subdomain depth is variable, so the number of dots can vary.

respBytes := []byte(`_.Xc=function(a){var b=window.google&&window.google.logUrl?"":"https://www.google.com";b+="/gen_204?";b+=a.j(2040-b.length);
		<cite class="iUh30 Zu0yb tjvcx">https://cloud.google.com</cite></div><div class="eFM0qc"><a class="fl" href="https://webcache.googleusercontent.com/search?q=cache:80SWJ_cSDhwJ:https://cloud.google.com/+&cd=1&hl=en&ct=clnk&gl=au" ping="/url?sa=t&source=web&rct=j&url=https://webcache.googleusercontent.com/search%3Fq%3Dcache:80SWJ_cSDhwJ:https://cloud.google.com/%2B%26cd%3D1%26hl%3Den%26ct%3Dclnk%26gl%3Dau&ved=2ahUKEwia5ovYsv3xAhXS4jgGHad0BJYQIDAAegQIBRAG"><span>Cached</span></a></li><li class="action-menu-item OhScic zsYMMe" role="menuitem"><a class="fl" href="/search?q=related:https://cloud.google.com/+google+cloud&sa=X&ved=2ahUKEwia5ovYsv3xAhXS4jgGHad0BJYQHzAAegQIBRAH">
		`)
proxyURI := "proxy.com"
var re = regexp.MustCompile(`(http[s]*):\/\/([a-zA-Z0-9_\-.:]*)`)
content := re.ReplaceAll(respBytes, []byte("${1}_${2}."+proxyURI))

| Origin | Actual result | Expected result |
| --- | --- | --- |
| https://www.google.com | https_www.google.com.proxy.com | https_www_google_com.proxy.com |
| https://cloud.google.com | https_cloud.google.com.proxy.com | https_cloud_google_com.proxy.com |
| https://webcache.googleusercontent.com | https_webcache.googleusercontent.com.proxy.com | https_webcache_googleusercontent_com.proxy.com |

Answer 1

Score: 0

Here's how you can do this:

func replaceAndPrint() {
	src := `
<a href="https://test.com/xxx">link 1</a>
<a href="https://test.com/yyy">link 2</a>
`
	r := regexp.MustCompile("\"https://(test\\.com.*)\"")
	result := r.ReplaceAllString(src, "http://sub.$1")
	fmt.Println(result)
}

Output:

<a href=http://sub.test.com/xxx>link 1</a>
<a href=http://sub.test.com/yyy>link 2</a>

Explanation:
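The argument to regexp.MustCompile defines a capturing group (the part inside the parentheses), and $1 in the call to r.ReplaceAllString refers to the text that group matched. The surrounding double quotes are matched outside the group, which is why they are missing from the output.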
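
One detail of group references matters for the question's "scheme_host" format: in a Go replacement template, $1_ is read as a reference to a group named "1_", not as group 1 followed by an underscore. The sketch below illustrates this with a simplified pattern; the helper name expand and the pattern itself are assumptions made for the illustration, not part of the original answer.

```go
package main

import (
	"fmt"
	"regexp"
)

// expand is a hypothetical helper for this illustration: it applies a
// replacement template to every scheme://host match found in s.
func expand(s, template string) string {
	// Group 1 is the scheme, group 2 is a simplified host (lowercase letters and dots only).
	re := regexp.MustCompile(`(https?)://([a-z.]+)`)
	return re.ReplaceAllString(s, template)
}

func main() {
	// "$1_" names a group called "1_", which does not exist, so it expands to nothing.
	fmt.Println(expand("https://test.com", "$1_$2")) // test.com

	// Braces end the group name explicitly, so the underscore is kept as literal text.
	fmt.Println(expand("https://test.com", "${1}_${2}")) // https_test.com
}
```

This is why the attempt in the question writes ${1}_${2} rather than $1_$2. Note that a template can only copy a group verbatim; it cannot change the dots inside the host.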

UPDATE:

Sorry, I misread the example.

Here's an updated version:

func replaceAndPrint2() {
	src := `
<a href="http://test.com/xxx">link 1</a>
<a href="https://sub1.sub2.test.com/yyy">link 2</a>
`
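	// Match either "://" or "." together with the label that follows it.
	r := regexp.MustCompile("(\\.|://)([^./]*)")
	// Turn both separators into underscores.
	replacer := strings.NewReplacer("://", "_", ".", "_")
	res := r.ReplaceAllStringFunc(src, func(g string) string {
		// Append the proxy domain after the top-level label; only ".com" is handled here.
		if g == ".com" {
			return replacer.Replace(g) + ".proxy.com"
		}
		return replacer.Replace(g)
	})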
	fmt.Println(res)
}

Output:

<a href="http_test_com.proxy.com/xxx">link 1</a>
<a href="https_sub1_sub2_test_com.proxy.com/yyy">link 2</a>
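
The updated version keys on the literal ".com" and rewrites every dot in the input, including dots outside URLs (for example window.google in the question's sample). A more targeted variant might match the scheme and host together and transform only the captured host. The following is a minimal sketch under stated assumptions: the helper name rewriteLinks, the variable linkRe, and the narrow host pattern (letters, digits, hyphens, and dots, with no port or userinfo) are choices made for this illustration. It works on []byte, as the question requires.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// linkRe captures the scheme (group 1) and the host (group 2) of an absolute URL.
// Assumption of this sketch: the host starts and ends with a letter or digit and
// contains only letters, digits, '-' and '.'; ports and userinfo are not handled.
var linkRe = regexp.MustCompile(`(https?)://([a-zA-Z0-9](?:[a-zA-Z0-9\-.]*[a-zA-Z0-9])?)`)

// rewriteLinks is a hypothetical helper: it replaces every scheme://host in body
// with scheme_host.proxyHost, where the dots of the original host become underscores.
func rewriteLinks(body []byte, proxyHost string) []byte {
	return linkRe.ReplaceAllFunc(body, func(m []byte) []byte {
		// ReplaceAllFunc passes only the matched text, so match again to get the groups.
		sub := linkRe.FindSubmatch(m)
		if sub == nil {
			return m
		}
		host := bytes.ReplaceAll(sub[2], []byte("."), []byte("_"))
		out := make([]byte, 0, len(m)+len(proxyHost)+1)
		out = append(out, sub[1]...) // scheme
		out = append(out, '_')
		out = append(out, host...) // host with underscores
		out = append(out, '.')
		out = append(out, proxyHost...)
		return out
	})
}

func main() {
	body := []byte(`<a href="https://test.com/xxx">a</a> <a href="http://sub.test.com/xxx">b</a>`)
	fmt.Printf("%s\n", rewriteLinks(body, "proxy.com"))
	// <a href="https_test_com.proxy.com/xxx">a</a> <a href="http_sub_test_com.proxy.com/xxx">b</a>
}
```

Because the pattern matches anywhere in the body, URLs embedded inside query strings are rewritten as well. If only attribute values such as href should change, parsing the HTML instead of applying a regex to the raw bytes would be the safer route.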

huangapple
  • Posted on July 25, 2021, 10:33:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/68515110.html