Get Unicode characters as a string when reading a response body (Golang)

Question


I'm scraping a website that was written in Polish, meaning it contains characters such as ź and ę.

When I attempt to parse the HTML, either using the html package or by splitting the response body string, I get output like this:

���~♦�♀�����r�▬֭��↔��q���y���<p��19��lFۯ☻→Z�7��

I'm currently using the following to get the string:

bodyBytes, err := ioutil.ReadAll(resp.Body)
if err != nil {
  // handle
}
bodyString := string(bodyBytes)

How can I get the text in readable format?

Answer 1

Score: 2


Update:

Since the Content-Encoding of the response was gzip, the body had to be decompressed before converting it to a string. The code below worked for getting the response as a printable string:

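// Wrap the raw body in a gzip reader so reads return decompressed bytes.
gReader, err := gzip.NewReader(resp.Body)
if err != nil {
	return err
}
// Release the gzip reader on every return path, including read errors.
defer gReader.Close()
gBytes, err := ioutil.ReadAll(gReader)
if err != nil {
	return err
}
bodyStr := string(gBytes)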
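
As a hedged sketch of the same idea, the decompression can be made conditional on the Content-Encoding header value, so that uncompressed responses still pass through unchanged. The names readBody and roundTrip are hypothetical helpers introduced only for this illustration; roundTrip compresses a string in memory to stand in for a real gzip response. Note that Go's HTTP transport normally decompresses gzip transparently, and that it leaves the body compressed when the request set Accept-Encoding: gzip explicitly, which is one plausible reason for the output in the question.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// readBody returns the body as a string, gunzipping it first when the
// Content-Encoding header value says the payload is gzip-compressed.
// (Hypothetical helper, not part of the original answer.)
func readBody(contentEncoding string, body io.Reader) (string, error) {
	if strings.EqualFold(strings.TrimSpace(contentEncoding), "gzip") {
		gz, err := gzip.NewReader(body)
		if err != nil {
			return "", err
		}
		defer gz.Close()
		body = gz
	}
	b, err := io.ReadAll(body)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// roundTrip gzips s in memory and decodes it again through readBody,
// standing in for a real compressed HTTP response.
func roundTrip(s string) (string, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(s)); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return readBody("gzip", &buf)
}

func main() {
	out, err := roundTrip("<p>źdźbło, gęś</p>")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

With a real response, the call would look like readBody(resp.Header.Get("Content-Encoding"), resp.Body).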

Answer 2

Score: 1


Which website are you working with?
I get the correct characters when I test against a Wikipedia page:

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	resp, err := http.Get("https://en.wikipedia.org/wiki/Polish_alphabet")
	if err != nil {
		// Stop here: resp is nil on error, so resp.Body must not be touched.
		panic(err)
	}
	defer resp.Body.Close()
	// Read the whole body; it is already decompressed by the HTTP transport.
	b, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	bodyStr := string(b)
	fmt.Println(bodyStr)
}

<td>Ą</td>
<td>Ć</td>
<td>Ę</td>
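
When one site prints correctly and another does not, a quick look at the raw bytes can help narrow down the cause. The following is only a rough diagnostic sketch: looksGzipped and describe are hypothetical helpers, and the check relies on the gzip magic number (0x1f 0x8b) and on utf8.Valid from the standard library. It cannot identify other charsets or compression formats; it only flags that the bytes are not UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// looksGzipped reports whether b starts with the gzip magic number (0x1f 0x8b).
func looksGzipped(b []byte) bool {
	return len(b) >= 2 && b[0] == 0x1f && b[1] == 0x8b
}

// describe gives a rough guess at what kind of bytes were received.
// (Hypothetical helper for illustration only.)
func describe(b []byte) string {
	switch {
	case looksGzipped(b):
		return "gzip-compressed"
	case utf8.Valid(b):
		return "valid UTF-8"
	default:
		return "not UTF-8 (possibly another charset or compression)"
	}
}

func main() {
	fmt.Println(describe([]byte("<td>Ę</td>")))           // UTF-8 text
	fmt.Println(describe([]byte{0x1f, 0x8b, 0x08, 0x00})) // start of a gzip stream
	fmt.Println(describe([]byte{0xea, 0xb3}))             // bytes that are not valid UTF-8
}
```

Passing the bytes returned by io.ReadAll(resp.Body) to describe would indicate whether the body still needs decompressing before string(b) can produce readable text.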

huangapple
  • Posted on 2022-10-04 19:28:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/73947208.html