Get Unicode characters as a string when reading a response body (Golang)
Question
I'm scraping a website written in Polish, so it contains characters such as ź and ę.
When I try to parse the HTML, either with the html package or by splitting the response body string, I get output like this:
���~♦�♀�����r�▬֭��↔��q���y���<p��19��lFۯ☻→Z�7��
I'm currently using the following to get the string:
bodyBytes, err := ioutil.ReadAll(resp.Body)
if err != nil {
	// handle error
}
bodyString := string(bodyBytes)
How can I get the text in a readable format?
Answer 1
Score: 2
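Update:
Since the response had a Content-Encoding of gzip, the body has to be decompressed before it is converted to a string. The code below worked for getting the response as a printable string:
// Requires the compress/gzip and io packages.
// Wrap the compressed body in a gzip reader.
gReader, err := gzip.NewReader(resp.Body)
if err != nil {
	return err
}
defer gReader.Close()
// Read the decompressed bytes (use ioutil.ReadAll before Go 1.16).
gBytes, err := io.ReadAll(gReader)
if err != nil {
	return err
}
bodyStr := string(gBytes)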
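The snippet above assumes the body is always gzip-compressed. A slightly more defensive sketch checks the Content-Encoding header first and only then decompresses. Note that Go's default HTTP transport normally decompresses gzip on its own and removes that header, so a compressed body typically reaches your code only when the request sets Accept-Encoding itself. The names readBody, gzipString, and fakeResponse are hypothetical helpers invented for this illustration; fakeResponse builds an in-memory response so the example runs without network access.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// readBody returns the response body as a string, decompressing it first
// when the Content-Encoding header says the payload is gzip-compressed.
// (Hypothetical helper name; not part of net/http.)
func readBody(resp *http.Response) (string, error) {
	defer resp.Body.Close()

	var r io.Reader = resp.Body
	if strings.EqualFold(resp.Header.Get("Content-Encoding"), "gzip") {
		gz, err := gzip.NewReader(resp.Body)
		if err != nil {
			return "", fmt.Errorf("open gzip stream: %w", err)
		}
		defer gz.Close()
		r = gz
	}

	b, err := io.ReadAll(r)
	if err != nil {
		return "", fmt.Errorf("read body: %w", err)
	}
	// Go strings hold raw bytes, so UTF-8 text such as ź and ę survives as-is.
	return string(b), nil
}

// gzipString compresses s; it exists only to build demo input.
func gzipString(s string) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(s)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// fakeResponse builds an in-memory *http.Response so the sketch runs offline.
func fakeResponse(encoding string, payload []byte) *http.Response {
	h := http.Header{}
	if encoding != "" {
		h.Set("Content-Encoding", encoding)
	}
	return &http.Response{
		Header: h,
		Body:   io.NopCloser(bytes.NewReader(payload)),
	}
}

func main() {
	payload, err := gzipString("<td>Ę</td><td>Ź</td>")
	if err != nil {
		panic(err)
	}
	body, err := readBody(fakeResponse("gzip", payload))
	if err != nil {
		panic(err)
	}
	fmt.Println(body) // prints <td>Ę</td><td>Ź</td>
}
```

In real code, readBody would be called with the response returned by http.Get or client.Do. If the page is served in a non-UTF-8 charset, a separate transcoding step would still be needed; this sketch only handles compression.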
Answer 2
Score: 1
Which website are you working with?
I get the correct characters when I test against a Wikipedia page:
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	resp, err := http.Get("https://en.wikipedia.org/wiki/Polish_alphabet")
	if err != nil {
		// Stop here: resp is nil when the request fails.
		panic(err)
	}
	defer resp.Body.Close()

	// Read the whole body and convert the bytes to a string.
	b, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	bodyStr := string(b)
	fmt.Println(bodyStr)
}
The output contains, for example:
<td>Ą</td>
<td>Ć</td>
<td>Ę</td>