Android/Jsoup: how to fix encoding issues

huangapple

Question

I'm developing an app to fetch legislation online and automatically parse and format it to fit the app. The test site I'm using is:

http://www.planalto.gov.br/ccivil_03/constituicao/constituicao.htm

I want to grab all the contents of that URL, parse (maybe clean) them and put them in a file. I'm using Jsoup; this is the Runnable I use to connect and write the content to a file:

class FetchHtmlRunnable implements Runnable {
    String url;

    FetchHtmlRunnable(String url) {
        this.url = url;
    }

    @Override
    public void run() {
        try {
            Document doc = Jsoup.parse(new URL(url), 10000);
            doc.charset(Charset.forName("windows-1252"));
            Charset charset = doc.charset();

            String htmlString = Jsoup.clean(doc.toString(), new Whitelist());

            Log.d(TAG, "run: HTMLSTRING: " + htmlString);

            String root = context.getFilesDir().toString();
            file = new File(root + File.separator + "law.txt");

            OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file, false), charset);
            out.write(htmlString);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}

However, even though Chrome tells me the site's encoding is windows-1252, both the log entry and the file are filled with replacement characters (every character with a diacritic, such as í and ã, is lost), and all line breaks are lost as well:

Constitui��o Presid�ncia da Rep�blica Casa Civil Subchefia para Assuntos Jur�dicos CONSTITUI��O DA REP�BLICA FEDERATIVA DO BRASIL DE 1988 Vide Emenda Constitucional n� 91, de 2016 Vide Emenda Constitucional n� 106, de 2020 Vide Emenda Constitucional n� 107, de 2020 Emendas Constitucionais Emendas Constitucionais de Revis�o Ato das Disposi��es Constitucionais Transit�rias Atos decorrentes do disposto no � 3� do art. 5� �NDICE TEM�TICO Texto compilado PRE�MBULO N�s, representantes do povo brasileiro, reunidos em Assembl�ia Nacional Constituinte para instituir um Estado Democr�tico, destinado a assegurar o exerc�cio dos direitos sociais e individuais, a liberdade, a seguran�a, o bem-estar, o desenvolvimento, a igualdade e a justi�a como valores supremos de uma sociedade fraterna, pluralista e sem preconceitos, fundada na harmonia social e comprometida

Maybe someone better at web dev can tell me whether that's a problem with the webpage itself and how I can work around it, and also how I can keep the newline characters.

Answer 1

Score: 2

I'll get to character sets in Portuguese, Spanish (and Chinese) in a moment. First, though, note that the page you are trying to read actually loads its contents using "AJAX / JS". I can download AJAX content using my own library, available on the Internet; otherwise a tool like Selenium, Puppeteer, or Splash would be necessary. Character sets aside, how are you downloading the contents of your "Brazilian Constitution" as HTML in the first place? When I try a straight HTML downloader (no script execution), I get a pile of JavaScript without any Portuguese at all, and it looks nothing like the output posted in your question. :)

If you are already downloading the HTML and only have a problem with the character set, read on. If you have been unable to download anything but the AJAX / JavaScript calls, I can post a separate answer that explains executing JS / AJAX in one or two lines. (Essentially, what you posted isn't the same output that I'm getting.)


In 99.9999% of cases, if the text is not plain "ASCII" (because it has foreign-language characters), then it is (almost) guaranteed to be readable using the "UTF-8" character set. I translate Spanish and Chinese news articles, and UTF-8 always works for me. I had one Spanish site that expected an encoding called "iso8859-1", but other than the "Don Quijote de La Mancha" site where I found it, UTF-8 works.
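One way to check whether UTF-8 really fits a given page is to decode the same bytes twice and compare. The sketch below is only an illustration built on an assumption taken from the question (that the server sends windows-1252, as Chrome reports); the class name `MojibakeDemo` and its helper are hypothetical and not part of any library. It shows that windows-1252 bytes decoded as UTF-8 produce U+FFFD replacement characters, which resembles the output posted in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Assumption: the page bytes are windows-1252, as Chrome reports in the question.
    static final Charset PAGE_CHARSET = Charset.forName("windows-1252");

    /** Decodes raw bytes with the given charset; malformed input becomes U+FFFD. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "Constituição", written with Unicode escapes so the source file encoding does not matter.
        byte[] raw = "Constitui\u00e7\u00e3o".getBytes(PAGE_CHARSET);

        // Wrong charset: the bytes for the accented letters are not valid UTF-8 sequences.
        System.out.println(decode(raw, StandardCharsets.UTF_8));

        // Matching charset: the original text comes back intact.
        System.out.println(decode(raw, PAGE_CHARSET));
    }
}
```

If the first line printed contains replacement characters and the second does not, the page is not UTF-8 and the decoder needs to be told the real charset.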

To tell you the truth, it has never been an issue for me, because when reading a web page (as opposed to writing one), Java has automatically parsed the text as if it were UTF-8 without any configuration whatsoever. Here is the "Open Connection" method body from a library I have written:

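// Open an HTTP GET connection; USE_USER_AGENT and USER_AGENT come from the surrounding library code.
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");
if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);
// No charset is passed here, so InputStreamReader decodes with the platform default charset.
return new BufferedReader(new InputStreamReader(con.getInputStream()));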
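A caveat on the snippet above: an `InputStreamReader` constructed without a charset uses the platform default (UTF-8 on Android), so a page served in another encoding will not decode cleanly. A minimal sketch of the same read loop with an explicit charset follows; `ExplicitCharsetReader` and `readAll` are hypothetical names, and the windows-1252 value is an assumption taken from the question, not something verified against the server.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class ExplicitCharsetReader {
    /** Reads the whole stream as text with the given charset, ending every line with '\n'. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            // readLine() drops the line terminator, so it is appended again explicitly.
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed page encoding
        // Stand-in for con.getInputStream(): two lines of windows-1252 bytes.
        byte[] raw = "Presid\u00eancia da Rep\u00fablica\r\nCasa Civil".getBytes(cp1252);
        System.out.print(readAll(new ByteArrayInputStream(raw), cp1252));
    }
}
```

With a real connection, `con.getInputStream()` would take the place of the `ByteArrayInputStream`. Jsoup also offers `Jsoup.parse(InputStream in, String charsetName, String baseUri)`, which accepts the charset name directly.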

Here is the method body of a "Scrape Contents" method from my library:

URL url = new URL("http://www.planalto.gov.br/ccivil_03/constituicao/constituicao.htm");
StringBuilder sb = new StringBuilder();
String s;
BufferedReader br = Scrape.openConn(url);
// readLine() strips line terminators, so "\n" is re-appended to keep the line breaks.
while ((s = br.readLine()) != null) sb.append(s).append("\n");
FileRW.writeFile(sb.toString(), "page.html");
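`FileRW.writeFile` belongs to my library and its internals are not shown here. As a rough stand-in using only `java.io`, the sketch below writes text with an explicit charset and closes the writer through try-with-resources, so buffered characters are flushed (the `OutputStreamWriter` in the question is never closed). The names `CharsetFileWriter` and `writeText` are hypothetical, and windows-1252 is again just the encoding assumed in the question.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

final class CharsetFileWriter {
    /** Encodes text with the given charset and writes it to target, then closes target. */
    static void writeText(OutputStream target, String text, Charset charset) {
        // try-with-resources flushes and closes the writer (and the wrapped stream).
        try (Writer out = new OutputStreamWriter(target, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the app's "law.txt".
        File file = File.createTempFile("law", ".txt");
        file.deleteOnExit();
        writeText(new FileOutputStream(file, false),
                "Constitui\u00e7\u00e3o\n", Charset.forName("windows-1252"));
        System.out.println("bytes written: " + file.length());
    }
}
```

Whichever charset is chosen for writing, whoever reads the file later has to use the same one; writing UTF-8 is usually the simpler choice once the text has been decoded correctly.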

To be fully honest with you, I don't know the first thing about Microsoft character sets. I have coded on UNIX and have never worried about character sets, other than making sure that, when writing HTML (as opposed to reading it), an HTML <META CHARSET="utf-8"> element is inserted into my pages.
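On the reading side, the charset a server declares can often be taken from the HTTP `Content-Type` header, which `URLConnection.getContentType()` returns as a string such as `text/html; charset=windows-1252`. The helper below is a hypothetical sketch of parsing that value, with a caller-supplied fallback for when no usable charset is declared. Whether the page in the question sends such a header was not checked.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {
    /** Returns the charset named in a Content-Type value, or fallback if none is usable. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Strip the parameter name and any surrounding double quotes.
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = fromContentType("text/html; charset=windows-1252", StandardCharsets.UTF_8);
        System.out.println(cs.name());
    }
}
```

The result can then be passed to `new InputStreamReader(con.getInputStream(), cs)`. If the header carries no charset, the page may still declare one in a `<meta>` tag, which this sketch does not inspect.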

huangapple
  • Posted on September 23, 2020, 22:58:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64030754.html