Retaining special characters while reading from HTML in Java?
Question
I am trying to read an HTML source file that contains German characters such as ä ö ü ß €.
Reading with JSoup:
citAttr.nextElementSibling().text()
Escaping the string with org.apache.commons.lang3.text.translate.UnicodeEscaper:
unicodeEscaper.translate(citAttr.nextElementSibling().text())
The issue is that after reading, the characters turn into �.
However, when reading a CSV file encoded as UTF-8, saving and retrieving the characters with the same unicodeEscaper works fine:
unicodeEscaper.translate(record.get(headerPosition.get(0)))
What is the issue with reading from HTML? I also tried the StringUtilEscaper methods, but the characters still turn into �.
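private String getText(Part p) throws MessagingException, IOException {
    if (p.isMimeType("text/*")) {
        // For text parts, getContent() returns the body already decoded to a String
        String s = (String) p.getContent();
        // textIsHtml is a field of the enclosing class
        textIsHtml = p.isMimeType("text/html");
        return s;
    }
    return null; // not a text part
}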
This is how I am reading emails that have HTML content.
Answer 1
Score: 1
I just answered a similar question today... I guess I can just type what I know about extended character sets (foreign-language characters), since that's one of the major facets of the software I write.
- Java's internal Strings all use 16-bit chars. (The primitive type char is a 16-bit value that holds one UTF-16 code unit. The name UTF-8 is a little misleading here: it is a variable-length byte encoding of the same Unicode space, using one to four bytes per character, and it is not the in-memory format.) This means that Java, and Java Strings, have no problem representing the entire Unicode range of foreign-language alphabets.
- JSoup, and just about any HTML tool written in Java, will return a downloaded page as Java Strings of 16-bit characters just fine, without any problems. If there are problems viewing these ranges, it is likely not the download process, nor a JSoup or HttpURLConnection setting. When you save a web page to a String in Java, you haven't lost those characters; you essentially get Unicode "for free."
- HOWEVER: whenever a programmer saves a String as UTF-8 to a '.txt' file or an '.html' file and then views that file in a web browser, all you might see is that annoying question mark: �. This is because you need to let the web browser know that the '.html' file you saved using Java is not meant to be interpreted using a (much older, much smaller) single-byte encoding such as ASCII or ISO-8859-1.
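To make the points above concrete, here is a minimal sketch using only the Java standard library. The class name CharsetDemo and its two helper methods are made up for illustration; they are not part of JSoup or of the asker's code. It shows that a String survives a UTF-8 round trip unchanged, and that � (U+FFFD) shows up when bytes are decoded with a charset other than the one used to encode them.

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Unicode escapes keep this example independent of the source file's own encoding.
    static final String GERMAN = "\u00e4 \u00f6 \u00fc \u00df \u20ac"; // ä ö ü ß €

    /** Encodes to UTF-8 bytes and decodes them again with the same charset. */
    static String roundTripUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Encodes as ISO-8859-1 but decodes as UTF-8, i.e. a deliberate charset mismatch. */
    static String decodeLatin1AsUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Same charset on both sides: the text comes back unchanged.
        System.out.println(GERMAN.equals(roundTripUtf8(GERMAN)));

        // Mismatched charsets: malformed byte sequences are replaced by U+FFFD.
        // (The euro sign is left out here because ISO-8859-1 cannot encode it.)
        String broken = decodeLatin1AsUtf8("\u00e4 \u00f6 \u00fc \u00df");
        System.out.println(broken.indexOf('\uFFFD') >= 0);
    }
}
```

The String itself is never the problem; the damage happens at the boundary where bytes become characters or characters become bytes.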
If you view an '.html' file in any web browser, or upload that file to Google Cloud Platform (or some hosting site), you must do one of two things:
- Include the <meta> tag mentioned in the comments, <meta charset="UTF-8">, in the HTML page's <head> ... </head> section.
- Or use whatever setting your hosting platform provides to identify the file as 'text/html; charset=UTF-8'. In Google Cloud Platform Storage buckets there is a popup menu to assign this setting to any file.
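As a rough sketch of the first option, the following writes a small page that declares its charset and saves it as UTF-8 bytes, so the declaration and the actual file encoding agree. The class name SaveHtmlUtf8, its methods, and the temp-file location are assumptions made for this example only; the body text is inserted without HTML escaping to keep it short.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveHtmlUtf8 {
    /** Wraps the text in a minimal page that declares its own charset (no HTML escaping). */
    static String buildPage(String bodyText) {
        return "<!DOCTYPE html>\n"
             + "<html>\n<head>\n<meta charset=\"UTF-8\">\n</head>\n"
             + "<body><p>" + bodyText + "</p></body>\n</html>\n";
    }

    /** Writes the page as UTF-8 bytes, matching the charset declared in the page. */
    static void save(Path file, String html) throws IOException {
        Files.write(file, html.getBytes(StandardCharsets.UTF_8));
    }

    /** Reads the file back, decoding with the same charset that was used to write it. */
    static String load(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("german", ".html"); // hypothetical output location
        String html = buildPage("\u00e4 \u00f6 \u00fc \u00df \u20ac"); // ä ö ü ß €
        save(file, html);
        System.out.println(html.equals(load(file)));
        Files.delete(file);
    }
}
```

Passing the charset explicitly on both the write and the read avoids depending on the platform default encoding, which can differ between machines.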