String encoding when using maps

Question

I have a feeling that my string (with diacritic characters) uses one encoding in my class and a different one inside a HashMap (the same happens with other Map implementations). The string is defined in my class. I use it as a key in a map and put a value under it, but when I try to get that value back by the key, the lookup fails. The odd part: it works as expected in IntelliJ's expression evaluation.
Some specifics:

> IntelliJ IDEA 2019.3.1 (Community Edition)
> Build #IC-193.5662.53, built on December 18, 2019
> Runtime version: 11.0.5+10-b520.17 amd64
> VM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
> Windows 10 10.0
> GC: ParNew, ConcurrentMarkSweep
> Memory: 1986M
> Cores: 4
> Registry:
> Non-Bundled Plugins:

> SDK used: Java 1.8.0_231

To check whether the case is reproducible, I created this JUnit test:

@Test
public void test() {
    Map<String, String> map = new TreeMap<>();
    map.put("język", "polski");
    String res = map.get("język");
    System.out.println(res);
}
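
A hedged aside on this test: a "?" in place of "ę" is what Java produces when a string is encoded with a charset that has no mapping for that character, such as windows-1252. The sketch below only illustrates that mechanism. The class and method names are hypothetical, and whether this is what happens in the setup described here is an assumption.

```java
import java.nio.charset.Charset;

class QuestionMarkDemo {
    // Encodes the text with the given charset and decodes it again, which is
    // roughly what happens when text is written through a stream using that charset.
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // "język" written with a Unicode escape, so the source-file encoding does not matter
        String word = "j\u0119zyk";
        // windows-1252 has no mapping for U+0119, so the character is replaced by '?'
        System.out.println(roundTrip(word, "windows-1252").equals("j?zyk"));
        // UTF-8 can represent every character, so the round trip is lossless
        System.out.println(roundTrip(word, "UTF-8").equals(word));
    }
}
```

Both lines print true. Booleans are printed instead of the strings so that the console's own encoding cannot distort the result.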

When the word "język" is put into the map it is converted to "j?zyk", but when it is read back from the map it is also converted to "j?zyk", so everything looks fine. In my production code it is more complicated. I build the map from a list of strings using this code:

private Map<String, String> getBookDetails(HtmlElement from) {
    HtmlElement bookDetails = BOOK_DETAILS.getFirst(from);
    return Arrays.stream(bookDetails.asText().split(BOOK_DETAILS_SEPARATOR))
            .collect(MappingErrors.collector());
}

Output of bookDetails.asXml():

<div class="collapse d-xs-none" id="book-details">
  <dl>
    <dt>
      Tytuł oryginału:
    </dt>
    <dd>
      Wat?
    </dd>
    <dt>
      Data wydania:
    </dt>
    <dd>
      2016-05-16
    </dd>
    <dt data-toggle="tooltip" title="Data pierwszego wydania polskiego">
      Data 1. wyd. pol.:
    </dt>
    <dd>
      2016-05-16
    </dd>
    <dt>
      Liczba stron:
    </dt>
    <dd>
      20
    </dd>
    <dt>
      Język:
    </dt>
    <dd>
      polski
    </dd>
    <dt>
      ISBN:
    </dt>
    <dd>
      9788374206600
    </dd>
    <dt>
      Tłumacz:
    </dt>
    <dd>
      <a href="https://lubimyczytac.pl/tlumacz/10593/ryszard-turczyn">
        Ryszard Turczyn
      </a>
    </dd>
    <dt class="d-lg-none">
      Wydawnictwo:
    </dt>
    <dd class="d-lg-none">
      <a href="https://lubimyczytac.pl/wydawnictwo/13832/wydawnictwo-adamada/ksiazki">
        Wydawnictwo Adamada
      </a>
    </dd>
  </dl>
</div>

Missing variables:

private String BOOK_DETAILS_SEPARATOR = "\r\n";

static final DefinedHtmlElement BOOK_DETAILS =
        new DefinedHtmlElement("div", "id", "book-details");

DefinedHtmlElement inner class:

static class DefinedHtmlElement {
        String elementName;
        String attributeName;
        String attributeValue;

        DefinedHtmlElement (String elementName, String attributeName, String attributeValue) {
            this.attributeName = attributeName;
            this.elementName = elementName;
            this.attributeValue = attributeValue;
        }

        public String getAttributeName() {
            return attributeName;
        }

        public String getAttributeValue() {
            return attributeValue;
        }

        public String getElementName() {
            return elementName;
        }

        public HtmlElement getFirst(HtmlElement element) {
            return element
                    .getElementsByAttribute(elementName, attributeName, attributeValue)
                    .stream().findFirst().orElse(null);
        }
    }

And the collector:

private static final class MappingErrors {

    private static int counter = 1;

    private Map<String, String> map = new TreeMap<>();

    private String first;
    private String second;

    public void accept(String str) {
        first = second;
        second = str;
        if (first != null && counter % 2 == 0) {
            map.put(first.trim(), second.trim());
        }
        counter++;
    }

    public MappingErrors combine(MappingErrors other) {
        throw new UnsupportedOperationException("Parallel Stream not supported");
    }

    public Map<String, String> finish() {
        return map;
    }

    public static Collector<String, ?, Map<String, String>> collector() {
        return Collector.of(MappingErrors::new, MappingErrors::accept, MappingErrors::combine,
                MappingErrors::finish);
    }

}

The fun fact is that when a key/value pair is put into the map, the key is not converted to the question-mark version; it is written as it should be. But when I try to get the value by key, the key string is converted, no matching key is found, and the code does not work. I try to use the word "Język:" as a key and get "JÄ™zyk:". Again, during a normal run or debug I can't find the value by key, but during evaluation it works as expected.

I have no idea where to look for the root cause. I checked that all files have the same encoding (UTF-8 and windows-1252 behave the same way in this case) and that the whole project is set to the same encoding. There are no input files, only scraping from a web page and getting a String from com.gargoylesoftware.htmlunit.html.HtmlElement, in case that matters. Does anyone have an idea where to find the root cause? Is encoding the right clue, or is it something totally different? Of course I can create a workaround that replaces all diacritic characters with plain ones, but I want to understand what is happening.
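
The encoding a file is saved in and the encoding the JVM assumes by default are separate settings, so it can help to print what the running JVM actually uses. A minimal sketch, with a hypothetical class name; it only reports the runtime defaults and says nothing about which encoding the compiler used to read the source files (javac takes that from its -encoding option, or from the platform default when the option is absent).

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Builds a one-line summary of the charset settings of the running JVM.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this once from the test runner and once from the normal run configuration would show whether the two environments start with different defaults.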

UPDATE:
I found out that the data coming from gargoylesoftware is different. It is not caused by the way the map is filled and is not connected to the map at all (the map is just the first place where this becomes visible). I modified the code a little:

private Map<String, String> getBookDetails(HtmlElement from) {
    HtmlElement bookDetails = BOOK_DETAILS.getFirst(from);
    String[] split = bookDetails.asText().split(BOOK_DETAILS_SEPARATOR);
    Map<String, String> mapa = new HashMap<>();
    for (int i = 0; i < split.length - 1; i += 2) {
        mapa.put(split[i].trim(), split[i + 1].trim());
        if (split[i].trim().compareTo("Język:") == 0) {
            System.out.println("test");
        }
    }
    mapa.put("Język:", "TEST");
    return mapa;
}

The condition in the if is never true. It is true only during evaluation; the line with println is never reached. The mapa object looks like this:

"Data 1. wyd. pol.:" -> "2016-05-16"
"Liczba stron:" -> "20"
"Data wydania:" -> "2016-05-16"
"Tłumacz:" -> "Ryszard Turczyn"
"JÄ™zyk:" -> "TEST"
"Tytuł oryginału:" -> "Wat?"
"Język:" -> "polski"
"Wydawnictwo:" -> "Wydawnictwo Adamada"
"ISBN:" -> "9788374206600"

So the manually added entry was somehow changed to the "™" version. That by itself is fine, because the same change takes place when getting a value from this map, so the value is found. But why is there a difference between the manually put string and the one from gargoylesoftware?
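
One way to see exactly how two such strings differ, independent of how a console or debugger renders them, is to print their code points. This is a rough sketch with a hypothetical helper name; the two sample strings are written with Unicode escapes and stand in for the two keys shown in the map above.

```java
import java.util.stream.Collectors;

class CodePointDump {
    // Renders each code point of the string as U+XXXX, separated by spaces.
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The key as it appears in the page: J, e-with-ogonek (U+0119), z, y, k, colon
        System.out.println(dump("J\u0119zyk:"));
        // The altered key: U+0119 replaced by the two characters U+00C4 and U+2122
        System.out.println(dump("J\u00C4\u2122zyk:"));
    }
}
```

In the loop from the update, calling dump(split[i].trim()) and dump("Język:") side by side would show which of the two strings carries the unexpected characters.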

Answer 1

Score: 1

I found it! It comes down to the complicated relationship between the encodings in IntelliJ, Windows, and the web page. The data in HtmlElement is UTF-8, a Java String is UTF-16 internally, Windows has its own default encoding, and IntelliJ uses some combination of all of these. I experimented a little with the String constructor and found the right combination:

new String(labelFromHtmlElement.getBytes("UTF-8"), "windows-1252");

Programming with diacritic characters can be complicated :)
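
What that conversion does can be checked in isolation. The sketch below (hypothetical class and method names) takes the UTF-8 bytes of "Język:" and reads them back as windows-1252, which yields exactly the "JÄ™zyk:" form seen in the question. One plausible reading, which is an assumption and not something the answer states, is that the "Język:" literal in a UTF-8 source file was compiled as windows-1252, so the conversion makes the scraped string match the mis-decoded literal.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Same conversion as in the answer: UTF-8 bytes read back as windows-1252.
    static String utf8ReadAsCp1252(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    public static void main(String[] args) {
        String scraped = "J\u0119zyk:"; // "Język:" written with a Unicode escape
        String converted = utf8ReadAsCp1252(scraped);
        // U+0119 is C4 99 in UTF-8; windows-1252 maps C4 to U+00C4 and 99 to U+2122
        System.out.println(converted.equals("J\u00C4\u2122zyk:"));
        // Reversing the two charsets restores the original string
        String restored = new String(converted.getBytes(CP1252), StandardCharsets.UTF_8);
        System.out.println(restored.equals(scraped));
    }
}
```

Both lines print true. If the assumption about the compiler holds, writing the literal as "J\u0119zyk:" or passing -encoding UTF-8 to javac would address the mismatch at its source instead of converting every scraped string.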

Posted by huangapple on October 5, 2020, 16:25:12
Permalink: https://go.coder-hub.com/64204960.html