How to undo a double encoding conversion in PHP

Question

I have text that was converted between two non-UTF-8 encodings and then saved as UTF-8. How can I restore the original text using PHP?

Everything works fine if the text only needs a single conversion involving UTF-8:

    $text = 'РєСѓСЂСЃ';
    $text = mb_convert_encoding($text, "WINDOWS-1251", "UTF-8");
    echo($text);
    // OUTPUT: курс
    // Works!

Things get more complicated if the text was converted between two non-UTF-8 encodings, for example from IBM866 to WINDOWS-1251.

A direct conversion does not work at all:

    $text = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
    $text = mb_convert_encoding($text, "IBM866", "WINDOWS-1251");
    echo($text);
    // OUTPUT: �??�?�?�?�?�?�?�?�?�?�? �?�??�?�?�?�?�?�?�?�?�?�?�?�?�?�? �?�?�?�?�?�?�?�?�??
    // Does not work

Things improved when I added conversions from and to UTF-8:

    $text = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
    $text = mb_convert_encoding($text, "IBM866", "utf-8");
    $text = mb_convert_encoding($text, "IBM866", "WINDOWS-1251");
    $text = mb_convert_encoding($text, "utf-8", "IBM866");
    echo($text);
    // OUTPUT: Определение ?Информационного продукта?
    // Almost works: the "?" characters should be "«" and "»"

For some encoding combinations, neither approach works, for example ISO-8859-1 and IBM866:

    // U+008F and U+00AD are invisible, so they are written as escapes
    $text = "\u{8F}à¨«®¦¥\u{AD}¨¥ 4";
    $text = mb_convert_encoding($text, "ISO-8859-1", "IBM866");
    echo($text);
    // OUTPUT: ???????????????????? 4
    // Does not work

    $text = "\u{8F}à¨«®¦¥\u{AD}¨¥ 4";
    $text = mb_convert_encoding($text, "ISO-8859-1", "utf-8");
    $text = mb_convert_encoding($text, "ISO-8859-1", "IBM866");
    $text = mb_convert_encoding($text, "utf-8", "ISO-8859-1");
    echo($text);
    // OUTPUT: ?????????? 4
    // Does not work

To make sure the original strings are recoverable, I did the same transformations in Python:

    text = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗'
    text = text.encode('cp866').decode('windows-1251')
    print(text)
    # OUTPUT: Определение «Информационного продукта»
    # Works

    # U+008F and U+00AD are invisible, so they are written as escapes
    text = '\x8fà¨«®¦¥\xad¨¥ 4'
    text = text.encode('ISO-8859-1').decode('cp866')
    print(text)
    # OUTPUT: Приложение 4
    # Works

Is it possible to get the same result in PHP as in Python?

Answer 1

Score: 0

Use UConverter::transcode, which converts a string from one character encoding to another.

> Description
>
> public static UConverter::transcode(
> string $str,
> string $toEncoding,
> string $fromEncoding,
> ?array $options = null
> ): string|false
>
> Converts str from fromEncoding to toEncoding.

The following script applies exactly the same encode-then-decode steps as the Python snippet .encode('cp866').decode('cp1251'). Note that the target encoding comes before the source encoding in the argument list.

    <?php
    $text0 = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
    // Encode the UTF-8 text to IBM866 to recover the original bytes
    $text1 = UConverter::transcode($text0, 'IBM866', 'UTF-8');
    // Decode those bytes as CP1251, producing UTF-8
    $text2 = UConverter::transcode($text1, 'UTF-8', 'CP1251');
    var_dump($text2);
    ?>

Output of .\SO\76752650.php:

> string(74) "Определение «Информационного продукта»"

A more compact code snippet (the same result):

    <?php
    $text0 = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
    var_dump(
        UConverter::transcode(
            UConverter::transcode($text0, 'IBM866', 'UTF-8'),
            'UTF-8',
            'CP1251'
        )
    );
    ?>

Answer 2

Score: 0

Thanks to JosefZ for the answer! I adapted it to use mb_convert_encoding, so no additional library is needed:

    $text = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
    $text = mb_convert_encoding($text, "IBM866", "utf-8");
    $text = mb_convert_encoding($text, "utf-8", "WINDOWS-1251");
    echo($text);
    // OUTPUT: Определение «Информационного продукта»

    // U+008F and U+00AD are invisible, so they are written as escapes
    $text = "\u{8F}à¨«®¦¥\u{AD}¨¥ 4";
    $text = mb_convert_encoding($text, "ISO-8859-1", "utf-8");
    $text = mb_convert_encoding($text, "utf-8", "IBM866");
    echo($text);
    // OUTPUT: Приложение 4
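
Both answers follow the same two-step pattern: encode the UTF-8 text into the encoding it was wrongly decoded as, which recovers the original bytes, then decode those bytes with the encoding they were actually written in. The three-step attempt in the question lost « and » because it passed the correctly decoded text through IBM866 once more, and that code page has no code points for those two characters. The pattern could be wrapped in a small function, roughly as sketched below. The function name undo_misdecoding and its parameter names are made up for illustration, and the sketch assumes the mbstring extension is loaded and the input is valid UTF-8.

```php
<?php
/**
 * Hypothetical helper (not part of either answer): undo a single wrong decode.
 *
 * $wrongEncoding  is the encoding the bytes were mistakenly read as.
 * $actualEncoding is the encoding the bytes were really written in.
 */
function undo_misdecoding(string $text, string $wrongEncoding, string $actualEncoding): string
{
    // Step 1: encode the UTF-8 text back into the wrong encoding,
    // which restores the original byte sequence.
    $bytes = mb_convert_encoding($text, $wrongEncoding, 'UTF-8');

    // Step 2: decode those bytes with the encoding they were actually in,
    // producing readable UTF-8.
    return mb_convert_encoding($bytes, 'UTF-8', $actualEncoding);
}

// Same input and encodings as the first example in the answer above.
echo undo_misdecoding('╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗', 'IBM866', 'WINDOWS-1251'), PHP_EOL;
// Определение «Информационного продукта»
```

The argument order mirrors Python's text.encode(wrong).decode(actual), so the two encodings from the Python check can be passed in the same order.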

huangapple
  • Posted on July 24, 2023, 16:28:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76752650.html