How to restore double encoding change in PHP
Question
I have text that was converted between two non-UTF-8 encodings and then saved as UTF-8. How can I restore the original text using PHP?
Everything works fine if only a single conversion involving UTF-8 is needed:
$text = 'РєСѓСЂСЃ';
$text = mb_convert_encoding($text, "WINDOWS-1251", "UTF-8");
echo($text);
// OUTPUT: курс
// Works!
Things get more complicated if the text has been converted between two non-UTF-8 encodings, for example from IBM866 to WINDOWS-1251.
Direct conversion does not work at all:
$text = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
$text = mb_convert_encoding($text, "IBM866", "WINDOWS-1251");
echo($text);
// OUTPUT: �??�?�?�?�?�?�?�?�?�?�? �?�??�?�?�?�?�?�?�?�?�?�?�?�?�?�? �?�?�?�?�?�?�?�?�??
// Does not work
Things got better when I added conversion from and to UTF-8:
$text = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
$text = mb_convert_encoding($text, "IBM866", "utf-8");
$text = mb_convert_encoding($text, "IBM866", "WINDOWS-1251");
$text = mb_convert_encoding($text, "utf-8", "IBM866");
echo($text);
// OUTPUT: Определение ?Информационного продукта?
// Almost works, but the "?" characters should be "«" and "»"
For some combinations of encodings, neither approach works. For example, from ISO-8859-1 to IBM866:
$text = "\u{8F}à¨«®¦¥\u{AD}¨¥ 4"; // \u{8F} and \u{AD} are invisible characters, written as escapes
$text = mb_convert_encoding($text, "ISO-8859-1", "IBM866");
echo($text);
// OUTPUT: ???????????????????? 4
// Does not work
$text = "\u{8F}à¨«®¦¥\u{AD}¨¥ 4";
$text = mb_convert_encoding($text, "ISO-8859-1", "utf-8");
$text = mb_convert_encoding($text, "ISO-8859-1", "IBM866");
$text = mb_convert_encoding($text, "utf-8", "ISO-8859-1");
echo($text);
// OUTPUT: ?????????? 4
// Does not work
To make sure the original strings are intact, I did the same transformations in Python:
text = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗'
text = text.encode('cp866').decode('windows-1251')
print(text)
# OUTPUT: Определение «Информационного продукта»
# Works
text = '\x8fà¨«®¦¥\xad¨¥ 4'
text = text.encode('ISO-8859-1').decode('cp866')
print(text)
# OUTPUT: Приложение 4
# Works
Is it possible to get the same result in PHP as in Python?
Answer 1
Score: 0
Use UConverter::transcode, which converts a string from one character encoding to another.
> Description
>
> public static UConverter::transcode(
> string $str,
> string $toEncoding,
> string $fromEncoding,
> ?array $options = null
> ): string|false
>
> Converts str from fromEncoding to toEncoding.
The following script applies exactly the same encode-then-decode steps as the Python snippet .encode('cp866').decode('cp1251'):
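<?php
// Requires the intl extension.
$text0 = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
// Re-encode the UTF-8 string as IBM866 to recover the original bytes.
$text1 = UConverter::transcode($text0, 'IBM866', 'UTF-8');
// Decode those bytes as CP1251 into UTF-8.
$text2 = UConverter::transcode($text1, 'UTF-8', 'CP1251');
var_dump($text2);
?>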
Output .\SO\76752650.php
> string(74) "Определение «Информационного продукта»"
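To see why the two calls undo the damage, it can help to look at the intermediate bytes. The sketch below is only an illustration: it uses a shortened input (the first four characters of the string above) and assumes the intl extension is loaded.

```php
<?php
// Illustrative sketch; assumes the intl extension (UConverter) is available.
// '╬яЁх' is the first four characters of the garbled string above.
$garbled = '╬яЁх';

// Step 1: re-encode the UTF-8 string as IBM866. This recovers the raw bytes
// that were originally misread as IBM866.
$rawBytes = UConverter::transcode($garbled, 'IBM866', 'UTF-8');
echo bin2hex($rawBytes), PHP_EOL; // ceeff0e5

// Step 2: interpret those same bytes as CP1251 and convert them to UTF-8.
$restored = UConverter::transcode($rawBytes, 'UTF-8', 'CP1251');
echo $restored, PHP_EOL; // Опре
```

The bytes ce ef f0 e5 are the CP1251 codes for "Опре"; IBM866 assigns the same byte values to "╬яЁх", which is how the text got garbled in the first place.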
A more compact code snippet (the same result):
<?php
$text0 = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
var_dump(
    UConverter::transcode(
        UConverter::transcode($text0, 'IBM866', 'UTF-8'),
        'UTF-8',
        'CP1251'
    )
);
?>
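If the same repair is needed in several places, the two calls could be wrapped in a small function. This is a minimal sketch, not part of the intl API: the function name restoreText and its parameter names are assumptions made for illustration. It relies on UConverter::transcode returning false on failure.

```php
<?php
// Hypothetical helper (the name is an assumption, not part of intl).
// $misreadAs: the encoding the bytes were wrongly decoded with.
// $actual:    the encoding the bytes were really in.
function restoreText(string $garbled, string $misreadAs, string $actual): string
{
    // Recover the original bytes by re-encoding with the wrong encoding.
    $raw = UConverter::transcode($garbled, $misreadAs, 'UTF-8');
    if ($raw === false) {
        throw new RuntimeException("Cannot encode input as $misreadAs");
    }

    // Decode the recovered bytes with the encoding they were really in.
    $restored = UConverter::transcode($raw, 'UTF-8', $actual);
    if ($restored === false) {
        throw new RuntimeException("Cannot decode bytes as $actual");
    }

    return $restored;
}

echo restoreText('╬яЁхфхыхэшх', 'IBM866', 'CP1251'), PHP_EOL; // Определение
```

Naming the parameters after their role (wrongly assumed encoding versus real encoding) makes it harder to swap the arguments by accident.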
Answer 2
Score: 0
Thanks to JosefZ for the answer! I modified it to avoid additional libraries:
$text = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
$text = mb_convert_encoding($text, "IBM866", "utf-8");
$text = mb_convert_encoding($text, "utf-8", "WINDOWS-1251");
echo($text);
// OUTPUT: Определение «Информационного продукта»
$text = "\u{8F}à¨«®¦¥\u{AD}¨¥ 4"; // \u{8F} and \u{AD} are invisible characters, written as escapes
$text = mb_convert_encoding($text, "ISO-8859-1", "utf-8");
$text = mb_convert_encoding($text, "utf-8", "IBM866");
echo($text);
// OUTPUT: Приложение 4
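A note on why the three-step attempt in the question only almost worked: its second call converted from WINDOWS-1251 to IBM866, and IBM866 has no « or », so those characters were presumably replaced by "?". Converting straight to UTF-8 avoids that loss. The sketch below adds a round-trip check to catch such silent substitutions; the function name restoreWithCheck and the check itself are assumptions added for illustration, using only mbstring.

```php
<?php
// Hypothetical helper; the name and the round-trip check are assumptions
// added for illustration. Uses only the mbstring extension.
function restoreWithCheck(string $garbled, string $misreadAs, string $actual): ?string
{
    // mb_convert_encoding($string, $to, $from): the target encoding comes first.
    $raw      = mb_convert_encoding($garbled, $misreadAs, 'UTF-8');
    $restored = mb_convert_encoding($raw, 'UTF-8', $actual);

    // Reverse both steps. If any character was substituted along the way,
    // the result will not match the original input.
    $back = mb_convert_encoding(
        mb_convert_encoding($restored, $actual, 'UTF-8'),
        'UTF-8',
        $misreadAs
    );

    return $back === $garbled ? $restored : null;
}

var_dump(restoreWithCheck('л╚эЇю╗', 'IBM866', 'WINDOWS-1251'));
// string(12) "«Инфо»"
var_dump(restoreWithCheck('日本語', 'IBM866', 'WINDOWS-1251'));
// NULL
```

The second call returns null because IBM866 cannot represent the input, so the conversion is lossy and the round trip does not reproduce it.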