Handling Out of Range Hex/Unicode

Question

I'm working with a Rust cdylib crate that I'm referencing and using in C++.

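use std::ffi::CStr;
use std::os::raw::c_char;

#[no_mangle]
pub extern "C" fn some_function(name: *const c_char, text: *const c_char) {
    unsafe {
        // to_str() validates UTF-8; unwrap() panics if validation fails
        let name = CStr::from_ptr(name).to_str().unwrap();
        let text = CStr::from_ptr(text).to_str().unwrap();

        // the rest
    }
}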

When this function receives the character ±, it panics while converting the pointer to a &str. On the C++ side I pass the text as the c_str() of a std::string. The panic message is:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 0, error_len: Some(1) }', src\lib.rs:102:50
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Is there any way to handle this character properly in Rust? I don't need to manipulate it; this library simply acts as a middleman and just needs to pass the text along.

When I use this to view the bytes I'm receiving:

let raw = CStr::from_ptr(text);

println!("Bytes: {:?}", raw.to_bytes_with_nul());

I get:

Bytes: [177, 0]

Answer 1

Score: 1

Here is how I reproduced your problem:

use std::ffi::CStr;

fn main() {
    let raw_data: &[u8] = &[177, 0];
    let raw = unsafe { CStr::from_ptr(raw_data.as_ptr().cast()) };

    println!("Bytes: {:?}", raw.to_bytes_with_nul());

    let string = raw.to_str().unwrap();
    println!("{}", string);
}
Bytes: [177, 0]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 0, error_len: Some(1) }', src\main.rs:9:31

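The problem is that to_str() requires valid UTF-8, and the single byte [177] is not: 177 is 0xB1 (binary 10110001), a continuation byte, which cannot start a UTF-8 sequence. That matches the error, which reports valid_up_to: 0 and error_len: Some(1). The valid UTF-8 encoding of ± is two bytes: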

println!("{:?}", "±".as_bytes());
[194, 177]
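
To see the difference without a panic, the validation can be run through std::str::from_utf8 and the error inspected instead of unwrapped. This is only a minimal sketch; describe_utf8 is a name made up for this illustration and is not part of the original code.

```rust
/// Hypothetical helper: returns the decoded text, or a description of
/// where UTF-8 validation failed.
fn describe_utf8(bytes: &[u8]) -> Result<&str, String> {
    std::str::from_utf8(bytes).map_err(|e| {
        format!(
            "invalid UTF-8 after {} valid byte(s), bad sequence length {:?}",
            e.valid_up_to(), // number of leading bytes that were valid
            e.error_len()    // length of the offending sequence, if known
        )
    })
}
```

Calling describe_utf8(&[194, 177]) yields Ok("±"), while describe_utf8(&[177]) yields an Err that carries the same valid_up_to and error_len values as the panic message.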

Your string appears to use a different encoding, most likely Windows-1252, in which byte 177 (0xB1) is ±. I will assume that here; without knowing more about your code there is no way to be sure, but it is a good guess because Windows-1252 is the default ANSI code page on Windows in Western locales.
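
A rough way to sanity-check that guess with the standard library alone is to decode the byte as ISO-8859-1 (Latin-1), where every byte maps to the Unicode code point of the same value. Note the assumption: Latin-1 agrees with Windows-1252 only outside the range 0x80 to 0x9F, so this sketch is a check for this particular byte and not a Windows-1252 decoder. decode_latin1 is a hypothetical name.

```rust
/// Hypothetical helper: decodes bytes as ISO-8859-1 (Latin-1).
/// Each byte becomes the Unicode code point with the same numeric value.
/// This is NOT Windows-1252: the two differ for bytes 0x80..=0x9F.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}
```

decode_latin1(&[177]) gives "±", and the resulting String holds the UTF-8 bytes [194, 177], which is consistent with the input being a single-byte Western encoding.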

The easiest way to convert between encodings is the encoding_rs crate. Rust's standard library handles UTF-8 but has no decoder for legacy encodings such as Windows-1252, so you need an external crate, and this is one of the most established.

use std::ffi::{c_char, CStr};

use encoding_rs::WINDOWS_1252;

fn main() {
    let raw_data: *const c_char = (&[177u8, 0u8]).as_ptr().cast();

    let raw = unsafe { CStr::from_ptr(raw_data) };

    println!("Bytes: {:?}", raw.to_bytes_with_nul());

    let (string, actual_encoding, errors) = WINDOWS_1252.decode(raw.to_bytes());

    println!("String: {:?}", string);
    println!("Actual encoding: {:?}", actual_encoding);
    println!("Errors: {}", errors);
}
Bytes: [177, 0]
String: "±"
Actual encoding: Encoding { windows-1252 }
Errors: false
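
If the library really only needs to pass the text along, as the question says, another option is to skip decoding altogether and forward the raw bytes. The following is a sketch under that assumption; c_bytes is a hypothetical helper, and whether the receiving side accepts undecoded bytes depends on code not shown in the question.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical helper: borrows the bytes behind a C string without
/// validating or decoding them. Returns None for a null pointer.
///
/// # Safety
/// `ptr` must be null or point to a NUL-terminated string that stays
/// valid and unmodified for the lifetime 'a.
unsafe fn c_bytes<'a>(ptr: *const c_char) -> Option<&'a [u8]> {
    if ptr.is_null() {
        None
    } else {
        // to_bytes() excludes the trailing NUL and performs no UTF-8 check
        Some(unsafe { CStr::from_ptr(ptr) }.to_bytes())
    }
}
```

For the input from the question this returns Some(&[177]), which can be handed on unchanged. CStr::to_string_lossy is not a substitute here: it would replace the invalid byte with U+FFFD and the original character would be lost.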

huangapple
  • Published on March 31, 2023, 21:02:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75898868.html