How to convert a whole column of date strings to integers

Question

I have a DataFrame that looks like this:
[image of the DataFrame]

I would like to convert the first column, 'time', which contains date strings, into a column of plain numbers in the format "YYYYMMDD" (e.g. 20230531), stored as u64.

I tried writing a function to do this, but I am struggling, especially with how to remove the hyphens from the date strings.

pub fn convert_str_to_num(df: &DataFrame) -> Result<DataFrame, PolarsError> {
    let mut final_df = df.clone();
    let col_name = String::from("time");
    let time_col = df.column(col_name.as_str())?;
    let mut new_time_col = time_col.clone().replace("-", "")?;
    // replace the old column with the new one
    final_df.replace(col_name.as_str(), new_time_col.as_mut())?;
    Ok(final_df)
}

However, this fails to compile with:

error[E0599]: no method named `replace` found for struct `polars::prelude::Series` in the current scope
  --> src/main.rs:13:45
   |
13 |     let mut new_time_col = time_col.clone().replace("-", "")?;
   |                                             ^^^^^^^ method not found in `Series`

Answer 1

Score: 2

Assuming you have already obtained the date string from the first column, I would use a function like the one in the following example.

It starts by splitting the string slice on the '-' separator.
This yields an iterator over sub-slices of the input string slice; no part of the original string is copied.

At each iteration, we try to parse the sub-slice as a u64 value.
If this fails, the function reports the error thanks to ?.
When it succeeds, we update the running value so that the final result is the one you expect (100×100×year + 100×month + day).
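
As a quick check of that arithmetic, here is a minimal sketch showing that multiplying the running value by 100 before adding each part gives the same number as the closed form. The helper name `accumulate` is hypothetical and only serves this illustration.

```rust
// Hypothetical helper: fold the parts left to right, shifting the running
// value by two decimal digits before adding the next part.
fn accumulate(parts: &[u64]) -> u64 {
    parts.iter().fold(0, |acc, part| acc * 100 + part)
}

fn main() {
    let (year, month, day) = (2023_u64, 5_u64, 31_u64);
    let folded = accumulate(&[year, month, day]);
    // Closed form from the text: 100×100×year + 100×month + day.
    let direct = 100 * 100 * year + 100 * month + day;
    assert_eq!(folded, direct);
    println!("{}", folded); // prints 20230531
}
```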

In the end, we must make sure exactly three parts were parsed (year, month, day) and report an error otherwise.

Finally, the value updated over the three iterations is the expected result.

Note that we could add some bounds checking for the month and the day.

fn txt_date_to_u64(
    txt_date: &str
) -> Result<u64, Box<dyn std::error::Error>> {
    let mut part_count = 0;
    let mut year_month_day = 0;
    for part in txt_date.split('-') {
        year_month_day = year_month_day * 100 + str::parse::<u64>(part)?;
        part_count += 1;
    }
    if part_count != 3 {
        Err("unexpected date")?;
    }
    Ok(year_month_day)
}

fn main() {
    for txt_date in [
        "2023-05-31",
        "what???",
        "2004-01-07",
        "2004-01",
        "2004-01-07-19",
    ] {
        match txt_date_to_u64(txt_date) {
            Ok(d) => {
                println!("{:?} ~~> {:?}", txt_date, d);
            }
            Err(e) => {
                println!("{:?} ~~> !!! {:?} !!!", txt_date, e);
            }
        }
    }
}
/*
"2023-05-31" ~~> 20230531
"what???" ~~> !!! ParseIntError { kind: InvalidDigit } !!!
"2004-01-07" ~~> 20040107
"2004-01" ~~> !!! "unexpected date" !!!
"2004-01-07-19" ~~> !!! "unexpected date" !!!
*/
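
The bounds checking mentioned in the note could look roughly like the sketch below. The function name `txt_date_to_u64_checked` and the error messages are hypothetical, and the checks are deliberately coarse: the day is only tested against 1..=31, not against the actual length of the month.

```rust
use std::error::Error;

// Hypothetical variant of txt_date_to_u64 with simple range checks.
// Assumption: the input is expected to look like "YYYY-MM-DD".
fn txt_date_to_u64_checked(txt_date: &str) -> Result<u64, Box<dyn Error>> {
    let parts: Vec<&str> = txt_date.split('-').collect();
    if parts.len() != 3 {
        return Err("unexpected date".into());
    }
    let year: u64 = parts[0].parse()?;
    let month: u64 = parts[1].parse()?;
    let day: u64 = parts[2].parse()?;
    if !(1..=12).contains(&month) {
        return Err("month out of range".into());
    }
    // Coarse check only: does not account for month length or leap years.
    if !(1..=31).contains(&day) {
        return Err("day out of range".into());
    }
    Ok(year * 10_000 + month * 100 + day)
}

fn main() {
    for txt_date in ["2023-05-31", "2023-13-01", "2023-02-00"] {
        match txt_date_to_u64_checked(txt_date) {
            Ok(d) => println!("{:?} ~~> {:?}", txt_date, d),
            Err(e) => println!("{:?} ~~> !!! {} !!!", txt_date, e),
        }
    }
}
```

If stricter validation is needed (for example rejecting February 30), a date library would probably be a better fit than extending these checks by hand.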

Answer 2

Score: 1

It turns out I have solved my own question.

use polars::prelude::*;
use std::borrow::Cow;

fn convert_str_to_int(mut df: DataFrame, date_col_name: &str) -> Result<DataFrame, PolarsError> {
    // Get the date column as a Series
    let date_col = df.column(date_col_name)?;
    // Convert each date string into an unsigned 32-bit integer value in the form of "YYYYMMDD"
    let int_values = date_col
        .utf8()?
        .into_iter()
        .map(|date_str| {
            // Remove the hyphens (panics if the value is null)
            let int_str = Cow::from(date_str.unwrap().replace('-', ""));
            // Parse the digit string as u32 (panics if it is not numeric)
            int_str.parse::<u32>().unwrap()
        })
        .collect::<Vec<_>>();
    // Create a new UInt32Chunked to replace the original column
    let u32_col = UInt32Chunked::new(date_col_name, int_values).into_series();
    // Replace the original column in place with the converted u32 column
    df.replace(date_col_name, u32_col)?;
    Ok(df)
}
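
The function above panics on a null or malformed value because of the two `unwrap` calls. The following sketch shows one way to keep such values as nulls instead. It uses only the standard library: the slice of `Option<&str>` is a stand-in for a nullable string column, and `convert_column` is a hypothetical helper, not a polars function.

```rust
// Hypothetical helper. The slice of Option<&str> stands in for a nullable
// string column; with polars the values would come from the column itself
// (assumption, not shown here).
fn convert_column(values: &[Option<&str>]) -> Vec<Option<u32>> {
    values
        .iter()
        .map(|&value| {
            // A null stays null; a string that does not parse becomes null.
            value.and_then(|s| s.replace('-', "").parse::<u32>().ok())
        })
        .collect()
}

fn main() {
    let column = [Some("2023-05-31"), None, Some("not a date")];
    let converted = convert_column(&column);
    println!("{:?}", converted); // prints [Some(20230531), None, None]
}
```

Note that removing the hyphens and parsing does not validate the date layout: an input such as "2023-5-31" would still parse, but to 2023531 rather than 20230531.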

huangapple
  • Posted on June 8, 2023 at 20:18:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76431805.html