从R中的序列化字符串进行子集选择

huangapple go评论58阅读模式
英文:

Subsetting from serialized string in R

问题

I understand your request. Here is the translated code part:

df <- data.frame(ID = c(1,2,3),
                 information = c("{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
                                 "{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
                                 "{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}")
                 )

df <- df %>%
  mutate(
    name = substring(information, gregexpr('"', information)[[1]][3] + 1, gregexpr('"', information)[[1]][4] - 1)
  )

df

Please note that this code extracts the value between the third and fourth double quotes for each row in the "information" column.

英文:

I get a field from a SQL database in which various information is stored as a serialized object. I would like to extract the values ​​from this string in R and save them as separate columns.

Here is a simplified example:

df &lt;- data.frame(ID = c(1,2,3),
                 information = c(&quot;{s:4:\&quot;name\&quot;;s:3:\&quot;max\&quot;;s:3:\&quot;age\&quot;;s:1:\&quot;8\&quot;;}&quot;,
                                 &quot;{s:4:\&quot;name\&quot;;s:5:\&quot;peter\&quot;;s:3:\&quot;age\&quot;;s:2:\&quot;10\&quot;;}&quot;,
                                 &quot;{s:4:\&quot;name\&quot;;s:4:\&quot;susy\&quot;;s:3:\&quot;age\&quot;;s:3:\&quot;100\&quot;;}&quot;) 
                 )

df &lt;- df %&gt;% 
  mutate(
    name = substring(information, gregexpr(&#39;&quot;&#39;, information)[[1]][3] +1, gregexpr(&#39;&quot;&#39;, information)[[1]][4]-1)
  )

df

I tried to select the value between the third and fourth quotes. Unfortunately, this selection only works for the first value. For the other selections, the positions are obviously taken from the first value.

[1] &quot;max&quot; &quot;pet&quot; &quot;sus&quot;

Thank you for your ideas

答案1

得分: 1

你可以使用 extract 来提取所有 'serialized' 变量:

library(tidyr)
df %>%
  extract(information,
          into = str_c("var", 1:4),
          regex = '.*"(\\w+).*"(\\w+).*"(\\w+).*"(\\w+)".*')
  ID var1  var2 var3 var4
1  1 name   max  age    8
2  2 name peter  age   10
3  3 name  susy  age  100

或者,如果你只想要 nameage 列,你可以使用 str_extract 和正向查找:

library(stringr)
df %>%
  mutate(name = str_extract(information, '(?<=name.{1,10}")\\w+'),
         age = str_extract(information, '(?<=age.{1,10}")\\d+'))
  ID                                  information  name age
1  1    {s:4:"name";s:3:"max";s:3:"age";s:1:"8";}   max   8
2  2 {s:4:"name";s:5:"peter";s:3:"age";s:2:"10";} peter  10
3  3 {s:4:"name";s:4:"susy";s:3:"age";s:3:"100";}  susy 100

这里 name 的正则表达式模式指定了以下操作:

  • (?<=name.{1,10}"): 如果你看到字符串 name 后跟至少 1 到 10 个任意字符,然后跟着 ",那么...
  • \\w+: ...提取此后的字母数字字符串。
英文:

You can use extract if you want to extract all the 'serialized' variables:

library(tidyr)
df %&gt;% 
  extract(information,
          into = str_c(&quot;var&quot;, 1:4),
          regex = &#39;.*&quot;(\\w+).*&quot;(\\w+).*&quot;(\\w+).*&quot;(\\w+).*&quot;&#39;)
  ID var1  var2 var3 var4
1  1 name   max  age    8
2  2 name peter  age   10
3  3 name  susy  age  100

Alternatively, if you just want the nameand agecolumns, you can use str_extract and positive look-behind:

library(stringr)
df %&gt;%
  mutate(name = str_extract(information, &#39;(?&lt;=name.{1,10}&quot;)\\w+&#39;),
         age = str_extract(information, &#39;(?&lt;=age.{1,10}&quot;)\\d+&#39;))
  ID                                  information  name age
1  1    {s:4:&quot;name&quot;;s:3:&quot;max&quot;;s:3:&quot;age&quot;;s:1:&quot;8&quot;;}   max   8
2  2 {s:4:&quot;name&quot;;s:5:&quot;peter&quot;;s:3:&quot;age&quot;;s:2:&quot;10&quot;;} peter  10
3  3 {s:4:&quot;name&quot;;s:4:&quot;susy&quot;;s:3:&quot;age&quot;;s:3:&quot;100&quot;;}  susy 100

Here the regex pattern for name specifies the following instructions:

  • (?&lt;=name.{1,10}&quot;): if you see the string namefollowed by at least 1 and at most 10 more characters of any kind and followed by &quot; then ...
  • \\w+: ... extract the alphanumeric string thereafter

答案2

得分: 1

这看起来是一个序列化的PHP对象。这里是一个稍微不同的正则表达式方法,它仍然假设一致的结构,但允许我们验证关键列的一致性,并处理名称和年龄中的标点符号(将有问题的记录添加到输入数据集):

library(dplyr)
library(tidyr)
library(stringr)

df <- data.frame(ID = c(1,2,3,4),
                 information = c("{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
                                 "{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
                                 "{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}",
                                 "{s:5:\"valid\";s:2:\"no\";s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}")
)

df %>%
  # 匹配除双引号之外的任何内容,这些内容位于 ':\"' 和 '\";' 之间,
  mutate(extr = str_extract_all(information, '(?<=:\")[^"]+(?=\";)')) %>%
  unnest_wider(extr, names_sep = "_")
#> # A tibble: 4 × 8
#>      ID information                    extr_1 extr_2 extr_3 extr_4 extr_5 extr_6
#>   <dbl> <chr>                          <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
#> 1     1 "{s:4:\"name\";s:3:\"max\";s:… name   max    age    8      <NA>   <NA>  
#> 2     2 "{s:4:\"name\";s:5:\"peter\";… name   peter  age    10     <NA>   <NA>  
#> 3     3 "{s:4:\"name\";s:4:\"susy\";s… name   susy   age    100    <NA>   <NA>  
#> 4     4 "{s:5:\"valid\";s:2:\"no\";s:… valid  no     name   susy   age    100

创建于2023年5月15日,使用reprex v2.0.2

英文:

Looks like a serialized PHP object. Here's a slightly different regex approach, it still assumes consistent structure but allows us to verify consistency in key columns and does handle punctuation in names and ages (added offending record to input dataset):

library(dplyr)
library(tidyr)
library(stringr)

df &lt;- data.frame(ID = c(1,2,3,4),
                 information = c(&quot;{s:4:\&quot;name\&quot;;s:3:\&quot;max\&quot;;s:3:\&quot;age\&quot;;s:1:\&quot;8\&quot;;}&quot;,
                                 &quot;{s:4:\&quot;name\&quot;;s:5:\&quot;peter\&quot;;s:3:\&quot;age\&quot;;s:2:\&quot;10\&quot;;}&quot;,
                                 &quot;{s:4:\&quot;name\&quot;;s:4:\&quot;susy\&quot;;s:3:\&quot;age\&quot;;s:3:\&quot;100\&quot;;}&quot;,
                                 &quot;{s:5:\&quot;valid\&quot;;s:2:\&quot;no\&quot;;s:4:\&quot;name\&quot;;s:4:\&quot;susy\&quot;;s:3:\&quot;age\&quot;;s:3:\&quot;100\&quot;;}&quot;) 
)

df %&gt;% 
  # match anything but double-quotes that are between &#39;:&quot;&#39; and &#39;&quot;;&#39;, 
  mutate(extr = str_extract_all(information, &#39;(?&lt;=:&quot;)[^&quot;]+(?=&quot;;)&#39;)) %&gt;% 
  unnest_wider(extr, names_sep = &quot;_&quot;)
#&gt; # A tibble: 4 &#215; 8
#&gt;      ID information                    extr_1 extr_2 extr_3 extr_4 extr_5 extr_6
#&gt;   &lt;dbl&gt; &lt;chr&gt;                          &lt;chr&gt;  &lt;chr&gt;  &lt;chr&gt;  &lt;chr&gt;  &lt;chr&gt;  &lt;chr&gt; 
#&gt; 1     1 &quot;{s:4:\&quot;name\&quot;;s:3:\&quot;max\&quot;;s:… name   max    age    8      &lt;NA&gt;   &lt;NA&gt;  
#&gt; 2     2 &quot;{s:4:\&quot;name\&quot;;s:5:\&quot;peter\&quot;;… name   peter  age    10     &lt;NA&gt;   &lt;NA&gt;  
#&gt; 3     3 &quot;{s:4:\&quot;name\&quot;;s:4:\&quot;susy\&quot;;s… name   susy   age    100    &lt;NA&gt;   &lt;NA&gt;  
#&gt; 4     4 &quot;{s:5:\&quot;valid\&quot;;s:2:\&quot;no\&quot;;s:… valid  no     name   susy   age    100

<sup>Created on 2023-05-15 with reprex v2.0.2</sup>

答案3

得分: 1

用分隔符拆分并读取为分隔文本,保留我们需要的列:

    read.delim(text = gsub("&quot;;&quot;", "&quot;:&quot;", df$information, fixed = TRUE),
               sep = "&quot;:&quot;", header = FALSE)[, c(3, 6, 9, 12)]
    #     V3    V6  V9 V12
    # 1 name   max age   8
    # 2 name peter age  10
    # 3 name  susy age 100

**注意:** 有一个模式 `seq(3, 12, 3)` 用于自动化列选择。
英文:

Split on delimiter and read as delimited text, keep columns we need:

read.delim(text = gsub(&quot;;&quot;, &quot;:&quot;, df$information, fixed = TRUE),
           sep = &quot;:&quot;, header = FALSE)[, c(3, 6, 9, 12)]
#     V3    V6  V9 V12
# 1 name   max age   8
# 2 name peter age  10
# 3 name  susy age 100

Note: there is a pattern seq(3, 12, 3) to automate the column selection.

答案4

得分: 1

If you still have access to the complete serialized strings, you could also use PHP to transform to a portable intermediate format like JSON, which R can parse. Then you won't have to take care of all the complex escaping rules in those strings. For example:


library(tidyverse)

unserialize_php <- function(information) {
  system2("php", c("-r", shQuote("$output = []; while(($line=fgets(STDIN))!==false) {$new_output = unserialize($line); $output[] = $new_output ? $new_output : (object)[];} echo(json_encode($output));")), stdout = TRUE, stderr = "", input = information) %>% fromJSON(simplifyDataFrame = TRUE)
}

df <- tibble(
  ID = 1:3,
  information = unserialize_php(
    c("a:2:{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
      "a:2:{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
      "a:2:{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}")
  )
)

df

# # A tibble: 3 x 2
#      ID information$name $age 
#   <int> <chr>            <chr>
# 1     1 max              8    
# 2     2 peter            10   
# 3     3 susy             100  

The critical piece here is the unserialize_php function. It forwards all values of the argument information (or, the information column of the data frame df, respectively) as separate lines to a temporary PHP instance. The script to be run in PHP is supplied as a command-line argument (-r). The script then takes the input line by line, unserializes each line, adds them to a temporary array($output) and finally outputs the array as JSON. PHP writes to stdout and R picks it up and returns it as the return value of system2 (because stdin = TRUE). fromJSON from the jsonlite package parses the JSON data and transforms it to a data frame. If the string representations passed to the function are malformed, you will see PHP errors in the R console and the respective lines will be empty / NAs.

Note: You need the complete serialized strings, as originally produced by PHP. I edited your original values to what those should look like. Also, this approach requires that R can call PHP via command line on your machine.

英文:

If you still have access to the complete serialized strings, you could also use PHP to transform to a portable intermediate format like JSON, which R can parse. Then you won't have to take care of all the complex escaping rules in those strings. For example:



library(tidyverse)

unserialize_php &lt;- function(information) {
  system2(&quot;php&quot;, c(&quot;-r&quot;, shQuote(&quot;$output = []; while(($line=fgets(STDIN))!==false) {$new_output = unserialize($line); $output[] = $new_output ? $new_output : (object)[];} echo(json_encode($output));&quot;)), stdout = TRUE, stderr = &quot;&quot;, input = information) %&gt;% fromJSON(simplifyDataFrame = TRUE)
}

df &lt;- tibble(
  ID = 1:3,
  information = unserialize_php(
    c(&quot;a:2:{s:4:\&quot;name\&quot;;s:3:\&quot;max\&quot;;s:3:\&quot;age\&quot;;s:1:\&quot;8\&quot;;};&quot;,
      &quot;a:2:{s:4:\&quot;name\&quot;;s:5:\&quot;peter\&quot;;s:3:\&quot;age\&quot;;s:2:\&quot;10\&quot;;};&quot;,
      &quot;a:2:{s:4:\&quot;name\&quot;;s:4:\&quot;susy\&quot;;s:3:\&quot;age\&quot;;s:3:\&quot;100\&quot;;};&quot;)
  )
)

df

# # A tibble: 3 &#215; 2
#      ID information$name $age 
#   &lt;int&gt; &lt;chr&gt;            &lt;chr&gt;
# 1     1 max              8    
# 2     2 peter            10   
# 3     3 susy             100  

The critical piece here is the unserialize_php function. It forwards all values of the argument information (or, the information column of the data frame df, respectively) as separate lines to a temporary PHP instance. The script to be run in PHP is supplied as a command-line argument (-r). The script then takes the input line by line, unserializes each line, adds them to a temporary array($output) and finally outputs the array as JSON. PHP writes to stdout and R picks it up and returns it as the return value of system2 (because stdin = TRUE). fromJSON from the jsonlite package parses the JSON data and transforms it to a data frame. If the string representations passed to the function are malformed, you will see PHP errors in the R console and the respective lines will be empty / NAs.

Note: You need the complete serialized strings, as originally produced by PHP. I edited your original values to what those should look like. Also, this approach requires that R can call PHP via command line on your machine.

huangapple
  • 本文由 发表于 2023年5月15日 15:05:23
  • 转载请务必保留本文链接:https://go.coder-hub.com/76251612.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定