2023年5月15日 15:05:23go评论106阅读模式

英文:

Subsetting from serialized string in R

问题

I understand your request. Here is the translated code part:

df <- data.frame(ID = c(1,2,3),
                 information = c("{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
                                 "{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
                                 "{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}")
                 )
df <- df %>%
  mutate(
    name = substring(information, gregexpr('"', information)[[1]][3] + 1, gregexpr('"', information)[[1]][4] - 1)
  )
df

Please note that this code extracts the value between the third and fourth double quotes for each row in the "information" column.

英文:

I get a field from a SQL database in which various information is stored as a serialized object. I would like to extract the values from this string in R and save them as separate columns.

Here is a simplified example:

df &lt;- data.frame(ID = c(1,2,3),
                 information = c(&quot;{s:4:\&quot;name\&quot;;s:3:\&quot;max\&quot;;s:3:\&quot;age\&quot;;s:1:\&quot;8\&quot;;}&quot;,
                                 &quot;{s:4:\&quot;name\&quot;;s:5:\&quot;peter\&quot;;s:3:\&quot;age\&quot;;s:2:\&quot;10\&quot;;}&quot;,
                                 &quot;{s:4:\&quot;name\&quot;;s:4:\&quot;susy\&quot;;s:3:\&quot;age\&quot;;s:3:\&quot;100\&quot;;}&quot;) 
                 )
df &lt;- df %&gt;% 
  mutate(
    name = substring(information, gregexpr(&#39;&quot;&#39;, information)[[1]][3] +1, gregexpr(&#39;&quot;&#39;, information)[[1]][4]-1)
  )
df

I tried to select the value between the third and fourth quotes. Unfortunately, this selection only works for the first value. For the other selections, the positions are obviously taken from the first value.

[1] &quot;max&quot; &quot;pet&quot; &quot;sus&quot;

Thank you for your ideas

答案1

得分: 1

你可以使用 extract 来提取所有 'serialized' 变量：

library(tidyr)
df %>%
  extract(information,
          into = str_c("var", 1:4),
          regex = '.*"(\\w+).*"(\\w+).*"(\\w+).*"(\\w+)".*')
  ID var1  var2 var3 var4
1  1 name   max  age    8
2  2 name peter  age   10
3  3 name  susy  age  100

或者，如果你只想要 name 和 age 列，你可以使用 str_extract 和正向查找：

library(stringr)
df %>%
  mutate(name = str_extract(information, '(?<=name.{1,10}")\\w+'),
         age = str_extract(information, '(?<=age.{1,10}")\\d+'))
  ID                                  information  name age
1  1    {s:4:"name";s:3:"max";s:3:"age";s:1:"8";}   max   8
2  2 {s:4:"name";s:5:"peter";s:3:"age";s:2:"10";} peter  10
3  3 {s:4:"name";s:4:"susy";s:3:"age";s:3:"100";}  susy 100

这里 name 的正则表达式模式指定了以下操作：

(?<=name.{1,10}"): 如果你看到字符串 name 后跟至少 1 到 10 个任意字符，然后跟着 "，那么...
\\w+: ...提取此后的字母数字字符串。

英文:

You can use extract if you want to extract all the 'serialized' variables:

library(tidyr)
df %&gt;% 
  extract(information,
          into = str_c(&quot;var&quot;, 1:4),
          regex = &#39;.*&quot;(\\w+).*&quot;(\\w+).*&quot;(\\w+).*&quot;(\\w+).*&quot;&#39;)
  ID var1  var2 var3 var4
1  1 name   max  age    8
2  2 name peter  age   10
3  3 name  susy  age  100

Alternatively, if you just want the nameand agecolumns, you can use str_extract and positive look-behind:

library(stringr)
df %&gt;%
  mutate(name = str_extract(information, &#39;(?&lt;=name.{1,10}&quot;)\\w+&#39;),
         age = str_extract(information, &#39;(?&lt;=age.{1,10}&quot;)\\d+&#39;))
  ID                                  information  name age
1  1    {s:4:&quot;name&quot;;s:3:&quot;max&quot;;s:3:&quot;age&quot;;s:1:&quot;8&quot;;}   max   8
2  2 {s:4:&quot;name&quot;;s:5:&quot;peter&quot;;s:3:&quot;age&quot;;s:2:&quot;10&quot;;} peter  10
3  3 {s:4:&quot;name&quot;;s:4:&quot;susy&quot;;s:3:&quot;age&quot;;s:3:&quot;100&quot;;}  susy 100

Here the regex pattern for name specifies the following instructions:

(?<=name.{1,10}"): if you see the string namefollowed by at least 1 and at most 10 more characters of any kind and followed by " then ...
\\w+: ... extract the alphanumeric string thereafter

答案2

得分: 1

这看起来是一个序列化的PHP对象。这里是一个稍微不同的正则表达式方法，它仍然假设一致的结构，但允许我们验证关键列的一致性，并处理名称和年龄中的标点符号（将有问题的记录添加到输入数据集）:

library(dplyr)
library(tidyr)
library(stringr)
df <- data.frame(ID = c(1,2,3,4),
                 information = c("{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
                                 "{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
                                 "{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}",
                                 "{s:5:\"valid\";s:2:\"no\";s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}")
)
df %>%
  # 匹配除双引号之外的任何内容，这些内容位于 ':\"' 和 '\";' 之间，
  mutate(extr = str_extract_all(information, '(?<=:\")[^"]+(?=\";)')) %>%
  unnest_wider(extr, names_sep = "_")
#> # A tibble: 4 × 8
#>      ID information                    extr_1 extr_2 extr_3 extr_4 extr_5 extr_6
#>   <dbl> <chr>                          <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
#> 1     1 "{s:4:\"name\";s:3:\"max\";s:… name   max    age    8      <NA>   <NA>  
#> 2     2 "{s:4:\"name\";s:5:\"peter\";… name   peter  age    10     <NA>   <NA>  
#> 3     3 "{s:4:\"name\";s:4:\"susy\";s… name   susy   age    100    <NA>   <NA>  
#> 4     4 "{s:5:\"valid\";s:2:\"no\";s:… valid  no     name   susy   age    100

^{创建于2023年5月15日，使用reprex v2.0.2}

英文:

Looks like a serialized PHP object. Here's a slightly different regex approach, it still assumes consistent structure but allows us to verify consistency in key columns and does handle punctuation in names and ages (added offending record to input dataset):

library(dplyr)
library(tidyr)
library(stringr)
df &lt;- data.frame(ID = c(1,2,3,4),
                 information = c(&quot;{s:4:\&quot;name\&quot;;s:3:\&quot;max\&quot;;s:3:\&quot;age\&quot;;s:1:\&quot;8\&quot;;}&quot;,
                                 &quot;{s:4:\&quot;name\&quot;;s:5:\&quot;peter\&quot;;s:3:\&quot;age\&quot;;s:2:\&quot;10\&quot;;}&quot;,
                                 &quot;{s:4:\&quot;name\&quot;;s:4:\&quot;susy\&quot;;s:3:\&quot;age\&quot;;s:3:\&quot;100\&quot;;}&quot;,
                                 &quot;{s:5:\&quot;valid\&quot;;s:2:\&quot;no\&quot;;s:4:\&quot;name\&quot;;s:4:\&quot;susy\&quot;;s:3:\&quot;age\&quot;;s:3:\&quot;100\&quot;;}&quot;) 
)
df %&gt;% 
  # match anything but double-quotes that are between &#39;:&quot;&#39; and &#39;&quot;;&#39;, 
  mutate(extr = str_extract_all(information, &#39;(?&lt;=:&quot;)[^&quot;]+(?=&quot;;)&#39;)) %&gt;% 
  unnest_wider(extr, names_sep = &quot;_&quot;)
#&gt; # A tibble: 4 &#215; 8
#&gt;      ID information                    extr_1 extr_2 extr_3 extr_4 extr_5 extr_6
#&gt;   &lt;dbl&gt; &lt;chr&gt;                          &lt;chr&gt;  &lt;chr&gt;  &lt;chr&gt;  &lt;chr&gt;  &lt;chr&gt;  &lt;chr&gt; 
#&gt; 1     1 &quot;{s:4:\&quot;name\&quot;;s:3:\&quot;max\&quot;;s:… name   max    age    8      &lt;NA&gt;   &lt;NA&gt;  
#&gt; 2     2 &quot;{s:4:\&quot;name\&quot;;s:5:\&quot;peter\&quot;;… name   peter  age    10     &lt;NA&gt;   &lt;NA&gt;  
#&gt; 3     3 &quot;{s:4:\&quot;name\&quot;;s:4:\&quot;susy\&quot;;s… name   susy   age    100    &lt;NA&gt;   &lt;NA&gt;  
#&gt; 4     4 &quot;{s:5:\&quot;valid\&quot;;s:2:\&quot;no\&quot;;s:… valid  no     name   susy   age    100

<sup>Created on 2023-05-15 with reprex v2.0.2</sup>

答案3

得分: 1

用分隔符拆分并读取为分隔文本，保留我们需要的列：
    read.delim(text = gsub("&quot;;&quot;", "&quot;:&quot;", df$information, fixed = TRUE),
               sep = "&quot;:&quot;", header = FALSE)[, c(3, 6, 9, 12)]
    #     V3    V6  V9 V12
    # 1 name   max age   8
    # 2 name peter age  10
    # 3 name  susy age 100
**注意：** 有一个模式 `seq(3, 12, 3)` 用于自动化列选择。

英文:

Split on delimiter and read as delimited text, keep columns we need:

read.delim(text = gsub(&quot;;&quot;, &quot;:&quot;, df$information, fixed = TRUE),
           sep = &quot;:&quot;, header = FALSE)[, c(3, 6, 9, 12)]
#     V3    V6  V9 V12
# 1 name   max age   8
# 2 name peter age  10
# 3 name  susy age 100

Note: there is a pattern seq(3, 12, 3) to automate the column selection.

答案4

得分: 1

If you still have access to the complete serialized strings, you could also use PHP to transform to a portable intermediate format like JSON, which R can parse. Then you won't have to take care of all the complex escaping rules in those strings. For example:


library(tidyverse)
unserialize_php <- function(information) {
  system2("php", c("-r", shQuote("$output = []; while(($line=fgets(STDIN))!==false) {$new_output = unserialize($line); $output[] = $new_output ? $new_output : (object)[];} echo(json_encode($output));")), stdout = TRUE, stderr = "", input = information) %>% fromJSON(simplifyDataFrame = TRUE)
}
df <- tibble(
  ID = 1:3,
  information = unserialize_php(
    c("a:2:{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
      "a:2:{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
      "a:2:{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}")
  )
)
df
# # A tibble: 3 x 2
#      ID information$name $age 
#   <int> <chr>            <chr>
# 1     1 max              8    
# 2     2 peter            10   
# 3     3 susy             100

The critical piece here is the unserialize_php function. It forwards all values of the argument information (or, the information column of the data frame df, respectively) as separate lines to a temporary PHP instance. The script to be run in PHP is supplied as a command-line argument (-r). The script then takes the input line by line, unserializes each line, adds them to a temporary array($output) and finally outputs the array as JSON. PHP writes to stdout and R picks it up and returns it as the return value of system2 (because stdin = TRUE). fromJSON from the jsonlite package parses the JSON data and transforms it to a data frame. If the string representations passed to the function are malformed, you will see PHP errors in the R console and the respective lines will be empty / NAs.

Note: You need the complete serialized strings, as originally produced by PHP. I edited your original values to what those should look like. Also, this approach requires that R can call PHP via command line on your machine.

英文:


library(tidyverse)
unserialize_php &lt;- function(information) {
  system2(&quot;php&quot;, c(&quot;-r&quot;, shQuote(&quot;$output = []; while(($line=fgets(STDIN))!==false) {$new_output = unserialize($line); $output[] = $new_output ? $new_output : (object)[];} echo(json_encode($output));&quot;)), stdout = TRUE, stderr = &quot;&quot;, input = information) %&gt;% fromJSON(simplifyDataFrame = TRUE)
}
df &lt;- tibble(
  ID = 1:3,
  information = unserialize_php(
    c(&quot;a:2:{s:4:\&quot;name\&quot;;s:3:\&quot;max\&quot;;s:3:\&quot;age\&quot;;s:1:\&quot;8\&quot;;};&quot;,
      &quot;a:2:{s:4:\&quot;name\&quot;;s:5:\&quot;peter\&quot;;s:3:\&quot;age\&quot;;s:2:\&quot;10\&quot;;};&quot;,
      &quot;a:2:{s:4:\&quot;name\&quot;;s:4:\&quot;susy\&quot;;s:3:\&quot;age\&quot;;s:3:\&quot;100\&quot;;};&quot;)
  )
)
df
# # A tibble: 3 &#215; 2
#      ID information$name $age 
#   &lt;int&gt; &lt;chr&gt;            &lt;chr&gt;
# 1     1 max              8    
# 2     2 peter            10   
# 3     3 susy             100

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

从R中的序列化字符串进行子集选择

问题

答案1

答案2

答案3

答案4

在R中如何使用”rank”或其他类似的分组函数与条件结合使用。

分开更宽的部分，第一半成为列名，第二半成为单元格值。

保存ccf()循环输出到r中

如何根据数据帧中的分组来增加数字距离？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

发表评论