英文:
Subsetting from serialized string in R
问题
I understand your request. Here is the translated code part:
df <- data.frame(ID = c(1,2,3),
information = c("{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
"{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
"{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}")
)
df <- df %>%
mutate(
name = substring(information, gregexpr('"', information)[[1]][3] + 1, gregexpr('"', information)[[1]][4] - 1)
)
df
Please note that this code extracts the value between the third and fourth double quotes for each row in the "information" column.
英文:
I get a field from a SQL database in which various information is stored as a serialized object. I would like to extract the values from this string in R and save them as separate columns.
Here is a simplified example:
df <- data.frame(ID = c(1,2,3),
information = c("{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
"{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
"{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}")
)
df <- df %>%
mutate(
name = substring(information, gregexpr('"', information)[[1]][3] +1, gregexpr('"', information)[[1]][4]-1)
)
df
I tried to select the value between the third and fourth quotes. Unfortunately, this selection only works for the first value. For the other selections, the positions are obviously taken from the first value.
[1] "max" "pet" "sus"
Thank you for your ideas
答案1
得分: 1
你可以使用 extract
来提取所有 'serialized' 变量:
library(tidyr)
df %>%
extract(information,
into = str_c("var", 1:4),
regex = '.*"(\\w+).*"(\\w+).*"(\\w+).*"(\\w+)".*')
ID var1 var2 var3 var4
1 1 name max age 8
2 2 name peter age 10
3 3 name susy age 100
或者,如果你只想要 name
和 age
列,你可以使用 str_extract
和正向查找:
library(stringr)
df %>%
mutate(name = str_extract(information, '(?<=name.{1,10}")\\w+'),
age = str_extract(information, '(?<=age.{1,10}")\\d+'))
ID information name age
1 1 {s:4:"name";s:3:"max";s:3:"age";s:1:"8";} max 8
2 2 {s:4:"name";s:5:"peter";s:3:"age";s:2:"10";} peter 10
3 3 {s:4:"name";s:4:"susy";s:3:"age";s:3:"100";} susy 100
这里 name
的正则表达式模式指定了以下操作:
(?<=name.{1,10}")
: 如果你看到字符串name
后跟至少 1 到 10 个任意字符,然后跟着"
,那么...\\w+
: ...提取此后的字母数字字符串。
英文:
You can use extract
if you want to extract all the 'serialized' variables:
library(tidyr)
df %>%
extract(information,
into = str_c("var", 1:4),
regex = '.*"(\\w+).*"(\\w+).*"(\\w+).*"(\\w+).*"')
ID var1 var2 var3 var4
1 1 name max age 8
2 2 name peter age 10
3 3 name susy age 100
Alternatively, if you just want the name
and age
columns, you can use str_extract
and positive look-behind:
library(stringr)
df %>%
mutate(name = str_extract(information, '(?<=name.{1,10}")\\w+'),
age = str_extract(information, '(?<=age.{1,10}")\\d+'))
ID information name age
1 1 {s:4:"name";s:3:"max";s:3:"age";s:1:"8";} max 8
2 2 {s:4:"name";s:5:"peter";s:3:"age";s:2:"10";} peter 10
3 3 {s:4:"name";s:4:"susy";s:3:"age";s:3:"100";} susy 100
Here the regex pattern for name
specifies the following instructions:
(?<=name.{1,10}")
: if you see the stringname
followed by at least 1 and at most 10 more characters of any kind and followed by"
then ...\\w+
: ... extract the alphanumeric string thereafter
答案2
得分: 1
这看起来是一个序列化的PHP对象。这里是一个稍微不同的正则表达式方法,它仍然假设一致的结构,但允许我们验证关键列的一致性,并处理名称和年龄中的标点符号(将有问题的记录添加到输入数据集):
library(dplyr)
library(tidyr)
library(stringr)
df <- data.frame(ID = c(1,2,3,4),
information = c("{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
"{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
"{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}",
"{s:5:\"valid\";s:2:\"no\";s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}")
)
df %>%
# 匹配除双引号之外的任何内容,这些内容位于 ':\"' 和 '\";' 之间,
mutate(extr = str_extract_all(information, '(?<=:\")[^"]+(?=\";)')) %>%
unnest_wider(extr, names_sep = "_")
#> # A tibble: 4 × 8
#> ID information extr_1 extr_2 extr_3 extr_4 extr_5 extr_6
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 "{s:4:\"name\";s:3:\"max\";s:… name max age 8 <NA> <NA>
#> 2 2 "{s:4:\"name\";s:5:\"peter\";… name peter age 10 <NA> <NA>
#> 3 3 "{s:4:\"name\";s:4:\"susy\";s… name susy age 100 <NA> <NA>
#> 4 4 "{s:5:\"valid\";s:2:\"no\";s:… valid no name susy age 100
创建于2023年5月15日,使用reprex v2.0.2
英文:
Looks like a serialized PHP object. Here's a slightly different regex approach, it still assumes consistent structure but allows us to verify consistency in key columns and does handle punctuation in names and ages (added offending record to input dataset):
library(dplyr)
library(tidyr)
library(stringr)
df <- data.frame(ID = c(1,2,3,4),
information = c("{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
"{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
"{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}",
"{s:5:\"valid\";s:2:\"no\";s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}")
)
df %>%
# match anything but double-quotes that are between ':"' and '";',
mutate(extr = str_extract_all(information, '(?<=:")[^"]+(?=";)')) %>%
unnest_wider(extr, names_sep = "_")
#> # A tibble: 4 × 8
#> ID information extr_1 extr_2 extr_3 extr_4 extr_5 extr_6
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 "{s:4:\"name\";s:3:\"max\";s:… name max age 8 <NA> <NA>
#> 2 2 "{s:4:\"name\";s:5:\"peter\";… name peter age 10 <NA> <NA>
#> 3 3 "{s:4:\"name\";s:4:\"susy\";s… name susy age 100 <NA> <NA>
#> 4 4 "{s:5:\"valid\";s:2:\"no\";s:… valid no name susy age 100
<sup>Created on 2023-05-15 with reprex v2.0.2</sup>
答案3
得分: 1
用分隔符拆分并读取为分隔文本,保留我们需要的列:
read.delim(text = gsub("";"", "":"", df$information, fixed = TRUE),
sep = "":"", header = FALSE)[, c(3, 6, 9, 12)]
# V3 V6 V9 V12
# 1 name max age 8
# 2 name peter age 10
# 3 name susy age 100
**注意:** 有一个模式 `seq(3, 12, 3)` 用于自动化列选择。
英文:
Split on delimiter and read as delimited text, keep columns we need:
read.delim(text = gsub(";", ":", df$information, fixed = TRUE),
sep = ":", header = FALSE)[, c(3, 6, 9, 12)]
# V3 V6 V9 V12
# 1 name max age 8
# 2 name peter age 10
# 3 name susy age 100
Note: there is a pattern seq(3, 12, 3)
to automate the column selection.
答案4
得分: 1
If you still have access to the complete serialized strings, you could also use PHP to transform to a portable intermediate format like JSON, which R can parse. Then you won't have to take care of all the complex escaping rules in those strings. For example:
library(tidyverse)
unserialize_php <- function(information) {
system2("php", c("-r", shQuote("$output = []; while(($line=fgets(STDIN))!==false) {$new_output = unserialize($line); $output[] = $new_output ? $new_output : (object)[];} echo(json_encode($output));")), stdout = TRUE, stderr = "", input = information) %>% fromJSON(simplifyDataFrame = TRUE)
}
df <- tibble(
ID = 1:3,
information = unserialize_php(
c("a:2:{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
"a:2:{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
"a:2:{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}")
)
)
df
# # A tibble: 3 x 2
# ID information$name $age
# <int> <chr> <chr>
# 1 1 max 8
# 2 2 peter 10
# 3 3 susy 100
The critical piece here is the unserialize_php
function. It forwards all values of the argument information
(or, the information
column of the data frame df
, respectively) as separate lines to a temporary PHP instance. The script to be run in PHP is supplied as a command-line argument (-r
). The script then takes the input line by line, unserializes each line, adds them to a temporary array($output
) and finally outputs the array as JSON. PHP writes to stdout and R picks it up and returns it as the return value of system2
(because stdin = TRUE
). fromJSON
from the jsonlite
package parses the JSON data and transforms it to a data frame. If the string representations passed to the function are malformed, you will see PHP errors in the R console and the respective lines will be empty / NAs.
Note: You need the complete serialized strings, as originally produced by PHP. I edited your original values to what those should look like. Also, this approach requires that R can call PHP via command line on your machine.
英文:
If you still have access to the complete serialized strings, you could also use PHP to transform to a portable intermediate format like JSON, which R can parse. Then you won't have to take care of all the complex escaping rules in those strings. For example:
library(tidyverse)
unserialize_php <- function(information) {
system2("php", c("-r", shQuote("$output = []; while(($line=fgets(STDIN))!==false) {$new_output = unserialize($line); $output[] = $new_output ? $new_output : (object)[];} echo(json_encode($output));")), stdout = TRUE, stderr = "", input = information) %>% fromJSON(simplifyDataFrame = TRUE)
}
df <- tibble(
ID = 1:3,
information = unserialize_php(
c("a:2:{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";};",
"a:2:{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";};",
"a:2:{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";};")
)
)
df
# # A tibble: 3 × 2
# ID information$name $age
# <int> <chr> <chr>
# 1 1 max 8
# 2 2 peter 10
# 3 3 susy 100
The critical piece here is the unserialize_php
function. It forwards all values of the argument information
(or, the information
column of the data frame df
, respectively) as separate lines to a temporary PHP instance. The script to be run in PHP is supplied as a command-line argument (-r
). The script then takes the input line by line, unserializes each line, adds them to a temporary array($output
) and finally outputs the array as JSON. PHP writes to stdout and R picks it up and returns it as the return value of system2
(because stdin = TRUE
). fromJSON
from the jsonlite
package parses the JSON data and transforms it to a data frame. If the string representations passed to the function are malformed, you will see PHP errors in the R console and the respective lines will be empty / NAs.
Note: You need the complete serialized strings, as originally produced by PHP. I edited your original values to what those should look like. Also, this approach requires that R can call PHP via command line on your machine.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论