How to read a CSV in R with a backtick as the string encloser and ¥ as the escape character?
Question
I have CSV data with a backtick (`) as the string encloser and the yen symbol (¥) as the escape character.
Example:
I tried reading the raw file and replacing the yen symbol with a backslash, but it did not work.
fl <- readLines("data.csv", encoding = "UTF-8")
fl2 <- gsub('¥', "\\", fl)
writeLines(fl2, "Edited_data.txt")
sms_data <- fread("Edited_data.txt", sep = ",", stringsAsFactors = FALSE, quote = "\`", dec = ".", encoding = "UTF-8")
Expected data frame:
Answer 1
Score: 1
I couldn't access your data since it's an image, but here's a version using readr:
library(readr)
dt <- "Sentence, Value1, Value2\n`This is the first row`, 0, 0\n`This , this is something else with a comma¥`, 0, 0"
# Read the data, treating text inside backticks as quoted strings; the `¥` symbol is kept as-is.
dt_read <- read_csv(dt, quote = "`")
dt_read
#> # A tibble: 2 x 3
#>   Sentence                                    Value1 Value2
#>   <chr>                                        <dbl>  <dbl>
#> 1 This is the first row                            0      0
#> 2 This , this is something else with a comma¥      0      0
# Then, we just replace that symbol with nothing
dt_read$Sentence <- gsub("¥", "", dt_read$Sentence)
dt_read
#> # A tibble: 2 x 3
#>   Sentence                                   Value1 Value2
#>   <chr>                                       <dbl>  <dbl>
#> 1 This is the first row                           0      0
#> 2 This , this is something else with a comma      0      0
Answer 2
Score: 1
You can change the escape sequence to whatever you like and change it back once you read the text in. I have reproduced your data here:
yen <- c("Sentence,Value1,Value2",
"`ML Taper, Triology TM`,0,0",
"90481 3TBS/¥`10TRYS/1SR PAUL/JOE,0,0",
"`D/3,E/4`,0,0")
writeLines(yen, path.expand("~/yen.csv"))
Now the code:
library(data.table)
# Read data without specifying encoding to handle ANSI or UTF8 yens
fl <- readLines(path.expand("~/yen.csv"))
# The yen symbol is 0xc2 0xa5 in UTF-8 and 0xa5 in ANSI; normalise both to the single-byte form
utf8_yen <- rawToChar(as.raw(c(0xc2, 0xa5)))
ansi_yen <- rawToChar(as.raw(0xa5))
fl <- gsub(utf8_yen, ansi_yen, fl)
# Paste on our backtick to get the backtick escape
yen_tick <- paste0(ansi_yen, "`")
# Replace the backtick escape with a placeholder, then remove all remaining yen symbols
fl2 <- gsub(yen_tick, "&backtick;", fl)
fl2 <- gsub(ansi_yen, "", fl2)
# Save our modified string and reload it as a dataframe
writeLines(fl2, path.expand("~/Edited_data.txt"))
sms_data <- fread(path.expand("~/Edited_data.txt"),
sep = ",", stringsAsFactors = FALSE, quote = "\`", dec = ".")
# Now we can unescape our backticks and we're done
sms_data$Sentence <- gsub("&backtick;", "`", sms_data$Sentence)
So now we have:
sms_data
#>                           Sentence Value1 Value2
#> 1:           ML Taper, Triology TM      0      0
#> 2: 90481 3TBS/`10TRYS/1SR PAUL/JOE      0      0
#> 3:                         D/3,E/4      0      0
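If you would rather not write a temporary file, the same placeholder idea can probably be done in memory with base R. This is only a sketch: `read_yen_csv()` and its `placeholder` argument are names I made up for illustration, it uses `read.csv()` instead of `fread()`, and it assumes the lines are already valid UTF-8 strings (so it skips the ANSI/UTF-8 byte normalisation above) and that the placeholder text never occurs in the real data.

```r
# Hypothetical helper (not part of the original answer): parse backtick-quoted
# CSV lines in which a yen symbol followed by a backtick means a literal backtick.
read_yen_csv <- function(lines, placeholder = "&backtick;") {
  yen <- "\u00a5"  # assumption: input is UTF-8, so the yen sign is U+00A5

  # Protect escaped backticks so they are not parsed as quote characters
  protected <- gsub(paste0(yen, "`"), placeholder, lines, fixed = TRUE)
  # Drop any remaining yen symbols
  protected <- gsub(yen, "", protected, fixed = TRUE)

  # Parse in memory, using the backtick as the only quote character
  df <- read.csv(text = protected, quote = "`", stringsAsFactors = FALSE)

  # Restore the literal backticks in every character column
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) {
    gsub(placeholder, "`", x, fixed = TRUE)
  })
  df
}

yen_lines <- c("Sentence,Value1,Value2",
               "`ML Taper, Triology TM`,0,0",
               "90481 3TBS/\u00a5`10TRYS/1SR PAUL/JOE,0,0",
               "`D/3,E/4`,0,0")
sms_data <- read_yen_csv(yen_lines)
sms_data$Sentence
```

Using `fixed = TRUE` keeps the yen symbol, the backtick and the placeholder from being interpreted as regular-expression syntax, and restoring the backticks across all character columns means the helper does not depend on the column being called `Sentence`.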