How to read a CSV in R with backtick as string encloser and ¥ as escape character?

Question

I have CSV data with the backtick (`) as the string encloser and the yen symbol (¥) as the escape character.

Example:

[Images: three screenshots of the sample data, not reproduced here]

I tried reading the raw file and replacing the yen symbol with a backslash, but it did not work.

fl <- readLines("data.csv", encoding = "UTF-8")
fl2 <- gsub('¥', "\\", fl)
writeLines(fl2, "Edited_data.txt")
sms_data <- fread("Edited_data.txt", sep = ",", stringsAsFactors = FALSE, quote = "\`", dec = ".", encoding = "UTF-8")

Expected data frame:

[Image: screenshot of the expected data frame, not reproduced here]

Answer 1

Score: 1

I couldn't access your data since it's an image, but here's a version with readr:

library(readr)
dt <- "Sentence, Value1, Value2\n`This is the first row`, 0, 0\n`This , this is something else with a comma¥`, 0, 0"

# We can read your data, respecting the strings within `` and keeping the `¥` symbol.
dt_read <- read_csv(dt, quote = "`")
dt_read
#> # A tibble: 2 x 3
#>   Sentence                                    Value1 Value2
#>   <chr>                                        <dbl>  <dbl>
#> 1 This is the first row                            0      0
#> 2 This , this is something else with a comma¥      0      0

# Then, we just replace that symbol with nothing
dt_read$Sentence <- gsub("¥", "", dt_read$Sentence)
dt_read
#> # A tibble: 2 x 3
#>   Sentence                                   Value1 Value2
#>   <chr>                                       <dbl>  <dbl>
#> 1 This is the first row                           0      0
#> 2 This , this is something else with a comma      0      0

Answer 2

Score: 1

You can change the escape sequence to whatever you like and change it back once you read the text in. I have reproduced your data here:

yen <- c("Sentence,Value1,Value2", 
         "`ML Taper, Triology TM`,0,0", 
         "90481 3TBS/¥`10TRYS/1SR PAUL/JOE,0,0", 
         "`D/3,E/4`,0,0")
writeLines(yen, path.expand("~/yen.csv"))

Now the code:

library(data.table)

# Read data without specifying encoding to handle ANSI or UTF-8 yen signs
fl <- readLines(path.expand("~/yen.csv"))

# The yen sign is 0xc2 0xa5 in UTF-8 and 0xa5 in ANSI (Latin-1),
# so convert any UTF-8 yen signs to the single-byte form
utf8_yen <- rawToChar(as.raw(c(0xc2, 0xa5)))
ansi_yen <- rawToChar(as.raw(0xa5))
fl <- gsub(utf8_yen, ansi_yen, fl)

# Paste on our backtick to get the backtick escape
yen_tick <- paste0(ansi_yen, "`")

# Change the backtick escape to a placeholder, then remove all remaining yen signs
fl2 <- gsub(yen_tick, "&backtick;", fl)
fl2 <- gsub(ansi_yen, "", fl2)

# Save our modified strings and reload them as a data frame
writeLines(fl2, path.expand("~/Edited_data.txt"))
sms_data <- fread(path.expand("~/Edited_data.txt"),
                  sep = ",", stringsAsFactors = FALSE, quote = "\`", dec = ".")

# Now we can unescape our backticks and we're done
sms_data$Sentence <- gsub("&backtick;", "`", sms_data$Sentence)

So now we have:

sms_data
#>                           Sentence Value1 Value2
#> 1:           ML Taper, Triology TM      0      0
#> 2: 90481 3TBS/`10TRYS/1SR PAUL/JOE      0      0
#> 3:                         D/3,E/4      0      0
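
For what it's worth, the same placeholder trick can probably be done without a temporary file. Below is a minimal sketch in base R. It assumes the lines are already read in as UTF-8, that `¥` only ever escapes a backtick (a doubled `¥¥` is not handled), and that the placeholder text never occurs in the real data. `read_yen_csv` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper (not from any package): parse CSV lines that use
# backtick as the quote character and the yen sign as the escape character.
read_yen_csv <- function(lines, placeholder = "&backtick;") {
  yen <- "\u00a5"  # the yen sign, written as a Unicode escape

  # 1. Protect escaped backticks by swapping them for a placeholder
  lines <- gsub(paste0(yen, "`"), placeholder, lines, fixed = TRUE)

  # 2. Drop any remaining yen signs (assumption: they carry no meaning)
  lines <- gsub(yen, "", lines, fixed = TRUE)

  # 3. Parse in memory, treating the backtick as the only quote character
  df <- read.csv(text = lines, quote = "`", stringsAsFactors = FALSE)

  # 4. Restore literal backticks in every character column
  for (nm in names(df)[vapply(df, is.character, logical(1))]) {
    df[[nm]] <- gsub(placeholder, "`", df[[nm]], fixed = TRUE)
  }
  df
}

# Same sample rows as above
yen_lines <- c("Sentence,Value1,Value2",
               "`ML Taper, Triology TM`,0,0",
               "90481 3TBS/\u00a5`10TRYS/1SR PAUL/JOE,0,0",
               "`D/3,E/4`,0,0")
sms_data2 <- read_yen_csv(yen_lines)
```

Using `fixed = TRUE` keeps the yen sign and the placeholder from being interpreted as regular expressions, and `read.csv(text = ...)` avoids writing `Edited_data.txt` to disk. For a real file you would pass something like `readLines("data.csv", encoding = "UTF-8")` as `lines`.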

Posted by huangapple on January 3, 2020 at 13:42:41
Source: https://go.coder-hub.com/59573707.html