How to read a CSV in R with backtick as string encloser and ¥ as escape character?


Question

I have CSV data with the backtick (`) as the string encloser and the yen symbol (¥) as the escape character.

Example (posted as three images in the original; not reproduced here)

I tried reading the raw file and replacing the yen symbol with a backslash, but it does not work.

    fl <- readLines("data.csv", encoding = "UTF-8")
    fl2 <- gsub('¥', "\\", fl)
    writeLines(fl2, "Edited_data.txt")
    sms_data <- fread("Edited_data.txt", sep = ",", stringsAsFactors = FALSE, quote = "\`", dec = ".", encoding = "UTF-8")

Expected data frame (posted as an image in the original; not reproduced here)

Answer 1

Score: 1

I couldn't access your data since it was posted as an image, but here is a version using readr:

    library(readr)

    dt <- "Sentence, Value1, Value2\n`This is the first row`, 0, 0\n`This , this is something else with a comma¥`, 0, 0"

    # Read the data, treating backtick-enclosed text as a single string;
    # the `¥` symbol is read in unchanged.
    dt_read <- read_csv(dt, quote = "`")
    dt_read
    #> # A tibble: 2 x 3
    #>   Sentence                                     Value1 Value2
    #>   <chr>                                         <dbl>  <dbl>
    #> 1 This is the first row                             0      0
    #> 2 This , this is something else with a comma¥       0      0

    # Then, we just replace that symbol with nothing
    dt_read$Sentence <- gsub("¥", "", dt_read$Sentence)
    dt_read
    #> # A tibble: 2 x 3
    #>   Sentence                                    Value1 Value2
    #>   <chr>                                        <dbl>  <dbl>
    #> 1 This is the first row                            0      0
    #> 2 This , this is something else with a comma       0      0
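If more than one column can contain the yen sign, the same clean-up can be applied to every character column rather than to `Sentence` alone. The following is a minimal sketch in base R; the data frame `df` and its `Note` column are made up for illustration and only stand in for the parsed result.

```r
# Stand-in for the parsed result (hypothetical data, not the asker's file)
df <- data.frame(
  Sentence = c("This is the first row", "something else with a comma\u00a5"),
  Note     = c("\u00a5`quoted", "plain"),
  Value1   = c(0, 0),
  stringsAsFactors = FALSE
)

# Find the character columns, then strip the yen sign (U+00A5) from each.
# fixed = TRUE matches the symbol literally instead of as a regular expression.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(x) gsub("\u00a5", "", x, fixed = TRUE))

print(df)
```

Numeric columns are left untouched because only the columns flagged by `is.character` are rewritten.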

Answer 2

Score: 1

You can change the escape sequence to whatever you like and change it back once you read the text in. I have reproduced your data here:

    yen <- c("Sentence,Value1,Value2",
             "`ML Taper, Triology TM`,0,0",
             "90481 3TBS/¥`10TRYS/1SR PAUL/JOE,0,0",
             "`D/3,E/4`,0,0")
    writeLines(yen, path.expand("~/yen.csv"))
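Whether the yen sign arrives as one byte or two depends on how the file is encoded, which is why the code in this answer handles both forms. As a quick check, the sketch below prints the bytes for each encoding; the variable names are only illustrative, and "ANSI" is taken here to mean Latin-1.

```r
# Start from a string that is guaranteed to be UTF-8 encoded
yen_sign <- enc2utf8("\u00a5")

# In UTF-8 the yen sign occupies two bytes
utf8_bytes <- charToRaw(yen_sign)

# In Latin-1 it occupies a single byte; toRaw = TRUE returns a list of raw vectors
latin1_bytes <- iconv(yen_sign, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]

print(utf8_bytes)    # c2 a5
print(latin1_bytes)  # a5
```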

Now the code:

    library(data.table)

    # Read data without specifying encoding to handle ANSI or UTF-8 yens
    fl <- readLines(path.expand("~/yen.csv"))

    # The yen symbol is the two bytes 0xc2 0xa5 in UTF-8 and the single byte
    # 0xa5 in ANSI, so convert any UTF-8 yens to the single-byte form
    utf8_yen <- rawToChar(as.raw(c(0xc2, 0xa5)))
    ansi_yen <- rawToChar(as.raw(0xa5))
    fl <- gsub(utf8_yen, ansi_yen, fl)

    # Paste on our backtick to get the backtick escape
    yen_tick <- paste0(ansi_yen, "`")

    # Change the backtick escape to a placeholder, then remove all remaining yen symbols
    fl2 <- gsub(yen_tick, "&backtick;", fl)
    fl2 <- gsub(ansi_yen, "", fl2)

    # Save our modified strings and reload them as a data frame
    writeLines(fl2, path.expand("~/Edited_data.txt"))
    sms_data <- fread(path.expand("~/Edited_data.txt"),
                      sep = ",", stringsAsFactors = FALSE, quote = "\`", dec = ".")

    # Now we can unescape our backticks and we're done
    sms_data$Sentence <- gsub("&backtick;", "`", sms_data$Sentence)

So now we have:

    sms_data
    #>                           Sentence Value1 Value2
    #> 1:           ML Taper, Triology TM      0      0
    #> 2: 90481 3TBS/`10TRYS/1SR PAUL/JOE      0      0
    #> 3:                         D/3,E/4      0      0
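The same steps can be folded into one helper that skips the temporary file. This is only a sketch under a few assumptions: the lines are already in memory as UTF-8, data.table is installed, and the placeholder text never occurs in the real data. The function name `read_yen_csv` and its `placeholder` argument are hypothetical, not part of any package.

```r
library(data.table)

# Hypothetical helper: parse lines whose strings are enclosed in backticks
# and whose escape character is the yen sign (U+00A5).
read_yen_csv <- function(lines, placeholder = "&backtick;") {
  yen <- "\u00a5"
  # Protect escaped backticks first, then drop every remaining yen sign
  lines <- gsub(paste0(yen, "`"), placeholder, lines, fixed = TRUE)
  lines <- gsub(yen, "", lines, fixed = TRUE)
  # fread(text = ...) treats each element of the vector as one line
  dt <- fread(text = lines, sep = ",", quote = "`")
  # Restore the protected backticks in every character column
  chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]
  for (col in chr_cols) {
    set(dt, j = col, value = gsub(placeholder, "`", dt[[col]], fixed = TRUE))
  }
  dt
}

yen_lines <- c("Sentence,Value1,Value2",
               "`ML Taper, Triology TM`,0,0",
               "90481 3TBS/\u00a5`10TRYS/1SR PAUL/JOE,0,0",
               "`D/3,E/4`,0,0")
sms_data <- read_yen_csv(yen_lines)
print(sms_data)
```

Because every leftover yen sign is deleted, an escaped yen would be lost as well; if the data can contain one, it needs its own placeholder before the second `gsub()`.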

huangapple
  • Posted on January 3, 2020 at 13:42:41
  • Please keep this link when reposting: https://go.coder-hub.com/59573707.html