January 3, 2020, 13:42:41

How to read a CSV in R with backtick as string encloser and ¥ as escape character?

Question

I have CSV data with the backtick (`) as the string encloser and the yen symbol (¥) as the escape character.

Example:

I tried reading the raw file and replacing the yen symbol with a backslash, but it did not work.

fl <- readLines("data.csv", encoding = "UTF-8")
fl2 <- gsub('¥', "\\", fl)
writeLines(fl2, "Edited_data.txt")
sms_data <- fread("Edited_data.txt", sep = ",", stringsAsFactors = FALSE, quote = "\`", dec = ".", encoding = "UTF-8")
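
One detail of the gsub() call is worth checking in isolation. In a regex replacement string, R treats the backslash as an escape character, so a replacement written as "\\" (a single backslash) appears to be dropped rather than inserted. The sketch below uses a made-up test string, not the asker's file, and shows two ways to get a literal backslash. Whether fread() would then treat that backslash as an escape for a backtick quote is a separate question that this sketch does not settle.

```r
x <- "a\u00a5b"  # "a¥b", with the yen sign written as a Unicode escape

# A lone backslash at the end of a regex replacement is dropped,
# so this deletes the yen sign instead of inserting a backslash.
dropped <- gsub("\u00a5", "\\", x)

# Two ways to get a literal backslash in the result:
escaped <- gsub("\u00a5", "\\\\", x)              # escape it for the regex engine
literal <- gsub("\u00a5", "\\", x, fixed = TRUE)  # or switch off regex handling

print(c(dropped, escaped, literal))
```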

Expected data frame:

Answer 1

Score: 1

I couldn't access your data since it's an image, but here's a version with readr:

library(readr)
dt <- "Sentence, Value1, Value2\n`This is the first row`, 0, 0\n`This , this is something else with a comma¥`, 0, 0"
# We can read your data, respecting the strings within `` and keeping the `¥` symbol.
dt_read <- read_csv(dt, quote = "`")
dt_read
#> # A tibble: 2 x 3
#>   Sentence                                    Value1 Value2
#>   <chr>                                        <dbl>  <dbl>
#> 1 This is the first row                            0      0
#> 2 This , this is something else with a comma¥      0      0
# Then, we just replace that symbol with nothing
dt_read$Sentence <- gsub("¥", "", dt_read$Sentence)
dt_read
#> # A tibble: 2 x 3
#>   Sentence                                   Value1 Value2
#>   <chr>                                       <dbl>  <dbl>
#> 1 This is the first row                           0      0
#> 2 This , this is something else with a comma      0      0

Answer 2

Score: 1

You can change the escape sequence to whatever you like and change it back once you read the text in. I have reproduced your data here:

yen <- c("Sentence,Value1,Value2",
         "`ML Taper, Triology TM`,0,0",
         "90481 3TBS/¥`10TRYS/1SR PAUL/JOE,0,0",
         "`D/3,E/4`,0,0")
writeLines(yen, path.expand("~/yen.csv"))

Now the code:

library(data.table)
# Read data without specifying encoding to handle ANSI or UTF-8 yens
fl <- readLines(path.expand("~/yen.csv"))
# The yen symbol is 0xc2 0xa5 in UTF-8 and 0xa5 in ANSI,
# so we normalize every yen to the single-byte ANSI form
utf8_yen <- rawToChar(as.raw(c(0xc2, 0xa5)))
ansi_yen <- rawToChar(as.raw(0xa5))
fl <- gsub(utf8_yen, ansi_yen, fl)
# Paste on our backtick to get the backtick escape
yen_tick <- paste0(ansi_yen, "`")
# Change the backtick escape to a placeholder, then remove all remaining yen symbols
fl2 <- gsub(yen_tick, "&backtick;", fl)
fl2 <- gsub(ansi_yen, "", fl2)
# Save our modified strings and reload them as a data frame
writeLines(fl2, path.expand("~/Edited_data.txt"))
sms_data <- fread(path.expand("~/Edited_data.txt"),
                  sep = ",", stringsAsFactors = FALSE, quote = "\`", dec = ".")
# Now we can unescape our backticks and we're done
sms_data$Sentence <- gsub("&backtick;", "`", sms_data$Sentence)

So now we have:

sms_data
#>                           Sentence Value1 Value2
#> 1:           ML Taper, Triology TM      0      0
#> 2: 90481 3TBS/`10TRYS/1SR PAUL/JOE      0      0
#> 3:                         D/3,E/4      0      0
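
The same placeholder idea can be sketched in base R without temporary files. This is only an illustration under stated assumptions, not the answerer's code: the helper name read_yen_csv and its placeholder argument are hypothetical, the input is assumed to be UTF-8 text already read into a character vector, and, as in the answer above, every yen sign that does not precede a backtick is simply discarded.

```r
# Hypothetical helper (not from the answers): parse CSV lines that use a
# backtick as the quote character and the yen sign as the escape character.
read_yen_csv <- function(lines, placeholder = "&backtick;") {
  yen <- "\u00a5"  # the yen sign, written as a Unicode escape
  # 1. Protect escaped backticks so the parser does not treat them as quotes.
  lines <- gsub(paste0(yen, "`"), placeholder, lines, fixed = TRUE)
  # 2. Drop the remaining escape characters (same simplification as above).
  lines <- gsub(yen, "", lines, fixed = TRUE)
  # 3. Parse from memory, using the backtick as the only quote character.
  df <- read.csv(text = lines, quote = "`", stringsAsFactors = FALSE)
  # 4. Restore literal backticks in every character column.
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], gsub, pattern = placeholder,
                       replacement = "`", fixed = TRUE)
  df
}

# Usage with the sample rows reproduced in the answer above.
yen_lines <- c("Sentence,Value1,Value2",
               "`ML Taper, Triology TM`,0,0",
               "90481 3TBS/\u00a5`10TRYS/1SR PAUL/JOE,0,0",
               "`D/3,E/4`,0,0")
sms_data <- read_yen_csv(yen_lines)
print(sms_data)
```

Using fixed = TRUE keeps the yen sign and the placeholder from being interpreted as regular expressions. The placeholder must be a string that never occurs in the real data; "&backtick;" is reused here only because the answer chose it.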
