R doesn't see data frame as characters; can't grep; wrong use of as.character()?

Question

EDIT, A DAY LATER: The answers to this question have shown me that I needed to edit my code quite substantially. So basically the question has now gone away because I'm not grepping in a data-frame any more. The code is now as follows, much cleaner.

I'm leaving the original question here in case my learning process helps anyone, though.

    # 1. Find lines containing "un" followed later by "ɛ̃"
    original_lines <- readLines('Test.txt')
    lines_with_pattern <- grep('un.*ɛ̃', original_lines, value = TRUE)
    # CHANGE PHONES TO FIND AND PHONES TO ADD
    # 2. Duplicate each line in which the pattern occurs and change the relevant phoneme
    modified_lines <- character()
    for (line in lines_with_pattern) {
      modified_lines <- c(modified_lines, gsub("ɛ̃", "œ̃", line))
    }
    # 3. Combine modified lines with original lines
    all_lines <- c(original_lines, modified_lines)
    # 4. Sort the lines alphabetically
    sorted_lines <- sort(all_lines)
    # 5. Write the sorted lines to a file
    writeLines(sorted_lines, 'myfile.txt', sep = '\n')
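
Since `gsub()` is vectorised over its input, the loop in step 2 can probably be dropped. The following is a minimal sketch of that idea, not a tested replacement for the code above. The two sample lines are made up and stand in for the contents of `Test.txt`.

```r
# Hypothetical sample lines standing in for readLines('Test.txt')
original_lines <- c("nemprunté\tɑ̃ p ʁ ɛ̃ t e", "bonjour\tb ɔ̃ ʒ u ʁ")

# Keep only the lines where "un" is followed later by "ɛ̃"
lines_with_pattern <- grep("un.*ɛ̃", original_lines, value = TRUE)

# gsub() processes every element of the vector, so no explicit loop is needed
modified_lines <- gsub("ɛ̃", "œ̃", lines_with_pattern)

# Combine and sort as in steps 3 and 4
sorted_lines <- sort(c(original_lines, modified_lines))
print(sorted_lines)
```

This also avoids growing `modified_lines` one element at a time, which only matters for large files.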

ORIGINAL QUESTION

I am trying to grep a data frame consisting of rows of two columns with tab separation between the columns, e.g.

       V1              V2
    17 nempruntèrent   ɑ̃ p ʁ ɛ̃ t ɛ ʁ
    18 vemprunté       ɑ̃ p ʁ ɛ̃ t e
    19 fempruntée      ɑ̃ p ʁ ɛ̃ t e
    20 wempruntées     ɑ̃ p ʁ ɛ̃ t e
    21 2empruntés      ɑ̃ p ʁ ɛ̃ t e

(excerpt--the last five rows of the data frame. The first col contains dummy French-like words; the second col contains International Phonetic Alphabet transcriptions.)

    test <- read.delim('Test.txt', header=FALSE)
    print(test)

produces a printout as above, so it looks as if R 'knows' what's in the data frame.

But then I want to grep for certain strings, so I've tried

    # 1. Find lines containing both "un" and "ɛ̃"
    lines_with_pattern <- grep("un", test, value = TRUE)
    print(lines_with_pattern)

and this doesn't work.

The above grep results in named character(0). This means R is not finding the characters it's looking for, so I've tried

    test <- read.delim('Test.txt', header=FALSE)
    test <- as.character(test)

I don't think I'm using as.character() right, as that snippet produces e.g.

          V1 V2
    [17,] NA NA
    [18,] NA NA
    [19,] NA NA
    [20,] NA NA
    [21,] NA NA

(again the last five rows of the output)

and therefore print(test) produces

    [1] "c(17, 18, 12, 20, 1)"
    [2] "c(8, 6, 6, 6, 6)"

(last five figures in the resulting vectors)

and

    lines_with_pattern <- grep("un", test, value = TRUE)
    # value = TRUE, fixed = FALSE, useBytes = TRUE, invert = FALSE)
    print(lines_with_pattern)

produces character(0).

So: I don't understand the vectors that print(test) produces in the example just above--the numbers don't seem to refer to anything corresponding to the data. And, my original question: what do I need to do to be able to grep this data-set?

Sorry for the very long message, and for the noob question, but thanks a lot for any help!

Answer 1

Score: 3

`grep()` cannot be used as-is on data frames. Moreover, data frames do not work well for row-wise operations, which is what you seem to be interested in doing. In base R, I would use a combination of `apply()` (which will do an implicit conversion of your data to a character matrix) and `grepl()`:

```r
which(apply(
  test,
  MARGIN = 1,
  FUN = function(x) any(grepl("un", x)) & any(grepl("ɛ̃", x))
))
```

This will give you all the rows where both "un" and "ɛ̃" appear. The use of which() is to get row numbers rather than a logical vector (which also works perfectly fine for subsetting on its own).
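
As for the puzzling `"c(17, 18, 12, 20, 1)"` strings: they are consistent with the columns having been read as factors, which was the default for `read.delim()` before R 4.0.0. In that case `as.character()` applied to the whole data frame turns each column into a single string built from its integer factor codes, so there is no text left to match. That is an inference from the output shown, not something the question confirms.

If the two patterns are known to live in specific columns, an alternative is to search the columns directly, since `grepl()` works on character vectors. The sketch below assumes "un" is looked for in `V1` and "ɛ̃" in `V2`. The toy rows are invented and only modelled on the excerpt.

```r
# Toy data standing in for read.delim('Test.txt', header = FALSE);
# stringsAsFactors = FALSE keeps the columns as plain character vectors
test <- data.frame(
  V1 = c("nempruntèrent", "vemprunté", "bonjour"),
  V2 = c("ɑ̃ p ʁ ɛ̃ t ɛ ʁ", "ɑ̃ p ʁ ɛ̃ t e", "b ɔ̃ ʒ u ʁ"),
  stringsAsFactors = FALSE
)

# One logical value per row: spelling contains "un" AND transcription contains "ɛ̃"
hits <- grepl("un", test$V1, fixed = TRUE) & grepl("ɛ̃", test$V2, fixed = TRUE)

# Subset the data frame with the logical vector
matching_rows <- test[hits, ]
print(matching_rows)
```

Unlike the `apply()` version, this does not look for either pattern in both columns, so it is only equivalent when each pattern can occur in just one of them.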
