R doesn't see data frame as characters; can't grep; wrong use of as.character()?

Question


EDIT, A DAY LATER: The answers to this question have shown me that I needed to edit my code quite substantially. So basically the question has now gone away because I'm not grepping in a data-frame any more. The code is now as follows, much cleaner.

I'm leaving the original question here in case my learning process helps anyone, though.

# 1. Find lines containing both "un" and "ɛ̃"
original_lines <- readLines('Test.txt')
lines_with_pattern <- grep('un.*ɛ̃', original_lines, value = TRUE)

# CHANGE PHONES TO FIND AND PHONES TO ADD

# 2. Duplicate the line in which the pattern occurs and change the relevant phoneme
modified_lines <- character()
for (line in lines_with_pattern)
  modified_lines <- c(modified_lines, gsub("ɛ̃", "œ̃", line))

# 3. Combine modified lines with original lines
all_lines <- c(original_lines, modified_lines)

# 4. Sort the lines alphabetically
sorted_lines <- sort(all_lines)

# 5. Write the sorted lines to a file
writeLines(sorted_lines, 'myfile.txt', sep = '\n')
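
Since `gsub()` is vectorized over its input, step 2 can probably be written without the loop. The following is a minimal sketch of steps 1 to 4 under that assumption; the three sample lines are made up and stand in for the contents of 'Test.txt'.

```r
# Made-up stand-in for readLines('Test.txt'): word, tab, transcription.
original_lines <- c(
  "nempruntèrent\tɑ̃ p ʁ ɛ̃ t ɛ ʁ",
  "vemprunté\tɑ̃ p ʁ ɛ̃ t e",
  "bonjour\tb ɔ̃ ʒ u ʁ"
)

# 1. Lines where "un" is followed, somewhere later, by "ɛ̃"
lines_with_pattern <- grep("un.*ɛ̃", original_lines, value = TRUE)

# 2. gsub() handles the whole vector at once, so no for loop is needed
modified_lines <- gsub("ɛ̃", "œ̃", lines_with_pattern, fixed = TRUE)

# 3. and 4. Combine with the originals and sort
sorted_lines <- sort(c(original_lines, modified_lines))
print(sorted_lines)
```

The result can then be passed to `writeLines()` exactly as in step 5.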

ORIGINAL QUESTION

I am trying to grep a data frame consisting of rows of two columns with tab separation between the columns, e.g.

              V1            V2
17 nempruntèrent ɑ̃ p ʁ ɛ̃ t ɛ ʁ
18     vemprunté   ɑ̃ p ʁ ɛ̃ t e
19    fempruntée   ɑ̃ p ʁ ɛ̃ t e
20   wempruntées   ɑ̃ p ʁ ɛ̃ t e
21    2empruntés   ɑ̃ p ʁ ɛ̃ t e

(excerpt--the last five rows of the data frame. The first col contains dummy French-like words; the second col contains International Phonetic Alphabet transcriptions.)

test <- read.delim('Test.txt', header=FALSE)
print(test)

produces a printout as above, so it looks as if R 'knows' what's in the data frame.

But then I want to grep for certain strings, so I've tried

# 1. Find lines containing both "un" and "ɛ̃"
lines_with_pattern <- grep("un", test, value = TRUE)
print(lines_with_pattern)

and this doesn't work.

The above grep results in named character(0). This means R is not finding the characters it's looking for, so I've tried

test <- read.delim('Test.txt', header=FALSE)
test <- as.character(test)

I don't think I'm using as.character() right, as that snippet produces e.g.

     V1 V2
 
[17,] NA NA
[18,] NA NA
[19,] NA NA
[20,] NA NA
[21,] NA NA

(again the last five rows of the output)

and therefore print(test) produces

[1] "c(17, 18, 12, 20, 1)"
[2] "c(8, 6, 6, 6, 6)"

(last five figures in the resulting vectors)

and

lines_with_pattern <- grep("un", test, value = TRUE)
# value = TRUE, fixed = FALSE, useBytes = TRUE, invert = FALSE)
print(lines_with_pattern)

produces character(0).

So: I don't understand the vectors that print(test) produces in the example just above--the numbers don't seem to refer to anything corresponding to the data. And, my original question: what do I need to do to be able to grep this data-set?

Sorry for the very long message, and for the noob question, but thanks a lot for any help!
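
On the puzzling `print(test)` output: a likely explanation, assuming the columns were read in as factors (the default for `read.delim()` before R 4.0), is that `as.character()` applied to a whole data frame turns each column into a single string built from its integer factor codes. That would be consistent with strings such as "c(17, 18, 12, 20, 1)". `grep()` applies the same coercion to any input that is not a character vector, so it searches those strings rather than the words. A minimal sketch with made-up data:

```r
# Made-up stand-in for read.delim('Test.txt', header = FALSE).
# stringsAsFactors = TRUE is an assumption that mimics the pre-4.0 default.
test <- data.frame(
  V1 = c("nempruntèrent", "vemprunté", "bonjour"),
  V2 = c("ɑ̃ p ʁ ɛ̃ t ɛ ʁ", "ɑ̃ p ʁ ɛ̃ t e", "b ɔ̃ ʒ u ʁ"),
  stringsAsFactors = TRUE
)

# One string per column, built from the integer factor codes.
as_text <- as.character(test)
print(as_text)

# grep() coerces the data frame the same way, so "un" is never found.
hits_in_frame <- grep("un", test, value = TRUE)

# Searching a single column, converted to character, sees the real words.
hits_in_column <- grep("un", as.character(test$V1), value = TRUE)
print(hits_in_column)
```

If the columns are already character vectors (the default from R 4.0 onward), searching `test$V1` directly should behave the same way as `hits_in_column` here.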

Answer 1

Score: 3

`grep()` cannot be used as-is on data frames. Moreover, data frames do not work well for row-wise operations, which is what you seem to be interested in. In base R, I would use a combination of `apply()` (which implicitly converts your data to a character matrix) and `grepl()`:

```r
which(apply(
  test,
  MARGIN = 1,
  FUN = function(x) any(grepl("un", x)) & any(grepl("ɛ̃", x))
))
```

This will give you all the rows where both "un" and "ɛ̃" appear. The use of `which()` is to get row numbers rather than a logical vector (which also works perfectly fine for subsetting on its own).
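
To turn the row test into an actual subset, the logical vector returned by `apply()` can index the data frame directly. The sketch below uses made-up data. It also shows a column-wise variant, which assumes that "un" only needs to be found in `V1` and "ɛ̃" only in `V2`; that is an assumption about the data, not something stated in the question.

```r
# Made-up stand-in for the data frame read from 'Test.txt'.
test <- data.frame(
  V1 = c("nempruntèrent", "vemprunté", "bonjour"),
  V2 = c("ɑ̃ p ʁ ɛ̃ t ɛ ʁ", "ɑ̃ p ʁ ɛ̃ t e", "b ɔ̃ ʒ u ʁ"),
  stringsAsFactors = FALSE
)

# Row-wise test as in the answer, kept as a logical vector for subsetting.
keep <- apply(
  test,
  MARGIN = 1,
  FUN = function(x) any(grepl("un", x)) && any(grepl("ɛ̃", x))
)
matching_rows <- test[keep, ]
print(matching_rows)

# Column-wise variant (assumption: "un" lives in V1, "ɛ̃" in V2).
# grepl() is vectorized, so each column is searched in one call.
keep_by_column <- grepl("un", test$V1) & grepl("ɛ̃", test$V2, fixed = TRUE)
```

On this sample both approaches select the same rows; the column-wise form avoids the conversion to a character matrix that `apply()` performs.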

huangapple
  • Posted on June 5, 2023 at 19:49:24
  • Please keep the link to this article when reposting: https://go.coder-hub.com/76406125.html