R doesn't see data frame as characters; can't grep; wrong use of as.character()?
Question
EDIT, A DAY LATER: The answers to this question have shown me that I needed to edit my code quite substantially. So basically the question has now gone away because I'm not `grep`ping in a data frame any more. The code is now as follows, much cleaner.
I'm leaving the original question here in case my learning process helps anyone, though.
```r
# 1. Find lines containing "un" followed later by "ɛ̃"
original_lines <- readLines('Test.txt')
lines_with_pattern <- grep('un.*ɛ̃', original_lines, value = TRUE)
# CHANGE PHONES TO FIND AND PHONES TO ADD
# 2. Duplicate the line in which the pattern occurs and change the relevant phoneme
modified_lines <- character()
for (line in lines_with_pattern) {
  modified_lines <- c(modified_lines, gsub("ɛ̃", "œ̃", line))
}
# 3. Combine modified lines with original lines
all_lines <- c(original_lines, modified_lines)
# 4. Sort the lines alphabetically
sorted_lines <- sort(all_lines)
# 5. Write the sorted lines to a file
writeLines(sorted_lines, 'myfile.txt', sep = '\n')
```
ORIGINAL QUESTION
I am trying to grep a data frame consisting of rows of two columns with tab separation between the columns, e.g.
```
   V1            V2
17 nempruntèrent ɑ̃ p ʁ ɛ̃ t ɛ ʁ
18 vemprunté     ɑ̃ p ʁ ɛ̃ t e
19 fempruntée    ɑ̃ p ʁ ɛ̃ t e
20 wempruntées   ɑ̃ p ʁ ɛ̃ t e
21 2empruntés    ɑ̃ p ʁ ɛ̃ t e
```
(excerpt--the last five rows of the data frame. The first col contains dummy French-like words; the second col contains International Phonetic Alphabet transcriptions.)
```r
test <- read.delim('Test.txt', header=FALSE)
print(test)
```
produces a printout as above, so it looks as if R 'knows' what's in the data frame.
But then I want to `grep` for certain strings, so I've tried
```r
# 1. Find lines containing both "un" and "ɛ̃"
lines_with_pattern <- grep("un", test, value = TRUE)
print(lines_with_pattern)
```
and this doesn't work.
The above `grep` results in `named character(0)`. This means R is not finding the characters it's looking for, so I've tried
```r
test <- read.delim('Test.txt', header=FALSE)
test <- as.character(test)
```
I don't think I'm using `as.character()` right, as that snippet produces e.g.
```
      V1 V2
[17,] NA NA
[18,] NA NA
[19,] NA NA
[20,] NA NA
[21,] NA NA
```
(again the last five rows of the output)
and therefore `print(test)` produces
```
[1] "c(17, 18, 12, 20, 1)"
[2] "c(8, 6, 6, 6, 6)"
```
(last five figures in the resulting vectors)
and
```r
lines_with_pattern <- grep("un", test, value = TRUE)
# value = TRUE, fixed = FALSE, useBytes = TRUE, invert = FALSE)
print(lines_with_pattern)
```
produces `character(0)`.
So: I don't understand the vectors that `print(test)` produces in the example just above--the numbers don't seem to refer to anything corresponding to the data. And, my original question: what do I need to do to be able to `grep` this data set?
Sorry for the very long message, and for the noob question, but thanks a lot for any help!
Answer 1
Score: 3
`grep()` cannot be used as-is on data frames. Moreover, data frames do not work well for row-wise operations, which is what you seem to be interested in doing. In base R, I would use a combination of `apply()` (which will do an implicit conversion of your data to a character matrix) and `grepl()`:
```r
which(apply(
  test,
  MARGIN = 1,
  FUN = function(x) any(grepl("un", x)) & any(grepl("ɛ̃", x))
))
```
This will give you all the rows where both "un" and "ɛ̃" appear. The use of `which()` is to get row numbers rather than a logical vector (which also works perfectly fine for subsetting on its own).
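As a rough illustration of the points above, here is a minimal sketch on made-up data. The three rows, the `bonjour` entry and the variable names `flattened`, `rows_apply` and `rows_cols` are assumptions for the example, not part of the original `Test.txt`. It sets `stringsAsFactors = TRUE` to mimic the default that `read.delim()` had before R 4.0.0, which is most likely why `as.character(test)` in the question printed vectors of numbers: they look like factor codes rather than the words themselves. The last part shows a column-wise alternative, assuming "un" is searched in `V1` and "ɛ̃" in `V2`.

```r
# Toy data standing in for Test.txt (hypothetical rows modelled on the question).
test <- data.frame(
  V1 = c("nempruntèrent", "vemprunté", "bonjour"),
  V2 = c("ɑ̃ p ʁ ɛ̃ t ɛ ʁ", "ɑ̃ p ʁ ɛ̃ t e", "b ɔ̃ ʒ u ʁ"),
  stringsAsFactors = TRUE  # mimics the read.delim() default before R 4.0.0
)

# grep() coerces a data frame with as.character(), which produces one string
# per column (a deparsed vector), not one string per cell.
flattened <- as.character(test)
length(flattened)  # equals the number of columns

# Row-wise approach from the answer: apply() works on a character matrix.
rows_apply <- which(apply(
  test,
  MARGIN = 1,
  FUN = function(x) any(grepl("un", x)) & any(grepl("ɛ̃", x))
))

# Column-wise alternative: grepl() is vectorised over a character vector,
# so each column can be tested directly and the results combined.
rows_cols <- which(
  grepl("un", as.character(test$V1)) & grepl("ɛ̃", as.character(test$V2))
)

# Subset the matching rows.
test[rows_cols, ]
```

The column-wise form avoids the matrix conversion and makes explicit which column each pattern is expected in. The `apply()` form is more general when a pattern may appear in any column.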