June 5, 2023 19:49:24

R doesn't see data frame as characters; can't grep; wrong use of as.character()?

Question


EDIT, A DAY LATER: The answers to this question have shown me that I needed to edit my code quite substantially. So basically the question has now gone away because I'm not grepping in a data-frame any more. The code is now as follows, much cleaner.

I'm leaving the original question here in case my learning process helps anyone, though.

# 1. Find lines containing both "un" and "ɛ̃"
original_lines <- readLines('Test.txt')
lines_with_pattern <- grep('un.*ɛ̃', original_lines, value = TRUE)
# CHANGE PHONES TO FIND AND PHONES TO ADD
# 2. Duplicate the line in which the pattern occurs and change the relevant phoneme
modified_lines <- character()
for (line in lines_with_pattern)
  modified_lines <- c(modified_lines, gsub("ɛ̃", "œ̃", line))
# 3. Combine modified lines with original lines
all_lines <- c(original_lines, modified_lines)
# 4. Sort the lines alphabetically
sorted_lines <- sort(all_lines)
# 5. Write the sorted lines to a file
writeLines(sorted_lines, 'myfile.txt', sep = '\n')
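
Since `gsub()` is vectorised over its input, the loop in step 2 could probably be dropped. Below is a minimal sketch of the same pipeline. It assumes a made-up character vector in place of `readLines('Test.txt')`; the sample lines are illustrative only and are not taken from the actual file.

```r
# Made-up stand-in for readLines('Test.txt'): word, tab, transcription.
original_lines <- c(
  "vemprunté\tɑ̃ p ʁ ɛ̃ t e",
  "wabc\ta b c"
)

# 1. Lines where "un" is followed later by "ɛ̃"
lines_with_pattern <- grep("un.*ɛ̃", original_lines, value = TRUE)

# 2. gsub() processes the whole vector at once, so no loop is needed.
#    fixed = TRUE treats the pattern as a literal string.
modified_lines <- gsub("ɛ̃", "œ̃", lines_with_pattern, fixed = TRUE)

# 3-4. Combine the original and modified lines, then sort them
sorted_lines <- sort(c(original_lines, modified_lines))
print(sorted_lines)
```

The result can be passed to `writeLines()` exactly as in the code above.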

ORIGINAL QUESTION

I am trying to grep a data frame consisting of rows of two columns with tab separation between the columns, e.g.

              V1            V2
17 nempruntèrent ɑ̃ p ʁ ɛ̃ t ɛ ʁ
18     vemprunté   ɑ̃ p ʁ ɛ̃ t e
19    femprunté   ɑ̃ p ʁ ɛ̃ t e
20   wempruntés   ɑ̃ p ʁ ɛ̃ t e
21    2empruntés   ɑ̃ p ʁ ɛ̃ t e

(excerpt--the last five rows of the data frame. The first col contains dummy French-like words; the second col contains International Phonetic Alphabet transcriptions.)

test <- read.delim('Test.txt', header=FALSE)
print(test)

produces a printout as above, so it looks as if R 'knows' what's in the data frame.

But then I want to grep for certain strings, so I've tried

# 1. Find lines containing both "un" and "ɛ̃"
lines_with_pattern <- grep("un", test, value = TRUE)
print(lines_with_pattern)

and this doesn't work.

The above grep results in named character(0). This means R is not finding the characters it's looking for, so I've tried

test <- read.delim('Test.txt', header=FALSE)
test <- as.character(test)

I don't think I'm using as.character() right, as that snippet produces e.g.

     V1 V2
 
[17,] NA NA
[18,] NA NA
[19,] NA NA
[20,] NA NA
[21,] NA NA

(again the last five rows of the output)

and therefore print(test) produces

[1] "c(17, 18, 12, 20, 1)"
[2] "c(8, 6, 6, 6, 6)"

(last five figures in the resulting vectors)

and

lines_with_pattern <- grep("un", test, value = TRUE)
# value = TRUE, fixed = FALSE, useBytes = TRUE, invert = FALSE)
print(lines_with_pattern)

produces character(0).

So: I don't understand the vectors that print(test) produces in the example just above--the numbers don't seem to refer to anything corresponding to the data. And, my original question: what do I need to do to be able to grep this data-set?

Sorry for the very long message, and for the noob question, but thanks a lot for any help!

Answer 1

Score: 3

`grep()` cannot be used as-is on data frames. Moreover, data frames do not work well for row-wise operations, which is what you seem to be interested in doing. In base R, I would use a combination of `apply()` (which implicitly converts your data to a character matrix) and `grepl()`:
```r
which(apply(
  test,
  MARGIN = 1,
  FUN = function(x) any(grepl("un", x)) & any(grepl("ɛ̃", x))
))
```

This will give you all the rows where both `"un"` and `"ɛ̃"` appear. The use of `which()` is to get row numbers rather than a logical vector (which also works perfectly fine for subsetting on its own).
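
As for the puzzling `"c(17, 18, 12, 20, 1)"` vectors: the output in the question suggests the columns were read as factors, which was the default for `read.delim()` before R 4.0.0. When `grep()` is handed something that is not a character vector, it coerces it with `as.character()`. On a data frame that works column by column, reducing a factor column to a single string of its integer level codes. The sketch below reproduces this on a made-up data frame called `demo` (a hypothetical stand-in for `Test.txt`, with illustrative values) and shows a column-wise alternative to `apply()`.

```r
# Made-up stand-in for the data in Test.txt; the values are illustrative only.
demo <- data.frame(
  V1 = c("vemprunté", "femprunté", "wabc"),
  V2 = c("ɑ̃ p ʁ ɛ̃ t e", "ɑ̃ p ʁ ɛ̃ t e", "a b c"),
  stringsAsFactors = TRUE  # mimic the pre-4.0.0 default of read.delim()
)

# as.character() on a data frame yields one string per column; for a
# factor column that string holds the integer level codes, not the labels.
as_strings <- as.character(demo)
print(as_strings)

# grep() searches those per-column strings, so "un" is never found.
hits_df <- grep("un", demo, value = TRUE)
print(hits_df)

# Searching each column as a character vector works row by row.
# fixed = TRUE treats the patterns as literal strings.
keep <- grepl("un", as.character(demo$V1), fixed = TRUE) &
  grepl("ɛ̃", as.character(demo$V2), fixed = TRUE)
matching_rows <- demo[keep, ]
print(matching_rows)
```

Unlike the `apply()` version, this ties each pattern to a specific column, which may or may not be what is wanted. Reading the file with `stringsAsFactors = FALSE` (the default from R 4.0.0) would make the explicit `as.character()` calls unnecessary.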
