January 8, 2023, 21:47:44

Upload text document in R

Question

I am trying to load several text documents into a data frame in R. My desired output is a table with two columns:

| DOCUMENT | CONTENT |
|:---|:---|
| Document A | This is the content. |
| Document B | This is the content. |
| Document C | This is the content. |

The "CONTENT" column should hold the full text of the corresponding document (a 10-K report).

    library(tm)  # provides Corpus() and DirSource()
    setwd("C:/Users/folder")
    folder <- getwd()
    # `pattern` is a regular expression, not a shell glob
    corpus <- Corpus(DirSource(directory = folder, pattern = "\\.txt$"))

This creates a corpus that I can tokenize, but I have not managed to convert it to a data frame or otherwise get my desired output.

Can somebody help me?

Answer 1

Score: 1

If you're only working with .txt files and your end goal is a data frame, then I think you can skip the corpus step and simply read all your files into a list. The hard part is getting the names of the .txt files into a column called DOCUMENT, but this can be done in base R.

    # make a reproducible example
    a <- "this is a test"
    b <- "this is a second test"
    c <- "this is a third test"
    write(a, "a.txt"); write(b, "b.txt"); write(c, "c.txt")
    # get working dir
    folder <- getwd()
    # get the locations of all .txt files (`pattern` is a regex, not a glob)
    filelist <- list.files(path = folder, pattern = "\\.txt$", full.names = TRUE)
    # read in the files and put them in a list (one character vector per file)
    lst <- lapply(filelist, readLines)
    names(lst) <- basename(filelist)
    # extract the names of the files without the `.txt` extension
    namelist <- tools::file_path_sans_ext(basename(filelist))
    # bind each file's lines to its original file name as a two-column matrix
    lst <- mapply(cbind, lst, "DOCUMENT" = namelist, SIMPLIFY = FALSE)
    # combine into a data frame
    x <- do.call(rbind.data.frame, lst)
    # a small amount of clean-up
    rownames(x) <- NULL
    names(x)[names(x) == "V1"] <- "CONTENT"
    x <- x[, c(2, 1)]
    x
    #>   DOCUMENT               CONTENT
    #> 1        a        this is a test
    #> 2        b this is a second test
    #> 3        c  this is a third test
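
One caveat: `readLines` returns one element per line, so a multi-line file such as a real 10-K report would give one row per line rather than one row per document. A minimal sketch of a variant that collapses each file into a single CONTENT string is below. The helper name `read_docs` and the demo directory are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): one row per file,
# with all lines of a file collapsed into a single CONTENT string.
read_docs <- function(folder) {
  # full.names = TRUE so the function also works outside the working directory
  files <- list.files(path = folder, pattern = "\\.txt$", full.names = TRUE)
  content <- vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(
    DOCUMENT = tools::file_path_sans_ext(basename(files)),
    CONTENT = content,
    stringsAsFactors = FALSE
  )
}

# Usage on a throwaway directory containing one multi-line file
demo_dir <- file.path(tempdir(), "read_docs_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(demo_dir, "a.txt"))
writeLines("this is a second test", file.path(demo_dir, "b.txt"))
docs <- read_docs(demo_dir)
nrow(docs)  # one row per file, not one row per line
```

Joining lines with `"\n"` keeps the original line breaks inside each cell. Use `collapse = " "` instead if you would rather have each document as one flat string for tokenizing.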
