Upload text documents into R

Question

I am trying to load several text documents into a data frame in R. My desired output has two columns:

| DOCUMENT | CONTENT |
|:---|:---|
| Document A | This is the content. |
| Document B | This is the content. |
| Document C | This is the content. |

The "CONTENT" column should contain all of the text from the corresponding text document (a 10-K report).

    library(tm)  # provides Corpus() and DirSource()

    setwd("C:/Users/folder")
    folder <- getwd()
    # `pattern` is a regular expression, so match the .txt extension explicitly
    corpus <- Corpus(DirSource(directory = folder, pattern = "\\.txt$"))

This creates a corpus, which I can tokenize. However, I have not managed to convert it into a data frame or to get my desired output.

Can somebody help me?

Answer 1

Score: 1

If you're only working with .txt files and your end goal is a data frame, you can skip the corpus step and simply read all of your files into a list. The hard part is getting the names of the .txt files into a column called DOCUMENT, but this can be done in base R.

    # make a reproducible example
    a <- "this is a test"
    b <- "this is a second test"
    c <- "this is a third test"
    write(a, "a.txt"); write(b, "b.txt"); write(c, "c.txt")
    # get working dir
    folder <- getwd()
    # get the full paths of all .txt files (`pattern` is a regular expression, not a glob)
    filelist <- list.files(path = folder, pattern = "\\.txt$", full.names = TRUE)
    # read in the files and put them in a list (one character vector of lines per file)
    lst <- lapply(filelist, readLines)
    # extract the names of the files without the directory and the `.txt` extension
    namelist <- sub("\\.txt$", "", basename(filelist))
    # bind every file's lines to a DOCUMENT column holding its original file name
    lst <- mapply(cbind, lst, "DOCUMENT" = namelist, SIMPLIFY = FALSE)
    # combine into a dataframe
    x <- do.call(rbind.data.frame, lst)
    # a small amount of clean-up
    rownames(x) <- NULL
    names(x)[names(x) == "V1"] <- "CONTENT"
    x <- x[, c(2, 1)]
    x
    #>   DOCUMENT               CONTENT
    #> 1        a        this is a test
    #> 2        b this is a second test
    #> 3        c  this is a third test
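
Note that `readLines` returns one element per line of a file, so the approach above yields one row per line of text rather than one row per file. For multi-line documents such as 10-K reports, a minimal sketch that collapses each file into a single CONTENT string could look like the following. The helper name `read_docs` and the temporary demo directory are assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical helper (the name is an assumption): read every .txt file in
# `folder` and return a data frame with one row per document.
read_docs <- function(folder) {
  # `pattern` is a regular expression, so anchor it on the .txt extension
  files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
  # collapse all lines of each file into a single newline-separated string
  content <- vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(
    DOCUMENT = tools::file_path_sans_ext(basename(files)),
    CONTENT = content,
    stringsAsFactors = FALSE
  )
}

# Demo in a dedicated temporary directory so no other .txt files interfere
demo_dir <- file.path(tempdir(), "read_docs_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(demo_dir, "a.txt"))
writeLines("only line", file.path(demo_dir, "b.txt"))

docs <- read_docs(demo_dir)
dim(docs)
```

Here `docs` has one row per file, and the two lines of `a.txt` end up joined by a newline in a single CONTENT cell. Passing `full.names = TRUE` means the helper does not depend on the working directory being the folder that holds the files.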

huangapple
  • Posted on January 8, 2023 at 21:47:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75048249.html