June 16, 2023, 11:32:58

How to load and split from a list of File objects

Question

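I'm creating a JavaScript app that has a drop area where you can drop files from your drive.
When files are dropped, I get an array of File objects.
Now I want to use a langchain document loader to load these files and then split them into chunks. This is the function I have so far: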

import { TextLoader } from 'langchain/document_loaders/fs/text'
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter'
import { Document } from 'langchain/document'

export async function IngestFiles (files) {
  if (files.length < 1) return

  console.log('files', files)

  const splitter = new RecursiveCharacterTextSplitter(
    { chunkSize: 100, chunkOverlap: 10 }
  )

  let documents = []
  files.forEach(async file => {
    const loader = new TextLoader(file)
    const doc = await loader.load()
    const docOutput = await splitter.splitDocuments([
      new Document({ pageContent: doc[0].pageContent })
    ])
    documents = documents.concat(docOutput)

    console.log('documents', documents)
  })

  console.log('result', documents)

  return documents
}

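I have added some console.log lines so I can see the intermediate steps:

As you can see, I added two small txt files. They are properly loaded and split into smaller Document objects, but the final result (the last console.log) is empty. I've tried everything, and all I can think of now is that this is related to async/await, but I can't see the issue.

Any help is appreciated.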

Answer 1

Score: 2

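I think this post answers your question: https://stackoverflow.com/a/70946414/9787476

As the post suggests, don't use forEach; use a for-of loop instead. forEach does not wait for the promises returned by an async callback, so the function logs and returns documents before any file has been loaded.

Also, is there a specific reason to use:

const docOutput = await splitter.splitDocuments([
  new Document({ pageContent: doc[0].pageContent })
])

instead of simply

const docOutput = await splitter.splitDocuments(doc)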
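
The for-of suggestion could look roughly like the sketch below. To keep it runnable without langchain installed, the loader factory and the splitter are passed in as parameters (`createLoader` and `splitter` are hypothetical parameters, not part of the original function), and `createStubLoader` and `stubSplitter` are made-up stand-ins for TextLoader and RecursiveCharacterTextSplitter that only imitate the shape of `load()` and `splitDocuments()`.

```javascript
// Sketch of the for...of fix. The loader factory and splitter are injected
// (an assumption for this example) so the control flow can be run on its own.
async function ingestFiles (files, createLoader, splitter) {
  if (files.length < 1) return []

  const documents = []
  // for...of runs inside this async function, so each await pauses the loop
  // itself. forEach would start the async callbacks and return immediately.
  for (const file of files) {
    const docs = await createLoader(file).load()
    // Pass the loaded Document array straight to the splitter.
    const chunks = await splitter.splitDocuments(docs)
    documents.push(...chunks)
  }
  return documents
}

// Stand-in for TextLoader (demo only): load() resolves to a single document.
const createStubLoader = file => ({
  load: async () => [
    { pageContent: file.text, metadata: { source: file.name } }
  ]
})

// Stand-in for the text splitter (demo only): cuts pageContent into
// pieces of at most 5 characters and copies the metadata onto each piece.
const stubSplitter = {
  splitDocuments: async docs => docs.flatMap(doc => {
    const chunks = []
    for (let i = 0; i < doc.pageContent.length; i += 5) {
      chunks.push({
        pageContent: doc.pageContent.slice(i, i + 5),
        metadata: doc.metadata
      })
    }
    return chunks
  })
}

// Plain objects standing in for the dropped File objects.
const demoFiles = [
  { name: 'a.txt', text: 'hello world' },
  { name: 'b.txt', text: 'foo' }
]

const demoResult = ingestFiles(demoFiles, createStubLoader, stubSplitter)
demoResult.then(documents => console.log('result', documents.length)) // logs: result 4
```

With the real libraries, the call would presumably be `ingestFiles(files, file => new TextLoader(file), splitter)`, using the splitter from the question. If the files do not need to be processed one after another, mapping them to promises and awaiting `Promise.all` is a common alternative that also waits for every file before returning.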
