May 23, 2023 00:13:50

llama_index get the document referenced from node_sources

Question

I'm getting good results with llama_index after indexing PDFs, but I can't tell which PDF an answer was drawn from. result.node_sources uses a doc ID that seems to be generated internally. How can I get a reference back to the source document?

Answer 1

Score: 4

I got this answer directly from the LlamaIndex team:

Thanks for the questions and for your support of LlamaIndex. There are a few general approaches you can take:

Inject metadata, such as the file name or a link, into the extra_info of each Document. Many LlamaHub loaders should already add metadata to extra_info automatically, but you can add or remove entries yourself if you'd like. This extra_info is propagated to each Node. When you get a response from a query engine, you can use response.source_nodes to fetch the relevant sources.
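
As a minimal sketch of the "inject metadata" step, the helper below builds a dict that identifies a source file. The name `file_metadata` and the keys `file_name` and `file_path` are chosen for illustration and are not from the original answer. How you attach the dict depends on your llama_index version, so that part is only described in comments.

```python
from pathlib import Path


def file_metadata(path: str) -> dict:
    """Build a metadata dict that identifies the source file (hypothetical helper)."""
    p = Path(path)
    return {"file_name": p.name, "file_path": str(p)}


# Assumption: the resulting dict is what you would pass as a Document's
# extra_info (called metadata in newer releases). Some loaders also accept
# a callable like this one to compute metadata per file; check the docs
# for the version you have installed.
print(file_metadata("docs/annual_report.pdf"))
```

Keeping the metadata small (a file name and maybe a page or URL) makes it easier to display next to each answer later.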

These sources will contain both the original text and the metadata. Take a look at this doc:
https://gpt-index.readthedocs.io/en/stable/core_modules/data_modules/documents_and_nodes/usage_documents.html
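
Reading the sources back could look roughly like the sketch below. It assumes each entry in response.source_nodes wraps a node and a score, and that the node carries its metadata under either `metadata` or the older `extra_info` name. `collect_sources` is a hypothetical helper, and the response here is a stand-in object so the snippet runs without llama_index installed.

```python
from types import SimpleNamespace


def collect_sources(response):
    """Return one dict per source node: its metadata plus the retrieval score."""
    sources = []
    for item in getattr(response, "source_nodes", None) or []:
        node = getattr(item, "node", item)
        # Newer releases use `metadata`; older ones used `extra_info`.
        meta = getattr(node, "metadata", None) or getattr(node, "extra_info", None) or {}
        sources.append({**meta, "score": getattr(item, "score", None)})
    return sources


# Stand-in for a real query response, for demonstration only.
fake_response = SimpleNamespace(source_nodes=[
    SimpleNamespace(
        node=SimpleNamespace(metadata={"file_name": "report.pdf", "page_label": "3"}),
        score=0.82,
    ),
    SimpleNamespace(
        node=SimpleNamespace(extra_info={"file_name": "notes.pdf"}),
        score=0.75,
    ),
])

for src in collect_sources(fake_response):
    print(src)
```

With a real response object you would call `collect_sources(response)` right after `query_engine.query(...)`.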

Assuming you add the appropriate metadata to the extra_info field, you can modify either the query string or the QA/refine prompts to say something like "Please cite sources along with your answer".

For the query string, you can simply append the instruction. For customizing prompts, take a look at https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_prompts.html
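
Both options can be sketched in plain Python. `with_citation_request` and `QA_TEMPLATE` are illustrative names, and the template wording is my own. The `{context_str}` and `{query_str}` placeholders are the ones the QA prompt is expected to use; how the template is wrapped and passed to the query engine varies by version, so see the custom prompts page linked above.

```python
CITE_INSTRUCTION = "Please cite sources along with your answer."


def with_citation_request(query: str) -> str:
    """Append the citation instruction to a query string, at most once."""
    query = query.strip()
    if CITE_INSTRUCTION in query:
        return query
    return f"{query}\n{CITE_INSTRUCTION}"


# A QA prompt that asks for citations (wording is an assumption).
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only the context above, answer the question and cite the "
    "file_name of each source you relied on.\n"
    "Question: {query_str}\n"
    "Answer: "
)

print(with_citation_request("What does the report conclude?"))
print(QA_TEMPLATE.format(
    context_str="file_name: report.pdf\nRevenue grew in 2022.",
    query_str="What does the report conclude?",
))
```

Citations produced this way are only as good as the metadata the model sees in its context, so it is still worth checking them against response.source_nodes.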

Answer 2

Score: 1

It seems that they changed 'extra_info' to 'metadata'.

I used this code and it works perfectly:

    import re

    # `response` is the object returned by the query engine
    if hasattr(response, 'metadata'):
        # Stringify the metadata and pull out the page/file pairs
        document_info = str(response.metadata)
        find = re.findall(r"'page_label': '[^']*', 'file_name': '[^']*'", document_info)

        print('\n' + '=' * 60 + '\n')
        print('Context Information')
        print(str(find))
        print('\n' + '=' * 60 + '\n')
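
The regex above depends on the exact key order in the stringified dict. A possibly more robust variant walks the dict directly. This is a sketch that assumes response.metadata maps node IDs to per-node metadata dicts with `file_name` and `page_label` keys, as the regex implies. `sources_from_metadata` is a hypothetical helper and the sample data is made up.

```python
def sources_from_metadata(metadata):
    """Return unique (file_name, page_label) pairs, in first-seen order."""
    seen = []
    for node_meta in (metadata or {}).values():
        if not isinstance(node_meta, dict):
            continue
        pair = (node_meta.get("file_name"), node_meta.get("page_label"))
        if pair != (None, None) and pair not in seen:
            seen.append(pair)
    return seen


# Made-up sample in the assumed shape: {node_id: {metadata...}}
sample = {
    "node-1": {"page_label": "3", "file_name": "report.pdf"},
    "node-2": {"file_name": "report.pdf", "page_label": "3"},
    "node-3": {"page_label": "7", "file_name": "notes.pdf"},
}

for file_name, page in sources_from_metadata(sample):
    print(f"{file_name} (page {page})")
```

With a real response you would pass `response.metadata` instead of `sample`.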
