Posted June 8, 2023, 06:02:34

How can I load scraped page content to langchain VectorstoreIndexCreator

Question

I have a function that visits a URL and crawls its content, including subpages. I then want to load the text content into LangChain's VectorstoreIndexCreator(). How can I do this with a loader? I could not find a suitable loader in langchain.document_loaders. Should I use BaseLoader for this, and if so, how?

My code:

import requests
from bs4 import BeautifulSoup

import openai
from langchain.document_loaders.base import Document
from langchain.indexes import VectorstoreIndexCreator


def get_company_info_from_web(company_url: str, max_crawl_pages: int = 10, questions=None):

    # visit the URL and collect the links found on the page
    links = get_links_from_page(company_url)

    # get_text_content_from_page visits each URL and yields (text, url) tuples
    for text, url in get_text_content_from_page(links[:max_crawl_pages]):
        # add text content (string) to index
        # loader????

    index = VectorstoreIndexCreator().from_documents([Document(page_content=content, metadata={"source": url})])

    # Finally, query the vector database:
    DEFAULT_QUERY = "What does the company do? Who are key people in this company? Can you tell me contact information?"
    query = questions or DEFAULT_QUERY
    logger.info(f"Query: {query}")
    result = index.query_with_sources(query)

    logger.info(f"Result:\n {result['answer']}")
    logger.info(f"Sources:\n {result['sources']}")

    return result['answer'], result['sources']
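As a hedged sketch of how the loop marked `loader????` could be filled in without any loader: since the text is already scraped, each (text, url) pair can be wrapped in a Document and collected into a list. The helper name `build_documents` and the sample data are hypothetical, and the fallback `Document` dataclass is only a stand-in so the sketch runs where LangChain is not installed or the import path from the question has moved.

```python
from dataclasses import dataclass, field

try:
    # Import path copied from the question; it may differ between LangChain versions.
    from langchain.document_loaders.base import Document
except ImportError:
    # Stand-in with the same two fields, used only so this sketch runs without LangChain.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def build_documents(pages):
    """Wrap (text, url) pairs in Document objects, skipping empty pages."""
    docs = []
    for text, url in pages:
        if text and text.strip():
            docs.append(Document(page_content=text, metadata={"source": url}))
    return docs


# Hypothetical sample data standing in for get_text_content_from_page(...)
sample_pages = [
    ("Acme builds rockets.", "https://example.com/about"),
    ("", "https://example.com/empty"),
    ("Contact: info@example.com", "https://example.com/contact"),
]
docs = build_documents(sample_pages)
print(len(docs))
```

The resulting list could then be passed to `VectorstoreIndexCreator().from_documents(docs)`, the same call the question already uses, so that every crawled page is indexed rather than only one.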

Answer 1

Score: 3

Yes, you can use WebBaseLoader, which uses BeautifulSoup behind the scenes to parse the data.

See the sample below:

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader(your_url)
scrape_data = loader.load()

You can load multiple web pages by passing a list of URLs, as shown below:

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader([your_url_1, your_url_2])
scrape_data = loader.load()

To load multiple web pages concurrently, you can use the aload() method:

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader([your_url_1, your_url_2])
scrape_data = loader.aload()  # <-------- here

If an asyncio event loop is already running, concurrent loading may fail with an error such as "nested event loop error" or "RuntimeError: This event loop is already running". You can resolve this with the nest_asyncio library, a patch that allows nested event loops. See the sample below:

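import nest_asyncio
from langchain.document_loaders import WebBaseLoader

# patch asyncio so an already-running event loop can be re-entered
nest_asyncio.apply()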

loader = WebBaseLoader([your_url_1, your_url_2])
scrape_data = loader.aload()
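To illustrate where the "already running" error comes from, here is a minimal standard-library sketch. It assumes a synchronous helper that drives the event loop internally with `run_until_complete`; `blocking_fetch` is a hypothetical stand-in for such a helper, and whether WebBaseLoader does exactly this depends on the LangChain version.

```python
import asyncio


def blocking_fetch():
    """Hypothetical synchronous helper that runs a coroutine to completion."""
    loop = asyncio.get_event_loop()
    coro = asyncio.sleep(0, result="done")
    try:
        return loop.run_until_complete(coro)
    except RuntimeError:
        # Close the unused coroutine to avoid a "never awaited" warning.
        coro.close()
        raise


async def main():
    # Inside a coroutine the loop is already running, so the nested call fails.
    try:
        blocking_fetch()
    except RuntimeError as exc:
        return str(exc)
    return "no error"


message = asyncio.run(main())
print(message)
```

This is the situation in environments that already run an event loop, such as a notebook or an async web framework, and it is the case the `nest_asyncio.apply()` call above is meant to work around.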
