How can I load scraped page content to langchain VectorstoreIndexCreator
Question
I have a function which goes to a URL and crawls its content (including subpages). Then I want to load the text content into langchain's VectorstoreIndexCreator(). How can I do it via a loader? I could not find any suitable loader in langchain.document_loaders. Should I use BaseLoader for it? How?
My code:
import logging
import requests
from bs4 import BeautifulSoup
import openai
from langchain.document_loaders.base import Document
from langchain.indexes import VectorstoreIndexCreator

logger = logging.getLogger(__name__)  # logger is used below but was never defined
def get_company_info_from_web(company_url: str, max_crawl_pages: int = 10, questions=None):
    # go to the URL and collect links
    links = get_links_from_page(company_url)
    # get_text_content_from_page visits each URL and yields (text, url) tuples
    for text, url in get_text_content_from_page(links[:max_crawl_pages]):
        # add text content (string) to index
        # loader????
        index = VectorstoreIndexCreator().from_documents([Document(page_content=text, metadata={"source": url})])
    # Finally, query the vector database:
    DEFAULT_QUERY = "What does the company do? Who are key people in this company? Can you tell me contact information?"
    query = questions or DEFAULT_QUERY
    logger.info(f"Query: {query}")
    result = index.query_with_sources(query)
    logger.info(f"Result:\n {result['answer']}")
    logger.info(f"Sources:\n {result['sources']}")
    return result['answer'], result['sources']
Answer 1
Score: 3
Yes, you can use the WebBaseLoader, which uses BeautifulSoup behind the scenes to parse the data.
See the sample below:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader(your_url)
scrape_data = loader.load()
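Note that loader.load() returns a list of langchain Document objects, each carrying the page text plus metadata (including the source URL), which is exactly what the question was building by hand:

# scrape_data is a list of Document objects produced by loader.load()
doc = scrape_data[0]
print(doc.metadata["source"])   # the URL the content came from
print(doc.page_content[:200])   # first 200 characters of the scraped text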
You can load multiple web pages by passing an array of URLs, like below:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader([your_url_1, your_url_2])
scrape_data = loader.load()
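Since the end goal is an index, you don't have to construct Document objects yourself at all: VectorstoreIndexCreator can consume loaders directly. A minimal sketch, assuming the from_loaders API from the same langchain.indexes module the question already imports:

from langchain.document_loaders import WebBaseLoader
from langchain.indexes import VectorstoreIndexCreator

# build the index straight from the loader; the creator calls loader.load() internally
loader = WebBaseLoader([your_url_1, your_url_2])
index = VectorstoreIndexCreator().from_loaders([loader])
result = index.query_with_sources("What does the company do?")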
And to load multiple web pages concurrently, you can use the aload() method:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader([your_url_1, your_url_2])
scrape_data = loader.aload() # <-------- here
You may encounter issues with concurrent loading if you already have a running asyncio event loop; it will throw a nested-event-loop error such as "RuntimeError: This event loop is already running". You can resolve this by using the nest_asyncio library, which is a patch that allows nested event loops. See the sample below:
import nest_asyncio
nest_asyncio.apply()
loader = WebBaseLoader([your_url_1, your_url_2])
scrape_data = loader.aload()
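As for the BaseLoader part of the question: if you would rather keep your own requests/BeautifulSoup crawler than switch to WebBaseLoader, you can wrap it in a small custom loader. A hedged sketch, assuming get_links_from_page and get_text_content_from_page are the question's own helpers (their code was not shown) and using a hypothetical class name:

from typing import List
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class CompanyCrawlLoader(BaseLoader):
    """Hypothetical loader wrapping the question's crawler helpers."""
    def __init__(self, company_url: str, max_crawl_pages: int = 10):
        self.company_url = company_url
        self.max_crawl_pages = max_crawl_pages

    def load(self) -> List[Document]:
        # get_links_from_page / get_text_content_from_page are the question's helpers
        links = get_links_from_page(self.company_url)
        return [
            Document(page_content=text, metadata={"source": url})
            for text, url in get_text_content_from_page(links[:self.max_crawl_pages])
        ]

Such a loader can then be passed to VectorstoreIndexCreator().from_loaders([...]) like any built-in loader.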