map_reduce not working as expected using langchain

Question
I am trying to extract information from a CSV file using LangChain and ChatGPT.
If I take just a few lines of the file and use the 'stuff' chain type, it works perfectly. But when I use the whole CSV with map_reduce, it fails on most of the questions.
My current code is the following:
queries = ["Tell me the name of every driver who is German",
           "how many german drivers are?",
           "which driver uses the number 14?",
           "which driver has the oldest birthdate?"]

import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())  # read local .env file

from langchain.document_loaders import CSVLoader
from langchain.callbacks import get_openai_callback
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

files = ['drivers.csv', 'drivers_full.csv']
for file in files:
    print("=====================================")
    print(file)
    print("=====================================")
    with get_openai_callback() as cb:
        loader = CSVLoader(file_path=file, encoding='utf-8')
        docs = loader.load()
        embeddings = OpenAIEmbeddings()
        # create the vector store to use as the index
        db = Chroma.from_documents(docs, embeddings)
        # expose this index in a retriever interface
        retriever = db.as_retriever(search_type="similarity",
                                    search_kwargs={"k": 1000, "score_threshold": "0.2"})
        for query in queries:
            qa_stuff = RetrievalQA.from_chain_type(
                llm=OpenAI(temperature=0, batch_size=20),
                chain_type="map_reduce",
                retriever=retriever,
                verbose=True
            )
            print(query)
            result = qa_stuff.run(query)
            print(result)
        print(cb)
It fails at answering how many German drivers there are, which driver uses number 14, and which driver has the oldest birthdate. Also, the cost is huge ($8!).
You can find the code here:
https://github.com/pablocastilla/langchain-embeddings/blob/main/langchain-embedding-full.ipynb
Answer 1

Score: 1
The way "map_reduce" works is that it first calls the LLM on each document (the "map" part), and then collects the answers from each call to produce a final answer (the "reduce" part). See the LangChain Map Reduce documentation.
LangChain's CSVLoader splits the CSV data source so that each row becomes a separate document. This means that if your CSV has 10,000 rows, it will call the OpenAI API 10,001 times (10,000 calls for the map step and 1 for the reduce step). Also, not all questions can be answered in a map-reduce fashion: questions like "how many" or "what is the largest" require aggregating over the whole dataset, which no single map call can see.
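To make the call count concrete, here is a plain-Python sketch of the map-reduce QA pattern. `fake_llm` and `map_reduce_qa` are purely illustrative stand-ins (not LangChain APIs): one model call per document, plus one final call to combine the partial answers.

```python
def fake_llm(prompt: str) -> str:
    # Stand-in for a real OpenAI call; just fabricates an "answer".
    return f"answer({len(prompt)})"

def map_reduce_qa(documents, question, llm=fake_llm) -> str:
    # Map step: ask the question against every document individually.
    partial_answers = [llm(f"{question}\n\n{doc}") for doc in documents]
    # Reduce step: one final call that combines all partial answers.
    combined = "\n".join(partial_answers)
    return llm(f"Combine these answers to '{question}':\n{combined}")

docs = [f"row {i}" for i in range(10000)]  # one document per CSV row

calls = {"n": 0}
def counting_llm(prompt: str) -> str:
    calls["n"] += 1
    return fake_llm(prompt)

map_reduce_qa(docs, "how many german drivers are there?", llm=counting_llm)
print(calls["n"])  # 10000 map calls + 1 reduce call
```

This prints 10001, which is why the cost explodes on a full-size CSV: every row pays for its own prompt, and the retriever's `k=1000` only caps how many rows reach the chain per query.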
I think you should use the "stuff" chain type instead. "gpt-3.5-turbo-16k" is a good fit: it supports a 16K-token context window and is much cheaper than the OpenAI completion model you chose.
Note that gpt-3.5-turbo-16k is a chat model, so you have to use ChatOpenAI instead of OpenAI.
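A minimal sketch of that change, assuming the same legacy `langchain` imports as in your notebook, a valid `OPENAI_API_KEY` in the environment, and the `retriever` already built from your Chroma index (untested here):

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Chat model with a 16K context window, so all retrieved rows fit in one prompt.
llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",   # one LLM call with all retrieved rows stuffed in
    retriever=retriever,  # the Chroma retriever from your existing code
    verbose=True,
)

result = qa.run("how many german drivers are?")
print(result)
```

With "stuff" you pay for a single (large) prompt per question instead of one call per row, and aggregation questions like "how many" work because the model sees all retrieved rows at once.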