map_reduce not working as expected using langchain

Question

I am trying to extract information from a CSV file using LangChain and ChatGPT.

If I take just a few lines of the CSV and use the 'stuff' method, it works perfectly. But when I use the whole CSV with map_reduce, it fails on most of the questions.

My current code is the following:

queries = ["Tell me the name of every driver who is German","how many german drivers are?",  "which driver uses the number 14?", "which driver has the oldest birthdate?"]

import os

from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv()) # read local .env file

from langchain.document_loaders import CSVLoader
from langchain.callbacks import get_openai_callback
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma

files = ['drivers.csv','drivers_full.csv']

for file in files:
    print("=====================================")
    print(file)
    print("=====================================")
    with get_openai_callback() as cb:

        loader = CSVLoader(file_path=file,encoding='utf-8')
        docs = loader.load()

        from langchain.embeddings.openai import OpenAIEmbeddings

        embeddings = OpenAIEmbeddings()

        # create the vectorestore to use as the index
        db = Chroma.from_documents(docs, embeddings)
        # expose this index in a retriever interface
        retriever = db.as_retriever(search_type="similarity", search_kwargs={"k":1000, "score_threshold":"0.2"})

        for query in queries:
            qa_stuff = RetrievalQA.from_chain_type(
                llm=OpenAI(temperature=0,batch_size=20), 
                chain_type="map_reduce", 
                retriever=retriever,
                verbose=True
            )

            print(query)
            result = qa_stuff.run(query)

            print(result)
            
        print(cb)

It fails to answer how many German drivers there are, which driver uses the number 14, and which driver has the oldest birthdate. The cost is also huge ($8!).

The code is available here:
https://github.com/pablocastilla/langchain-embeddings/blob/main/langchain-embedding-full.ipynb

Answer 1

Score: 1

The "map_reduce" chain type first calls the LLM on each Document (the "map" part) and then collects the answers from those calls to produce a final answer (the "reduce" part). See LangChain's Map Reduce chain type.

LangChain's CSVLoader splits the CSV data source so that each row becomes a separate document. This means that if your CSV has 10000 rows, it will call the OpenAI API 10001 times (10000 for map and 1 for reduce). In addition, not every question can be answered the map-reduce way: questions such as "How many" or "What is the largest" require aggregating over the whole data set, but each map call sees only a single row.
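The call count described above can be illustrated with a small simulation. This is a simplified sketch, not LangChain's actual implementation: `fake_llm`, `map_reduce`, and the sample rows are hypothetical stand-ins, and the "LLM" only counts how often it is invoked.

```python
import csv
import io

# Hypothetical sample data standing in for drivers.csv (not the real file).
SAMPLE_CSV = """name,nationality,number
Driver A,German,5
Driver B,Spanish,14
Driver C,German,6
"""


def fake_llm(prompt, counter):
    """Stand-in for an LLM call; it only counts invocations."""
    counter["calls"] += 1
    return f"partial answer for: {prompt[:40]!r}"


def map_reduce(rows, question, counter):
    # Map: one call per document. Each call sees a single row, so it
    # cannot count or compare across rows.
    partials = [fake_llm(f"{question}\n{row}", counter) for row in rows]
    # Reduce: one more call that combines the partial answers.
    return fake_llm("\n".join(partials), counter)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counter = {"calls": 0}
map_reduce(rows, "how many german drivers are?", counter)
print(counter["calls"], "calls for", len(rows), "rows")
```

With three rows the simulation makes four calls, which matches the "rows + 1" pattern; scaled to 10000 rows, that is where both the cost and the latency come from.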

I think you should use the "stuff" chain type instead. "gpt-3.5-turbo-16k" is a good fit: it supports a 16K context window and is also much cheaper than the model behind the OpenAI class you are using.
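Whether the rows actually fit into a 16K window can be estimated before sending anything. The sketch below is only a rough check under stated assumptions: the 4-characters-per-token ratio is a common rule of thumb rather than a real tokenizer, and `estimate_tokens`, `fits_in_context`, and the `reserve` value are hypothetical names and numbers chosen for illustration.

```python
CONTEXT_TOKENS = 16_000  # approximate window size quoted for gpt-3.5-turbo-16k
CHARS_PER_TOKEN = 4      # rough rule of thumb (assumption), not a tokenizer


def estimate_tokens(text):
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(csv_text, reserve=1_000):
    """Return True if the text plus a reserve for the prompt and the
    answer is likely to fit into the context window."""
    return estimate_tokens(csv_text) + reserve <= CONTEXT_TOKENS


print(fits_in_context("name,nationality\nDriver A,German\n"))
```

If the whole file does not fit, lowering the retriever's `k` so that only the most relevant rows are stuffed into the prompt is one option, although aggregation questions then see only part of the data.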

Note that gpt-3.5-turbo-16k is a chat model, so you have to use ChatOpenAI instead of OpenAI.
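A minimal sketch of that change, assuming the same 2023-era LangChain import paths used in the question. `build_stuff_qa` is a hypothetical helper name, and the imports are placed inside the function only so that the sketch can be loaded without LangChain installed.

```python
def build_stuff_qa(retriever, model_name="gpt-3.5-turbo-16k"):
    """Build a RetrievalQA chain that puts all retrieved rows into one prompt."""
    # Lazy imports: these module paths match the LangChain version
    # used in the question and may differ in newer releases.
    from langchain.chains import RetrievalQA
    from langchain.chat_models import ChatOpenAI

    # ChatOpenAI (not OpenAI) is required because this is a chat model.
    llm = ChatOpenAI(model_name=model_name, temperature=0)
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # a single LLM call over the retrieved documents
        retriever=retriever,
    )


# Usage (assumes `db` is the Chroma store from the question; k=50 is an
# arbitrary illustrative value, not a recommendation):
# qa = build_stuff_qa(db.as_retriever(search_kwargs={"k": 50}))
# print(qa.run("which driver uses the number 14?"))
```

With "stuff", each query costs one LLM call instead of one call per row, provided the retrieved rows fit in the context window.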

huangapple
  • Posted on June 12, 2023, 19:05:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76456041.html