Question


I am trying to extract information from a CSV file using LangChain and ChatGPT.

If I take just a few lines and use the 'stuff' method, it works perfectly. But when I use the whole CSV with map_reduce, it fails on most of the questions.

My current code is the following:

queries = ["Tell me the name of every driver who is German", "how many german drivers are?", "which driver uses the number 14?", "which driver has the oldest birthdate?"]

import os

from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv()) # read local .env file

from langchain.document_loaders import CSVLoader
from langchain.callbacks import get_openai_callback
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma

files = ['drivers.csv', 'drivers_full.csv']

for file in files:
    print("=====================================")
    print(file)
    print("=====================================")
    with get_openai_callback() as cb:

        loader = CSVLoader(file_path=file, encoding='utf-8')
        docs = loader.load()

        from langchain.embeddings.openai import OpenAIEmbeddings

        embeddings = OpenAIEmbeddings()

        # create the vector store to use as the index
        db = Chroma.from_documents(docs, embeddings)
        # expose this index in a retriever interface
        retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 1000, "score_threshold": "0.2"})

        for query in queries:
            qa_stuff = RetrievalQA.from_chain_type(
                llm=OpenAI(temperature=0, batch_size=20),
                chain_type="map_reduce",
                retriever=retriever,
                verbose=True
            )

            print(query)
            result = qa_stuff.run(query)

            print(result)

        print(cb)

It fails to answer how many German drivers there are, which driver uses the number 14, and which driver has the oldest birthdate. The cost is also huge ($8!).

The full code is available here:
https://github.com/pablocastilla/langchain-embeddings/blob/main/langchain-embedding-full.ipynb

Answer 1

Score: 1

"map_reduce" works by first calling the LLM on each Document (the "map" part) and then collecting the answers from those calls to produce a final answer (the "reduce" part). See the LangChain documentation on the Map Reduce chain type.

LangChain's CSVLoader splits the CSV data source so that each row becomes a separate document. This means that if your CSV has 10,000 rows and all of them reach the chain, it will call the OpenAI API about 10,001 times (10,000 for the map step and 1 for the reduce step). In addition, not every question can be answered in a map-reduce way: questions such as "How many" or "What is the largest" require aggregating data across rows, and each map call sees only a single row.
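The call count described above can be illustrated with a minimal sketch that does not use LangChain at all. `map_reduce_answer` and `CountingStubLLM` are hypothetical names invented for this illustration, the stub only counts calls instead of contacting a model, and the `driverId` and `nationality` fields are assumed column names. A real chain may also add extra "collapse" calls when the partial answers are too long to combine in one prompt, which this sketch ignores.

```python
from typing import Callable, List


def map_reduce_answer(question: str, documents: List[str], llm: Callable[[str], str]) -> str:
    """Simplified map-reduce question answering (illustration only)."""
    # Map step: one LLM call per document; each call sees a single row only.
    partial_answers = [
        llm(f"Context:\n{doc}\n\nQuestion: {question}") for doc in documents
    ]
    # Reduce step: one more call that combines the partial answers.
    combined = "\n".join(partial_answers)
    return llm(f"Partial answers:\n{combined}\n\nQuestion: {question}")


class CountingStubLLM:
    """Hypothetical stand-in for a real model; it only counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"stub answer {self.calls}"


# 1,000 one-row documents, similar in shape to what CSVLoader produces
# (the column names here are assumptions).
rows = [f"driverId: {i}\nnationality: German" for i in range(1000)]

llm = CountingStubLLM()
map_reduce_answer("how many german drivers are?", rows, llm)
print(llm.calls)  # 1001: 1000 map calls + 1 reduce call
```

Because no map call ever sees more than one row, a question like "how many" can only be answered if the reduce step manages to count the partial answers correctly, which is where such questions tend to go wrong.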

I think you should use the "stuff" chain type instead. "gpt-3.5-turbo-16k" is a good fit: it supports a 16K-token context window and is also much cheaper than the model you are using through the OpenAI class.

Note that gpt-3.5-turbo-16k is a chat model, so you have to use ChatOpenAI instead of OpenAI.
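A minimal sketch of that change, assuming the same LangChain version as the code in the question. `build_stuff_qa` is a hypothetical helper name, `db` is assumed to be the Chroma vector store built in the question, and the default `k=50` is an arbitrary assumption: with "stuff", all retrieved rows go into a single prompt, so `k` has to be small enough for them to fit in the 16K-token window.

```python
def build_stuff_qa(db, k=50):
    """Build a RetrievalQA chain that uses the "stuff" chain type with a chat model.

    `db` is assumed to be a LangChain vector store (e.g. the Chroma index from
    the question). `k` is the number of rows retrieved per query; 50 is an
    assumed value, not a recommendation from the answer.
    """
    # Imported inside the function so the sketch can be loaded even when
    # LangChain is not installed; normally these go at the top of the file.
    from langchain.chains import RetrievalQA
    from langchain.chat_models import ChatOpenAI

    # gpt-3.5-turbo-16k is a chat model, hence ChatOpenAI rather than OpenAI.
    llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)

    # Retrieve only the k most similar rows and put them in one prompt.
    retriever = db.as_retriever(search_kwargs={"k": k})

    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
    )
```

It would be used in place of the `RetrievalQA.from_chain_type(...)` call inside the loop, for example `qa = build_stuff_qa(db)` followed by `print(qa.run(query))`, which makes one model call per query instead of one per retrieved row.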
