LangChain ConversationalRetrieval with JSONloader

Question

I modified the data loader in this source code, <https://github.com/techleadhd/chatgpt-retrieval>, so that ConversationalRetrievalChain accepts JSON data.

I created a dummy JSON file that, as far as I can tell, matches the JSON structure described in the LangChain documentation:

{
  "reviews": [
    {"text": "Great hotel, excellent service and comfortable rooms."},
    {"text": "I had a terrible experience at this hotel. The room was dirty and the staff was rude."},
    {"text": "Highly recommended! The hotel has a beautiful view and the staff is friendly."},
    {"text": "Average hotel. The room was okay, but nothing special."},
    {"text": "I absolutely loved my stay at this hotel. The amenities were top-notch."},
    {"text": "Disappointing experience. The hotel was overpriced for the quality provided."},
    {"text": "The hotel exceeded my expectations. The room was spacious and clean."},
    {"text": "Avoid this hotel at all costs! The customer service was horrendous."},
    {"text": "Fantastic hotel with a great location. I would definitely stay here again."},
    {"text": "Not a bad hotel, but there are better options available in the area."}
  ]
}
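
As a quick sanity check on the file itself, the sketch below mimics what the jq expression .reviews[] combined with content_key='text' selects. It uses only Python's json module rather than JSONLoader, an assumption made so the snippet runs without LangChain or jq installed. The shortened inline sample and the helper name extract_texts are illustrative and not part of the original code.

```python
import json

# Shortened stand-in for data/review.json (three of the ten reviews).
SAMPLE = """
{
  "reviews": [
    {"text": "Great hotel, excellent service and comfortable rooms."},
    {"text": "Average hotel. The room was okay, but nothing special."},
    {"text": "Avoid this hotel at all costs! The customer service was horrendous."}
  ]
}
"""


def extract_texts(raw_json, list_key="reviews", content_key="text"):
    """Return one string per array element, similar to jq '.reviews[]' with content_key='text'."""
    data = json.loads(raw_json)
    return [item[content_key] for item in data[list_key]]


texts = extract_texts(SAMPLE)
print(len(texts))  # number of entries that would each become one document
for t in texts:
    print("-", t)
```

If the count printed here matches the number of entries in the file, the structure and the schema are probably fine, and the problem lies further down the pipeline.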

The code is:

import os
import sys

import openai
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from langchain.document_loaders import JSONLoader

os.environ["OPENAI_API_KEY"] = 'YOUR_API_KEY_HERE'

# Enable to save to disk & reuse the model (for repeated queries on the same data)
PERSIST = False

# Optional first question passed on the command line
query = None
if len(sys.argv) > 1:
  query = sys.argv[1]

if PERSIST and os.path.exists("persist"):
  print("Reusing index...\n")
  vectorstore = Chroma(persist_directory="persist", embedding_function=OpenAIEmbeddings())
  index = VectorStoreIndexWrapper(vectorstore=vectorstore)
else:
  # One document per entry in .reviews, taking its "text" field as the content
  loader = JSONLoader("data/review.json", jq_schema=".reviews[]", content_key='text')

  if PERSIST:
    index = VectorstoreIndexCreator(vectorstore_kwargs={"persist_directory":"persist"}).from_loaders([loader])
  else:
    index = VectorstoreIndexCreator().from_loaders([loader])

chain = ConversationalRetrievalChain.from_llm(
  llm=ChatOpenAI(model="gpt-3.5-turbo"),
  retriever=index.vectorstore.as_retriever()
)

# Interactive loop: keep the running history so follow-up questions have context
chat_history = []
while True:
  if not query:
    query = input("Prompt: ")
  if query in ['quit', 'q', 'exit']:
    sys.exit()
  result = chain({"question": query, "chat_history": chat_history})
  print(result['answer'])

  chat_history.append((query, result['answer']))
  query = None

Some examples of results are:

Prompt: can you summarize the data?
Sure! Based on the provided feedback, we have a mix of opinions about the hotels. One person found it to be an average hotel with nothing special, another person had a great experience with excellent service and comfortable rooms, another person was pleasantly surprised by a hotel that exceeded their expectations with spacious and clean rooms, and finally, someone had a disappointing experience with an overpriced hotel that didn't meet their expectations in terms of quality.

Prompt: how many feedbacks present in the data ?
There are four feedbacks present in the data.

Prompt: how many of them are positive (sentiment)?
There are four positive feedbacks present in the data.

Prompt: how many of them are negative?
There are three negative feedbacks present in the data.

Prompt: how many of them are neutral?
Two of the feedbacks are neutral.

Prompt: what is the last review you can see?
The most recent review I can see is: "The hotel exceeded my expectations. The room was spacious and clean."

Prompt: what is the first review you can see?
The first review I can see is "Highly recommended! The hotel has a beautiful view and the staff is friendly."

Prompt: how many total texts are in the JSON file?
I don't know the answer.

I can chat with my data, but every answer except the first one is wrong.

Is there a problem with JSONLoader or jq_schema? How can I adapt the code so that it produces the expected output?

Answer 1

Score: 3

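In ConversationalRetrievalChain, the search returns 4 documents by default; see top_k_docs_for_context: int = 4 in ../langchain/chains/conversational_retrieval/base.py. The model therefore only ever sees four of the ten reviews, which is why it reports four feedbacks.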

That default makes sense, because you usually don't want to send every stored document to the LLM (doing so also raises cost). Depending on the use case, you can raise the default to a more suitable value like this:

chain = ConversationalRetrievalChain.from_llm(
  llm=ChatOpenAI(model="gpt-3.5-turbo"),
  retriever=index.vectorstore.as_retriever(search_kwargs={"k": 10})
)

With this change, you get the following result:

{'question': 'how many feedbacks present in the data ?',
 'chat_history': [],
 'answer': 'There are 10 pieces of feedback present in the data.'}
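
To see why the reported count tracks k, here is a toy sketch of top-k retrieval. It ranks the ten reviews by simple word overlap with the query instead of real embeddings, and the names tokenize and top_k are hypothetical. Only the truncation to k items mirrors what the retriever does; this is not LangChain's implementation.

```python
import re

REVIEWS = [
    "Great hotel, excellent service and comfortable rooms.",
    "I had a terrible experience at this hotel. The room was dirty and the staff was rude.",
    "Highly recommended! The hotel has a beautiful view and the staff is friendly.",
    "Average hotel. The room was okay, but nothing special.",
    "I absolutely loved my stay at this hotel. The amenities were top-notch.",
    "Disappointing experience. The hotel was overpriced for the quality provided.",
    "The hotel exceeded my expectations. The room was spacious and clean.",
    "Avoid this hotel at all costs! The customer service was horrendous.",
    "Fantastic hotel with a great location. I would definitely stay here again.",
    "Not a bad hotel, but there are better options available in the area.",
]


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def top_k(query, docs, k=4):
    """Rank docs by word overlap with the query and keep only the best k."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


query = "how many hotel reviews are there?"
default_context = top_k(query, REVIEWS)       # what the model sees with k=4
full_context = top_k(query, REVIEWS, k=10)    # what it sees with k=10

print(len(default_context))
print(len(full_context))
```

With k=4 the model can only be shown four reviews, so any counting question is answered from that slice. Raising k to the number of documents works for a small file like this one, but it increases token usage as the data grows.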

huangapple
  • Posted on July 12, 2023 at 20:59:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76670856.html