July 12, 2023, 20:59:31


LangChain ConversationalRetrieval with JSONLoader

Question


I modified the data loader of this source code <https://github.com/techleadhd/chatgpt-retrieval> for ConversationalRetrievalChain to accept data as JSON.

I created a dummy JSON file; according to the LangChain documentation, it fits the JSON structure described there.

{
  "reviews": [
    {"text": "Great hotel, excellent service and comfortable rooms."},
    {"text": "I had a terrible experience at this hotel. The room was dirty and the staff was rude."},
    {"text": "Highly recommended! The hotel has a beautiful view and the staff is friendly."},
    {"text": "Average hotel. The room was okay, but nothing special."},
    {"text": "I absolutely loved my stay at this hotel. The amenities were top-notch."},
    {"text": "Disappointing experience. The hotel was overpriced for the quality provided."},
    {"text": "The hotel exceeded my expectations. The room was spacious and clean."},
    {"text": "Avoid this hotel at all costs! The customer service was horrendous."},
    {"text": "Fantastic hotel with a great location. I would definitely stay here again."},
    {"text": "Not a bad hotel, but there are better options available in the area."}
  ]
}
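
As a quick check of the structure, the selection that jq_schema=".reviews[]" with content_key='text' is expected to make can be emulated with the standard json module. This is only a sketch: it does not call JSONLoader itself, the helper name extract_texts is hypothetical, and the sample is shortened to three of the reviews above.

```python
import json
from typing import List

# Shortened sample with the same shape as data/review.json above.
SAMPLE = """
{
  "reviews": [
    {"text": "Great hotel, excellent service and comfortable rooms."},
    {"text": "Average hotel. The room was okay, but nothing special."},
    {"text": "Avoid this hotel at all costs! The customer service was horrendous."}
  ]
}
"""


def extract_texts(raw_json: str) -> List[str]:
    """Emulate jq '.reviews[]' followed by picking the 'text' key of each item."""
    data = json.loads(raw_json)
    return [review["text"] for review in data["reviews"]]


texts = extract_texts(SAMPLE)
print(len(texts))  # one entry per review object in the sample
for text in texts:
    print("-", text)
```

Applied to the full file, the same selection should yield ten entries, one per review, so each review becomes its own document.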

The full script is:

import os
import sys

from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import JSONLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.vectorstores import Chroma

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY_HERE"

# Enable to save to disk & reuse the model (for repeated queries on the same data)
PERSIST = False

# An optional first question can be passed on the command line.
query = None
if len(sys.argv) > 1:
    query = sys.argv[1]

if PERSIST and os.path.exists("persist"):
    print("Reusing index...\n")
    vectorstore = Chroma(persist_directory="persist", embedding_function=OpenAIEmbeddings())
    index = VectorStoreIndexWrapper(vectorstore=vectorstore)
else:
    # One document per element of "reviews", using its "text" field as the content.
    loader = JSONLoader("data/review.json", jq_schema=".reviews[]", content_key="text")
    if PERSIST:
        index = VectorstoreIndexCreator(vectorstore_kwargs={"persist_directory": "persist"}).from_loaders([loader])
    else:
        index = VectorstoreIndexCreator().from_loaders([loader])

chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    retriever=index.vectorstore.as_retriever(),
)

# Simple prompt loop: keep asking until the user quits, carrying the chat history along.
chat_history = []
while True:
    if not query:
        query = input("Prompt: ")
    if query in ["quit", "q", "exit"]:
        sys.exit()
    result = chain({"question": query, "chat_history": chat_history})
    print(result["answer"])
    chat_history.append((query, result["answer"]))
    query = None

Some examples of results are:

Prompt: can you summarize the data?
Sure! Based on the provided feedback, we have a mix of opinions about the hotels. One person found it to be an average hotel with nothing special, another person had a great experience with excellent service and comfortable rooms, another person was pleasantly surprised by a hotel that exceeded their expectations with spacious and clean rooms, and finally, someone had a disappointing experience with an overpriced hotel that didn't meet their expectations in terms of quality.
Prompt: how many feedbacks present in the data ?
There are four feedbacks present in the data.
Prompt: how many of them are positive (sentiment)?
There are four positive feedbacks present in the data.
Prompt: how many of them are negative?
There are three negative feedbacks present in the data.
Prompt: how many of them are neutral?
Two of the feedbacks are neutral.
Prompt: what is the last review you can see?
The most recent review I can see is: "The hotel exceeded my expectations. The room was spacious and clean."
Prompt: what is the first review you can see?
The first review I can see is "Highly recommended! The hotel has a beautiful view and the staff is friendly."
Prompt: how many total texts are in the JSON file?
I don't know the answer.

I can chat with my data, but apart from the first answer, all the answers are wrong.

Is there a problem with JSONLoader or jq_schema? How can I adapt the code so that it generates the expected output?

Answer 1

Score: 3


By default, the retriever behind ConversationalRetrievalChain returns only the 4 most similar documents (compare top_k_docs_for_context: int = 4 in ../langchain/chains/conversational_retrieval/base.py), so the model sees just 4 of your 10 reviews and cannot count them correctly.

That default makes sense, since you usually don't want to send every document to the LLM (which also raises the cost). Depending on the use case, you can raise the limit by passing search_kwargs to the retriever:

chain = ConversationalRetrievalChain.from_llm(
  llm=ChatOpenAI(model="gpt-3.5-turbo"),
  retriever=index.vectorstore.as_retriever(search_kwargs={"k": 10})
)

With this change, you get the following result:

{'question': 'how many feedbacks present in the data ?',
 'chat_history': [],
 'answer': 'There are 10 pieces of feedback present in the data.'}
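
The effect of k can be illustrated without LangChain or an API key. The sketch below is a toy stand-in for the retriever: it scores documents by word overlap with the query (an assumption made purely for illustration; the real chain ranks by embedding similarity) and returns the top k. The names overlap_score and top_k are hypothetical.

```python
from typing import List

# The ten review texts from data/review.json in the question.
REVIEWS = [
    "Great hotel, excellent service and comfortable rooms.",
    "I had a terrible experience at this hotel. The room was dirty and the staff was rude.",
    "Highly recommended! The hotel has a beautiful view and the staff is friendly.",
    "Average hotel. The room was okay, but nothing special.",
    "I absolutely loved my stay at this hotel. The amenities were top-notch.",
    "Disappointing experience. The hotel was overpriced for the quality provided.",
    "The hotel exceeded my expectations. The room was spacious and clean.",
    "Avoid this hotel at all costs! The customer service was horrendous.",
    "Fantastic hotel with a great location. I would definitely stay here again.",
    "Not a bad hotel, but there are better options available in the area.",
]


def overlap_score(query: str, doc: str) -> int:
    """Toy similarity: number of lowercase words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def top_k(query: str, docs: List[str], k: int = 4) -> List[str]:
    """Return the k best-scoring documents, mimicking search_kwargs={"k": k}."""
    ranked = sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)
    return ranked[:k]


question = "how many feedbacks present in the data ?"
context_default = top_k(question, REVIEWS)     # default of 4 documents
context_all = top_k(question, REVIEWS, k=10)   # every review is passed along
print(len(context_default), len(context_all))  # prints: 4 10
```

Whatever the similarity function, the model can only count the documents placed in its prompt, which is consistent with the "four feedbacks" answer in the question when k is left at its default.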
