Generating Q&As from Cyrillic languages with Deepset Haystack

Question

I'm trying to generate questions and answers based on an uploaded text.

I'm using the open-source library Haystack by Deepset to do that, as it works great with English texts.

However, with Cyrillic texts like Russian I get chopped words in the generated questions.

I train the model with the Russian SberQUAD dataset. Then I try to generate Q&As from the poem Ruslan and Ludmila by Alexander Pushkin.

The answers seem mostly OK, but the questions are really a mix of syllables.

Here is my code:

# Imports assume Haystack 1.x; document_store and data_dir are defined elsewhere
# (e.g. an InMemoryDocumentStore and the folder holding the SberQUAD JSON files).
from haystack.nodes import TextConverter, PreProcessor, QuestionGenerator, FARMReader
from haystack.pipelines import QuestionAnswerGenerationPipeline
from haystack.utils import print_questions
from tqdm.auto import tqdm

converter = TextConverter(remove_numeric_tables=False, valid_languages=["ru"])
doc = converter.convert(file_path='pushkins.txt', meta=None)[0]

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)

docs_default = preprocessor.process([doc])
document_store.write_documents(docs_default)
question_generator = QuestionGenerator()
reader = FARMReader(model_name_or_path='cointegrated/rubert-tiny', use_gpu=True)
reader.train(data_dir=data_dir, train_filename="train-v1.1.json", dev_filename="dev-v1.1.json", use_gpu=True, batch_size=16, n_epochs=1, save_dir=data_dir) 

qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)
output_data = []

for idx, document in enumerate(tqdm(document_store)):
    print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
    result = qag_pipeline.run(documents=[document])
    output_data.append(result)
    print_questions(result)
    print("---")

The output is just a mix of Russian syllables and English words:

Generated pairs:
 - Q: What is какат а доер мое?
      A: свои
      A: дети
      A: согласен Скакать за дочерь
 - Q: What удет не нарасен?
      A: подвиг
      A: моих
      A: княжной
 - Q: What was орестн ени?
      A: княжной
      A: княжной
      A: полцарством
 - Q: What did оскликнули свои седлаем?
      A: Сейчас коней
      A: жены
      A: «Я!» — молвил горестный жених. «Я! я!» — воскликнули с Рогдаем Фарла
 - Q: рад вес иедит мир.
      A: «Сейчас коней своих седлаем; Мы рады
      A: ъ
      A: лцарством прадедов моих

Answer 1

Score: 2

QuestionGenerator uses valhalla/t5-base-e2e-qg as its default model.

Since you're using FARMReader with cointegrated/rubert-tiny, you must use a compatible model for QuestionGenerator. Compatibility here is only a matter of the model's language.

question_generator = QuestionGenerator(model_name_or_path='nbroad/mt5-base-qgen')
reader = FARMReader(model_name_or_path='cointegrated/rubert-tiny', use_gpu=True)
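
For context, here is a minimal sketch of how the corrected pipeline could be wired up, assuming Haystack 1.x import paths; the model names come from the answer above, while the sample Document content (a line from the poem's prologue) is just illustrative:

from haystack import Document
from haystack.nodes import QuestionGenerator, FARMReader
from haystack.pipelines import QuestionAnswerGenerationPipeline
from haystack.utils import print_questions

# Russian-capable question generator instead of the English-only default
question_generator = QuestionGenerator(model_name_or_path='nbroad/mt5-base-qgen')
reader = FARMReader(model_name_or_path='cointegrated/rubert-tiny', use_gpu=True)
qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)

# Quick check on a single Russian document before running over the whole store
doc = Document(content="У лукоморья дуб зелёный; Златая цепь на дубе том...")
result = qag_pipeline.run(documents=[doc])
print_questions(result)

The reader fine-tuned on SberQUAD in the question can stay as it is; only the question generator's model needs to handle Russian.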
