Generating Q&As from Cyrillic languages with Deepset Haystack

Question

I'm trying to generate questions and answers based on an uploaded text.

I'm using the open-source library Haystack by Deepset to do that, as it works great with English texts.

However, with Cyrillic texts like Russian I get chopped words in the generated questions.

I train the model with the Russian SberQUAD dataset. Then I try to generate Q&As from the poem Ruslan and Ludmila by Alexander Pushkin.

The answers seem mostly OK, but the questions are really a mix of syllables.

Here is my code:

# Imports assume Haystack 1.x; document_store and data_dir are defined elsewhere
# (e.g. an InMemoryDocumentStore and the folder holding the SberQUAD JSON files).
from haystack.nodes import TextConverter, PreProcessor, QuestionGenerator, FARMReader
from haystack.pipelines import QuestionAnswerGenerationPipeline
from haystack.utils import print_questions
from tqdm.auto import tqdm

converter = TextConverter(remove_numeric_tables=False, valid_languages=["ru"])
doc = converter.convert(file_path='pushkins.txt', meta=None)[0]

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)

docs_default = preprocessor.process([doc])
document_store.write_documents(docs_default)
question_generator = QuestionGenerator()
reader = FARMReader(model_name_or_path='cointegrated/rubert-tiny', use_gpu=True)
reader.train(data_dir=data_dir, train_filename="train-v1.1.json", dev_filename="dev-v1.1.json", use_gpu=True, batch_size=16, n_epochs=1, save_dir=data_dir) 

qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)
output_data = []

for idx, document in enumerate(tqdm(document_store)):
    print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
    result = qag_pipeline.run(documents=[document])
    output_data.append(result)
    print_questions(result)
    print("---")

The output is just a mix of Russian syllables and English words:

Generated pairs:
 - Q: What is какат а доер мое?
      A: свои
      A: дети
      A: согласен Скакать за дочерь
 - Q: What удет не нарасен?
      A: подвиг
      A: моих
      A: княжной
 - Q: What was орестн ени?
      A: княжной
      A: княжной
      A: полцарством
 - Q: What did оскликнули свои седлаем?
      A: Сейчас коней
      A: жены
      A: «Я!» — молвил горестный жених. «Я! я!» — воскликнули с Рогдаем Фарла
 - Q: рад вес иедит мир.
      A: «Сейчас коней своих седлаем; Мы рады
      A: ъ
      A: лцарством прадедов моих

Answer 1

Score: 2

QuestionGenerator uses valhalla/t5-base-e2e-qg as its default model.

Since you're using FARMReader with cointegrated/rubert-tiny, you must use a compatible model for QuestionGenerator. Compatibility here is only a matter of the model's language.

question_generator = QuestionGenerator(model_name_or_path='nbroad/mt5-base-qgen')
reader = FARMReader(model_name_or_path='cointegrated/rubert-tiny', use_gpu=True)
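
For context, here is a minimal sketch of how the corrected pipeline could be wired up, assuming Haystack 1.x import paths; the model names come from the answer above, while the sample Document content (a line from the poem's prologue) is just illustrative:

from haystack import Document
from haystack.nodes import QuestionGenerator, FARMReader
from haystack.pipelines import QuestionAnswerGenerationPipeline
from haystack.utils import print_questions

# Russian-capable question generator instead of the English-only default
question_generator = QuestionGenerator(model_name_or_path='nbroad/mt5-base-qgen')
reader = FARMReader(model_name_or_path='cointegrated/rubert-tiny', use_gpu=True)
qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)

# Quick check on a single Russian document before running over the whole store
doc = Document(content="У лукоморья дуб зелёный; Златая цепь на дубе том...")
result = qag_pipeline.run(documents=[doc])
print_questions(result)

The reader fine-tuned on SberQUAD in the question can stay as it is; only the question generator's model needs to handle Russian.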
