Generating Q&As from Cyrillic languages with Deepset Haystack
Question
I'm trying to generate questions and answers based on an uploaded text.
I'm using the open-source library Haystack by Deepset to do this, as it works great with English texts.
However, with Cyrillic texts such as Russian I get chopped words in the generated questions.
I train the model on the Russian SberQUAD dataset and then try to generate Q&As from the poem Ruslan and Ludmila by Alexander Pushkin.
The answers seem mostly OK, but the questions are really a mix of syllables.
Here is my code:
# Haystack 1.x imports (document_store and data_dir are defined earlier and not shown in this snippet)
from haystack.nodes import TextConverter, PreProcessor, QuestionGenerator, FARMReader
from haystack.pipelines import QuestionAnswerGenerationPipeline
from haystack.utils import print_questions
from tqdm import tqdm

# Convert the Russian source text and split it into 100-word passages
converter = TextConverter(remove_numeric_tables=False, valid_languages=["ru"])
doc = converter.convert(file_path='pushkins.txt', meta=None)[0]

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)
docs_default = preprocessor.process([doc])
document_store.write_documents(docs_default)

# Default question generator plus a Russian reader fine-tuned on SberQUAD (SQuAD-format files)
question_generator = QuestionGenerator()
reader = FARMReader(model_name_or_path='cointegrated/rubert-tiny', use_gpu=True)
reader.train(data_dir=data_dir, train_filename="train-v1.1.json", dev_filename="dev-v1.1.json",
             use_gpu=True, batch_size=16, n_epochs=1, save_dir=data_dir)

qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)

output_data = []
for idx, document in enumerate(tqdm(document_store)):
    print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
    result = qag_pipeline.run(documents=[document])
    output_data.append(result)
    print_questions(result)
    print("---")
The output is just a mix of Russian syllables and English words:
Generated pairs:
- Q: What is какат а доер мое?
A: свои
A: дети
A: согласен Скакать за дочерь
- Q: What удет не нарасен?
A: подвиг
A: моих
A: княжной
- Q: What was орестн ени?
A: княжной
A: княжной
A: полцарством
- Q: What did оскликнули свои седлаем?
A: Сейчас коней
A: жены
A: «Я!» — молвил горестный жених. «Я! я!» — воскликнули с Рогдаем Фарла
- Q: рад вес иедит мир.
A: «Сейчас коней своих седлаем; Мы рады
A: ъ
A: лцарством прадедов моих
Answer 1
Score: 2
QuestionGenerator uses valhalla/t5-base-e2e-qg as the default model. Since you're using FARMReader with cointegrated/rubert-tiny, you must use a compatible model for QuestionGenerator. Compatibility in this case is only a matter of the model's language.
question_generator = QuestionGenerator(model_name_or_path='nbroad/mt5-base-qgen')
reader = FARMReader(model_name_or_path='cointegrated/rubert-tiny', use_gpu=True)
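For completeness, here is a minimal sketch of how this fix might be wired into the rest of the pipeline from the question (assuming Haystack 1.x and that document_store has already been populated as in the question's code):

# Sketch only: the question's pipeline with the multilingual mT5 generator
# (nbroad/mt5-base-qgen, from the answer) so questions are generated in Russian.
from haystack.nodes import QuestionGenerator, FARMReader
from haystack.pipelines import QuestionAnswerGenerationPipeline
from haystack.utils import print_questions

question_generator = QuestionGenerator(model_name_or_path='nbroad/mt5-base-qgen')
reader = FARMReader(model_name_or_path='cointegrated/rubert-tiny', use_gpu=True)

qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)
for document in document_store:  # document_store populated as in the question
    result = qag_pipeline.run(documents=[document])
    print_questions(result)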