Python ProcessPoolExecutor issue

I'm trying to process PDF files in a folder using tesseract. The function works fine when I run it on the files synchronously, but when I try to use ProcessPoolExecutor, nothing happens. I would truly appreciate some help or insight.

System details:

  • Ubuntu 22.04, i5, Python 3.10.6 (venv), VS Code

Code:

    # scrape text from PDFs and store the content in files for NLP analysis
    # tried both camelot and tabular; neither package could scrape the required table contents
    # this script implements OCR using tesseract
    from glob import glob
    import pytesseract
    from concurrent.futures import ProcessPoolExecutor
    from concurrent.futures import as_completed
    from pdf2image import convert_from_path as pdf2img
    import pathlib as pl
    import multiprocessing as mpc

    def ProcessPDF(par_FilePath):
        lstImages = pdf2img(par_FilePath)
        intImgs = len(lstImages)
        strOCRd = ''
        for it, im in enumerate(lstImages):
            npg = '='*50+f'Pg:{it+1}'+'='*50+'\n'  # end each page
            pgText = pytesseract.image_to_string(im)  # perform OCR
            strOCRd += pgText + '\n' + npg  # add to string
            print(f'Processing: {pl.Path(par_FilePath).name} : {int(it/intImgs*100)}%')
        fStem = pl.Path(par_FilePath).stem
        fDir = str(pl.Path(par_FilePath).parent)+'/'
        with open(fDir + fStem + '.txt', 'w') as fobj:  # save file
            fobj.write(strOCRd)
        return f'Completed: {pl.Path(par_FilePath).name}'

    if __name__ == '__main__':
        strFolderPDF = r'/home/*****/proj/rfp_model/pdfFiles/'
        lstFiles = glob(strFolderPDF+'*.pdf')
        numFiles = len(lstFiles)
        numCPUs = mpc.cpu_count()
        print(f'Starting pool executor, processing {numFiles} files with {numCPUs} workers.')
        with ProcessPoolExecutor(max_workers=numCPUs) as ppe:
            ftrResults = ppe.map(ProcessPDF, lstFiles)
            for ftrResult in as_completed(ftrResults):
                print(ftrResult)
        # this works
        # for ipath in lstFiles:
        #     ProcessPDF(ipath)

Current Outcome:

  • When run in the debugger, I can see each of the threads spin up in the call stack, but then nothing. There is no activity in the console. The system monitor shows each of my 4 CPUs at 100%. I experimented with 1 and 2 workers, but with no success. I waited 10 minutes before hitting interrupt.

Expected Outcome:

  • I should see 4 processes kick off and print statements in the console as the function works through each page of each PDF.

Answer 1

Score: 1

ProcessPoolExecutor's map function does not return Future objects. It returns an iterator over the return values, in the same order as the inputs.

Change:

    with ProcessPoolExecutor(max_workers=numCPUs) as ppe:
        ftrResults = ppe.map(ProcessPDF, lstFiles)
        for ftrResult in as_completed(ftrResults):
            print(ftrResult)

...to...

    with ProcessPoolExecutor(max_workers=numCPUs) as ppe:
        ftrResults = ppe.map(ProcessPDF, lstFiles)
        for ftrResult in ftrResults:
            print(ftrResult)
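
If results are wanted in completion order rather than input order, one possible alternative is to pair submit() with as_completed(), since submit() does return Future objects. The following is only a sketch: fake_ocr is a hypothetical stand-in for ProcessPDF, and run_with_submit and its executor_cls parameter are names invented here for illustration, not part of the original code.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def fake_ocr(path):
    # Hypothetical stand-in for ProcessPDF: just returns a status string.
    return f'Completed: {path}'


def run_with_submit(func, items, max_workers=2, executor_cls=ProcessPoolExecutor):
    """Submit one task per item and collect results in completion order."""
    results = []
    with executor_cls(max_workers=max_workers) as pool:
        # submit() returns a Future; map each Future back to its input
        # so a failure can be reported against the right item.
        futures = {pool.submit(func, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:
                # An exception raised in the worker is re-raised by result().
                results.append(f'Failed: {item}: {exc!r}')
    return results


if __name__ == '__main__':
    for line in run_with_submit(fake_ocr, ['a.pdf', 'b.pdf', 'c.pdf']):
        print(line)
```

Calling future.result() inside a try block also surfaces exceptions raised in a worker, which otherwise stay hidden until the result is requested.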

Your current code will raise an AttributeError, because as_completed expects Future objects and map yields plain return values.
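
The mismatch can be reproduced on a small scale. The sketch below assumes CPython's standard concurrent.futures implementation and uses ThreadPoolExecutor only to keep the demonstration lightweight; status and try_as_completed_on_map are hypothetical names introduced here. It reports the type of exception raised when the output of map() is passed to as_completed().

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def status(name):
    # Hypothetical stand-in for ProcessPDF; returns a plain string.
    return f'Completed: {name}'


def try_as_completed_on_map(names):
    """Return the name of the exception raised by as_completed(map(...)), if any."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # map() yields return values (strings here), not Future objects.
        results = pool.map(status, names)
        try:
            for _ in as_completed(results):
                pass
        except Exception as exc:
            return type(exc).__name__
    return None


if __name__ == '__main__':
    print(try_as_completed_on_map(['a.pdf', 'b.pdf']))
```

In this sketch the exception only appears once as_completed() has consumed the map() iterator, so with long-running tasks the error may show up late rather than immediately.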

huangapple
  • Published on March 12, 2023 at 14:29:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75711427.html