March 12, 2023 14:29:51

Python ProcessPoolExecutor issue

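Question

I'm trying to OCR the PDF files in a folder using tesseract. The function works fine when I run it on the files synchronously, but when I try to use ProcessPoolExecutor, nothing happens. I would truly appreciate some help or insight.

System details:

Ubuntu 22.04, i5, Python 3.10.6 (venv), VS Code

Code:

# Scrape text from PDFs and store the content in files for NLP analysis
# Tried both camelot and tabular; neither package could extract the required table contents
# This script performs OCR using tesseract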
from glob import glob 
import pytesseract
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import as_completed
from pdf2image import convert_from_path as pdf2img
import pathlib as pl
import multiprocessing as mpc
def ProcessPDF(par_FilePath):
    lstImages = pdf2img(par_FilePath)
    intImgs = len(lstImages)
    strOCRd = ''
    for it, im in enumerate(lstImages):
        npg = '='*50 + f'Pg:{it+1}' + '='*50 + '\n'  # separator marking the end of each page
        pgText = pytesseract.image_to_string(im)  # perform OCR
        strOCRd += pgText + '\n' + npg  # add to string
        print(f'Processing: {pl.Path(par_FilePath).name} : {int(it/intImgs*100)}%')
    fStem  = pl.Path(par_FilePath).stem
    fDir =  str(pl.Path(par_FilePath).parent) + '/'
    
    with open(fDir + fStem + '.txt', 'w') as fobj:  # save file
        fobj.write(strOCRd)
    return f'Completed: {pl.Path(par_FilePath).name}'
if __name__ == '__main__':
    strFolderPDF = r'/home/*****/proj/rfp_model/pdfFiles/'
    lstFiles = glob(strFolderPDF + '*.pdf')
    numFiles = len(lstFiles)
    numCPUs = mpc.cpu_count()
    print(f'Starting pool executor, processing {numFiles} files with {numCPUs} workers.')
    with ProcessPoolExecutor(max_workers=numCPUs) as ppe:
        ftrResults = ppe.map(ProcessPDF, lstFiles)
    
        for ftrResult in as_completed(ftrResults):
            print(ftrResult)
    # this works
    # for ipath in lstFiles:
    #    ProcessPDF(ipath)

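Current outcome:

When run in the debugger, I can see each of the threads spin up in the call stack, but then nothing happens. There is no activity in the console. The system monitor shows each of my 4 CPUs hitting 100%. I experimented with 1 and 2 workers, without success. I waited 10 minutes before interrupting.

Expected outcome:

I should see 4 processes start and print statements to the console as the function works through each page of each PDF.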

Answer 1

Score: 1

ProcessPoolExecutor's map function does not return Future objects; it returns an iterator over the results themselves.

Change:

with ProcessPoolExecutor(max_workers=numCPUs) as ppe:
    ftrResults = ppe.map(ProcessPDF, lstFiles)
    for ftrResult in as_completed(ftrResults):
        print(ftrResult)

to:

with ProcessPoolExecutor(max_workers=numCPUs) as ppe:
    ftrResults = ppe.map(ProcessPDF, lstFiles)
    for ftrResult in ftrResults:
        print(ftrResult)

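Your current code will raise an AttributeError, because as_completed expects Future objects rather than the plain results that map yields.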
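
If the results are wanted in completion order (which the use of as_completed in the question suggests), one option is to call submit() instead of map(), since submit() does return Future objects. The following is a minimal sketch under stated assumptions, not the original poster's code: process_file is a hypothetical placeholder standing in for ProcessPDF, and the names run_in_completion_order and executor_cls are illustrative only.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path


def process_file(file_path):
    # Hypothetical placeholder for the OCR work done by ProcessPDF.
    return f'Completed: {Path(file_path).name}'


def run_in_completion_order(file_paths, executor_cls=ProcessPoolExecutor, max_workers=2):
    """Submit one task per file and collect results as each task finishes."""
    results = []
    with executor_cls(max_workers=max_workers) as executor:
        # submit() returns a Future; remember which path each Future belongs to.
        future_to_path = {executor.submit(process_file, p): p for p in file_paths}
        # as_completed() yields each Future as soon as its task is done.
        for future in as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results.append(future.result())
            except Exception as exc:
                # An exception raised in the worker is re-raised by result().
                results.append(f'Failed: {Path(path).name} ({exc!r})')
    return results


if __name__ == '__main__':
    for line in run_in_completion_order(['a.pdf', 'b.pdf', 'c.pdf']):
        print(line)
```

The order of the returned list depends on which task finishes first, so it can differ between runs. With map(), by contrast, results come back in input order, which is usually simpler when completion order does not matter.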
