Extremely slow POST times
Question
I'm working on a "PDF manager" that takes in a PDF file and parses its sentences. The PDFs are stored in object storage and the sentences in a DB.
My problem is that POST requests are taking extremely long. This was not an issue before, and even with version control I could not find what exactly is causing it.
I'm using Azure SQL DB for the database and Azure object storage. I initially thought the DB was the issue, but I tried running the same thing against a local DB and got a connection timeout because of how long it was taking.
The API method:
// (POST) API to post a PDF to the database and parse its sentences
[HttpPost]
[Authorize]
public async Task<ActionResult<List<PDF>>> PostPDF(List<IFormFile> Files, bool? WithImages, bool? WithTxtFile)
{
    List<PDF> ListPDF = new List<PDF>();
    bool _WithImages = WithImages ?? false;
    bool _WithTxtFile = WithTxtFile ?? false;
    foreach (var file in Files)
    {
        if (System.IO.Path.GetExtension(file.FileName) == ".pdf") // Ensure the file being sent is a PDF
        {
            var FileInBytes = ManipulatorPDF.LoadBytePDF(file); // Byte array representing the PDF
            PdfDocument FileLoader = ManipulatorPDF.LoadPDF(file);
            List<Sentences> Sentences = ManipulatorPDF.GetSentences(FileLoader, _WithImages);
            // Create a new instance of a PDF with all the required parameters.
            // Save the file instance and sentences in Azure blob storage using AzureServices.SaveFile()
            string SentencesInTxt = (_WithTxtFile)
                ? await AzureServices.SaveFile(ManipulatorPDF.SentencesToText(Sentences), file.FileName.Substring(0, file.FileName.Length - 4) + "_Sentence.txt", "sentences-container")
                : "This file was not posted with 'WithTxtFile' enabled";
            PDF FileInstance = new PDF(file.FileName, file.Length, FileLoader.Pages.Count, Sentences, await AzureServices.SaveFile(FileInBytes, file.FileName, "pdf-container"), SentencesInTxt);
            DB.Add(FileInstance); // Add an instance of the PDF class to the database
            ListPDF.Add(FileInstance); // Add the instance to a list
        }
        else
        {
            return BadRequest("Bad request: Only PDFs are accepted. File(s) sent is not a PDF"); // Return bad request message if a non-PDF is sent
        }
    }
    Cache.Remove("ListPDF"); // Unload the cache after a new post
    await DB.SaveChangesAsync();
    return ListPDF; // Return the list of PDFs representing a successful POST
}
There are two methods here that I initially thought were causing the issue. The first parses the sentences and places them into a text file, which is then stored in object storage. But I made that optional through the WithTxtFile boolean, and even when that is false the issue persists. The other is WithImages, which runs OCR on the images in the PDF. But again, even when that is turned off through the WithImages boolean, the issue persists.
I'm sure it's not due to the sentence parsing logic since I haven't changed it since starting development, and the issue only started recently.
I also think that the code reaches await DB.SaveChangesAsync() and that a new row is actually created, considering I'm getting the following message in the terminal:
info: Microsoft.EntityFrameworkCore.Database.Command[20101]
Executed DbCommand (58ms) [Parameters=[@p0='?' (Size = 4000), @p1='?' (DbType = Double), @p2='?' (Size = 4000), @p3='?' (DbType = Int32), @p4='?' (Size = 4000), @p5='?' (DbType = DateTime2)], CommandType='Text', CommandTimeout='30']
SET IMPLICIT_TRANSACTIONS OFF;
SET NOCOUNT ON;
INSERT INTO [PDFs] ([FileLink], [FileSize], [Name], [NumberOfPages], [SentencesLinkTxt], [TimeOfUpload])
OUTPUT INSERTED.[id]
From my understanding, this indicates that a row has been created and populated with the required values. But I'm not sure on this point; I could be wrong.
Moreover, the PDFs I'm uploading are rather small. One is 950 KB and contains ~5000 words. Even this takes a while when using Azure storage, and it hits a DB timeout when using a local DB. The only file for which the process finishes successfully is a 3 KB file that contains nothing but basic text and no images.
Any help would be extremely appreciated!
Answer 1
Score: 1
Parsing a file for text and images is not something I would recommend doing within the scope of a request. While you have flags set to potentially skip these steps, you are still calling code like ManipulatorPDF.GetSentences even when the flags are set to false.
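One way to check which call actually dominates the request time is to wrap each stage in a Stopwatch and log the elapsed time. The following is only a minimal sketch, assuming .NET 6+ top-level statements: Timed is a hypothetical helper, and the Thread.Sleep calls are stand-ins for the real ManipulatorPDF calls, whose implementation is not shown in the question.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper (not part of the question's code): runs one stage,
// prints how long it took, and returns the stage's result.
static T Timed<T>(string stage, Func<T> work)
{
    var stopwatch = Stopwatch.StartNew();
    T result = work();
    stopwatch.Stop();
    Console.WriteLine($"{stage}: {stopwatch.ElapsedMilliseconds} ms");
    return result;
}

// Stand-ins for the real work; in the controller these lambdas would call
// ManipulatorPDF.LoadBytePDF(file), ManipulatorPDF.GetSentences(...), and so on.
byte[] fileInBytes = Timed("LoadBytePDF", () => { Thread.Sleep(20); return new byte[1024]; });
int sentenceCount = Timed("GetSentences", () => { Thread.Sleep(200); return 42; });

Console.WriteLine($"{fileInBytes.Length} bytes, {sentenceCount} sentences");
```

If GetSentences turns out to account for most of the time even with both flags off, that would support moving it out of the request.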
When doing something like parsing an uploaded file, the work should ideally be treated as an asynchronous background process. The upload request saves the file to temporary storage, creates a unique identifier key, and sends that key back to the client immediately. A job is then added to a queue, to be picked up by a background worker that goes through the submitted jobs and processes the uploads. The web client gets back the key along with a current status such as "Submitted" or "Pending Processing", and can periodically make another request to the server with that key to get the current status. As the background worker moves through the queue and reaches that job, the status can be updated to "Processing", then finally to "Complete" or "Error" with any available details.
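The flow described above can be sketched as a small console program. This is an illustration under stated assumptions, not a drop-in implementation: JobStatus, UploadJob, and Submit are hypothetical names, the Task.Delay stands in for the real PDF parsing, and the in-memory dictionary would normally be a database table. In an ASP.NET Core app, the worker loop would typically live in a hosted BackgroundService rather than a Task.Run.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Channels;
using System.Threading.Tasks;

// In-memory job state and queue (assumption: a real app would persist both).
var statuses = new ConcurrentDictionary<Guid, JobStatus>();
var queue = Channel.CreateUnbounded<UploadJob>();

// Background worker: drains the queue and updates each job's status.
Task worker = Task.Run(async () =>
{
    await foreach (UploadJob job in queue.Reader.ReadAllAsync())
    {
        statuses[job.Id] = JobStatus.Processing;
        try
        {
            await Task.Delay(100); // placeholder for parsing the stored PDF
            statuses[job.Id] = JobStatus.Complete;
        }
        catch (Exception)
        {
            statuses[job.Id] = JobStatus.Error;
        }
    }
});

// What the POST action would do: record the job, enqueue it, return the key.
Guid Submit(string storedFilePath)
{
    var id = Guid.NewGuid();
    statuses[id] = JobStatus.Submitted;
    queue.Writer.TryWrite(new UploadJob(id, storedFilePath));
    return id;
}

Guid key = Submit("temp/report.pdf");
Console.WriteLine($"Key returned to the client: {key}");

// What the client's periodic status request would observe.
while (statuses[key] is not (JobStatus.Complete or JobStatus.Error))
{
    await Task.Delay(25);
}
Console.WriteLine($"Final status: {statuses[key]}");

queue.Writer.Complete(); // no more jobs: lets the worker loop finish
await worker;

enum JobStatus { Submitted, Processing, Complete, Error }
record UploadJob(Guid Id, string StoredFilePath);
```

The point of the design is that Submit returns as soon as the job is queued, so the POST response time no longer depends on how long the parse takes.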
The issue with heavyweight jobs like this is not just that the job might not complete within the time of a single request, depending on the size and complexity of the uploaded file. Several users could also kick off this request at around the same time. That can quickly cause problems with memory use, CPU load, or licensing limitations of the libraries used to parse the PDF, to name a few.
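To illustrate the concurrency concern, here is a minimal sketch of capping how many parses run at once with a SemaphoreSlim. The limit of 2 and the ParseAsync name are assumptions made for the example, and the Task.Delay again stands in for the real parsing work.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Assumption for illustration: allow at most 2 parses at the same time.
var parseSlots = new SemaphoreSlim(2);
int running = 0;

// Hypothetical parse routine: waits for a free slot before doing the heavy work.
async Task ParseAsync(int uploadId)
{
    await parseSlots.WaitAsync();
    int now = Interlocked.Increment(ref running);
    try
    {
        Console.WriteLine($"upload {uploadId}: parsing ({now} running)");
        await Task.Delay(100); // placeholder for CPU- and memory-heavy parsing
    }
    finally
    {
        Interlocked.Decrement(ref running);
        parseSlots.Release(); // free the slot even if parsing throws
    }
}

// Six uploads arrive together; the semaphore keeps at most two parsing at a time.
await Task.WhenAll(Enumerable.Range(1, 6).Select(ParseAsync));
Console.WriteLine("all uploads processed");
```

A single background worker reading from a queue gives a similar guarantee with a limit of one; the semaphore is just a way to raise that limit deliberately.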