How to speed up this loop reading PDF files in .NET 6

Question

I have this method SearchPdf(string path, string keyword), where path is the folder that contains all the PDF files to search and keyword is the keyword to look for in each PDF's text or file name.
I'm using Spire.Pdf to read the PDFs.

Here is the method:

public static ConcurrentBag<KeyValuePair<string, string>> SearchPdf(string path, string keyword)
{
    var results = new ConcurrentBag<KeyValuePair<string, string>>();

    var directory = new DirectoryInfo(path);
    var files = directory.GetFiles("*.pdf", SearchOption.AllDirectories);

    Parallel.ForEach(files, file =>
    {
        // Open the PDF file
        var document = new PdfDocument(file.FullName);
        Console.WriteLine("\n\rRicerca per: " + keyword + " in file: " + file.Name + "\n\r");

        // Iterate over the pages of the document
        for (int i = 0; i < document.Pages.Count; i++)
        {
            // Extract the text of the page
            var page = document.Pages[i];
            var text = page.ExtractText();

            // Search for the keyword
            keyword = keyword.ToLower().Trim();
            if (text.ToLower().Contains(keyword) || file.Name.ToLower().Trim().Contains(keyword) || file.FullName.ToLower().Trim().Contains(keyword))
            {
                results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
            }
        }
    });

    return results;
}

Everything works fine, but when I have more than 200 keywords to search for and more than 1500 files, it's a bit slow. Is there anything I can do to optimize this loop?

Answer 1

Score: 1

> I have more than 200 keywords

And you are loading all the PDFs again for every single one of them. I think it would be much more efficient to load each file once and check it against all the keywords:

public static ConcurrentBag<KeyValuePair<string, string>> SearchPdf(string path, string[] keywords)
{
    //...
    Parallel.ForEach(files, file =>
    {
        // ...
        for (int i = 0; i < document.Pages.Count; i++)
        {
            foreach (var keyword in keywords)
            {
                // search for keyword and add it to the results    
            }
        }
    });
    // ...  
}

The next thing you can try is breaking out of the page/keyword search early. Since you only care whether a keyword is found in the file, not on which page, stop searching for a keyword once it has been found (and stop reading pages once all keywords have been found), for example by maintaining a local HashSet of found keywords.
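The two suggestions above (one pass per file, early exit) could be combined roughly as follows. This is only a sketch: `MultiKeywordSearch`, `Search` and the `readPages` delegate are names made up for illustration, and `readPages` stands in for whatever the PDF library does to produce the text of each page, so the example stays independent of Spire.Pdf. It tracks the keywords not yet found rather than the ones found, which makes the "all found" check a simple count.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class MultiKeywordSearch
{
    // readPages is a hypothetical stand-in for the PDF library: given a file path,
    // it yields the text of each page (ideally lazily, one page at a time).
    public static ConcurrentBag<KeyValuePair<string, string>> Search(
        IEnumerable<string> filePaths,
        IReadOnlyCollection<string> keywords,
        Func<string, IEnumerable<string>> readPages)
    {
        var results = new ConcurrentBag<KeyValuePair<string, string>>();

        Parallel.ForEach(filePaths, filePath =>
        {
            // Keywords not yet found in this file. The set is local to this
            // iteration, so it needs no locking.
            var remaining = new HashSet<string>(keywords, StringComparer.OrdinalIgnoreCase);

            foreach (var pageText in readPages(filePath))
            {
                // Collect the hits first: a HashSet cannot be modified while
                // it is being enumerated.
                var found = remaining
                    .Where(k => pageText.Contains(k, StringComparison.OrdinalIgnoreCase))
                    .ToList();

                foreach (var keyword in found)
                {
                    remaining.Remove(keyword);
                    results.Add(new KeyValuePair<string, string>(keyword, filePath));
                }

                // Every keyword has been found: skip the remaining pages.
                if (remaining.Count == 0)
                    break;
            }
        });

        return results;
    }
}
```

With Spire.Pdf, `readPages` would presumably be an iterator that opens the document and yields `document.Pages[i].ExtractText()` for each page, as in the question; written with `yield return`, pages after the early `break` are never extracted.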

Then optimize the search itself (as suggested in the comments): there is no need to create a bunch of strings with ToLower and put extra pressure on the GC.

Instead of

keyword = keyword.ToLower().Trim();
if (text.ToLower().Contains(keyword) || file.Name.ToLower().Trim().Contains(keyword) || file.FullName.ToLower().Trim().Contains(keyword))
{
    results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
}

just use:

if (text.Contains(keyword, StringComparison.OrdinalIgnoreCase) || file.Name.Contains(keyword, StringComparison.OrdinalIgnoreCase) || file.FullName.Contains(keyword, StringComparison.OrdinalIgnoreCase))
{
    results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
}
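One detail worth noting: the replacement no longer calls Trim() on the keyword, which the original did on every page. If the keywords may carry surrounding whitespace, they can be cleaned up once before the loop instead. The helper below is a minimal sketch; `KeywordPreparation` and `Normalize` are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordPreparation
{
    // Trims each keyword once, drops empty entries, and removes duplicates
    // that differ only by case, so the search loop does no string clean-up.
    public static string[] Normalize(IEnumerable<string> rawKeywords) =>
        rawKeywords
            .Select(k => k.Trim())
            .Where(k => k.Length > 0)
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToArray();
}
```

Calling `KeywordPreparation.Normalize(keywords)` once at the top of SearchPdf keeps the per-page check down to the plain `Contains(keyword, StringComparison.OrdinalIgnoreCase)` call shown above. Dropping empty entries also matters because `Contains("")` is always true.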

Also consider performing the file name and full file name checks before the full-text search (possibly before the file or its pages are loaded at all).
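A sketch of that pre-check for the multi-keyword case is below. `PathPrefilter` and `Split` are hypothetical names. Since FileInfo.FullName ends with FileInfo.Name, checking the full path alone covers both of the original conditions.

```csharp
using System;
using System.Collections.Generic;

public static class PathPrefilter
{
    // Splits the keywords into those already matched by the file path and
    // those that can only be resolved by reading the PDF text.
    public static (List<string> MatchedByPath, List<string> NeedTextSearch) Split(
        string fullPath, IEnumerable<string> keywords)
    {
        var matched = new List<string>();
        var pending = new List<string>();

        foreach (var keyword in keywords)
        {
            if (fullPath.Contains(keyword, StringComparison.OrdinalIgnoreCase))
                matched.Add(keyword);
            else
                pending.Add(keyword);
        }

        return (matched, pending);
    }
}
```

Inside the Parallel.ForEach body, the matched keywords can be added to the results straight away, and the PDF only needs to be opened when NeedTextSearch is not empty.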

Answer 2

Score: 1

  1. If you are only interested in the file name, you should stop processing pages after the first occurrence.
  2. Do not open the PDF and extract its text if the keyword already matches the file name.
  3. Use StringComparison.OrdinalIgnoreCase to compare strings instead of calling ToLower.

public static ConcurrentBag<KeyValuePair<string, string>> SearchPdf(string path, string keyword)
{
    var results = new ConcurrentBag<KeyValuePair<string, string>>();

    var directory = new DirectoryInfo(path);
    var files = directory.GetFiles("*.pdf", SearchOption.AllDirectories);

    Parallel.ForEach(files, file =>
    {
        if(file.Name.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0 || 
           file.FullName.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
        {
            results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
        }
        else 
        {
            // Open the PDF file
            var document = new PdfDocument(file.FullName);
            Console.WriteLine("\n\rRicerca per: " + keyword + " in file: " + file.Name + "\n\r");

            // Iterate over the pages of the document
            for (int i = 0; i < document.Pages.Count; i++)
            {
                // Extract the text of the page
                var page = document.Pages[i];
                var text = page.ExtractText();

                if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
                    break;
                }
            }
        }
    });

    return results;
}

huangapple
  • Posted on January 9, 2023, 16:24:42
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75054699.html