Does the size of an Azure Search index impact performance/accuracy?
Question
I have 1,000 PDFs (200 pages each).
I need to add each PDF to an Azure Search index as small text chunks with relevant metadata (for example, 200 chunks per PDF).
I have already hit the limit of 50 search indexes, after which I am prompted to delete some indexes.
I am now planning to use a single search index and add all 1,000 PDFs to it. I then plan to use filtering in the .search method to filter by PDF name.
Does this plan have a downside?
Answer 1
Score: 1
Using a single large index for all documents related to a given application is the right approach here. As you mentioned, you can use a filterable field holding the document name to scope searches. This also has the advantage that if you ever need to search across all or multiple documents, it'll just work.
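As a rough illustration of scoping a search by document name, here is a minimal sketch of building the OData filter string. The field name `pdf_name` and the helper names are hypothetical, and `client` is assumed to be an `azure.search.documents.SearchClient` (or anything exposing a compatible `search` method); the index is assumed to define `pdf_name` as a filterable string field.

```python
def odata_literal(value: str) -> str:
    """Quote a string for an OData filter; single quotes are escaped by doubling."""
    return "'" + value.replace("'", "''") + "'"


def filter_for_pdf(pdf_name: str, field: str = "pdf_name") -> str:
    """Filter expression matching chunks from exactly one PDF.

    `pdf_name` as a field name is an assumption; use whatever filterable
    field your index actually defines.
    """
    return f"{field} eq {odata_literal(pdf_name)}"


def filter_for_pdfs(pdf_names, field: str = "pdf_name", delimiter: str = "|") -> str:
    """Filter expression matching chunks from any of several PDFs via search.in."""
    names = list(pdf_names)
    for name in names:
        if delimiter in name:
            raise ValueError(f"PDF name contains the delimiter {delimiter!r}: {name!r}")
    joined = delimiter.join(names)
    return f"search.in({field}, {odata_literal(joined)}, {odata_literal(delimiter)})"


def search_in_pdf(client, query: str, pdf_name: str, top: int = 5):
    """Run a query restricted to one PDF.

    `client` is assumed to behave like azure.search.documents.SearchClient.
    """
    return client.search(search_text=query, filter=filter_for_pdf(pdf_name), top=top)
```

Leaving the filter out entirely searches across all 1,000 PDFs, which is the flexibility a single index buys you.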
In terms of downsides:
- At the volumes you're describing (~200K pages), the size of the index won't be a problem at all. If you had hundreds of millions or billions of pages, there's a point where partitioning the index would be better, but that's more work and I would avoid it unless you expect to hit that kind of data volume.
- If the documents have very different data distribution properties, for example very different vocabulary, then mixing them could cause some statistical oddities during scoring (e.g. a term that's rare in one document might be common in the rest of the documents, skewing the statistics).
I'm listing downsides for completeness. From the description of your scenario, a single index will most likely be just fine.
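To make the scoring caveat concrete, here is a small sketch using the BM25-style inverse document frequency formula. It assumes BM25-like scoring and assumes term statistics are computed over the whole index rather than over the filtered subset; the chunk counts are made up purely for illustration.

```python
import math


def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """BM25-style inverse document frequency: rarer terms get a higher weight."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


# Hypothetical numbers: a term appears in 2 of the 200 chunks of one PDF,
# but in 150,000 of the 199,800 chunks belonging to all the other PDFs.
idf_alone = bm25_idf(total_docs=200, docs_with_term=2)
idf_mixed = bm25_idf(total_docs=200_000, docs_with_term=150_002)

print(f"IDF if the PDF had its own index: {idf_alone:.2f}")
print(f"IDF in the combined index:        {idf_mixed:.2f}")
```

Under these assumptions, a term that would be a strong signal inside that one PDF carries much less weight in the combined index, even when the query is filtered to that PDF. For documents with broadly similar vocabulary the effect should be small.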