Enterprise Search Capacity Planning

Sizing hardware and software requirements for enterprise search deployments can be kind of a crapshoot. I usually use a quick formula to give a Level 0 estimate of required storage capacity:

Storage (GB) = Doc Corpus Size (GB) + Multiplier × Doc Corpus Size (GB)

where Multiplier is a rough guess at index expansion based on instinct after taking a look at the content to be indexed. I usually use 2.5.
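
As a quick sanity check, the formula can be sketched in a few lines of Python. The function name and the 50GB corpus are made up for illustration; only the formula and the 2.5 default come from the text above.

```python
def estimate_storage_gb(corpus_gb: float, multiplier: float = 2.5) -> float:
    """Level 0 estimate: the raw corpus plus the index it expands into."""
    if corpus_gb < 0 or multiplier < 0:
        raise ValueError("corpus size and multiplier must be non-negative")
    return corpus_gb + multiplier * corpus_gb


# Hypothetical 50GB corpus at a few different multipliers.
for m in (2.5, 8, 10):
    print(f"multiplier {m}: {estimate_storage_gb(50, m):.0f} GB")
```

Running it for several multipliers side by side makes it obvious how much the final number depends on that one guess.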

Except that sometimes 2.5 turns out to be more like 8, and that turns 100GB into 400GB, which in turn causes issues for any client trying to order hardware in the modern world. To better estimate the value of Multiplier, we've put together a quick Lucene indexer that works on a subset of the corpus:

http://solrhack.wordpress.com/downloads/utilities/indextools.tar
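
Once a sample has been indexed, the measured Multiplier is just the index size divided by the sample size. The sketch below is not part of the indexer linked above; it assumes the sample documents and the resulting index each live in their own directory, and the function names are hypothetical.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Total size of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def measured_multiplier(sample_dir: str, index_dir: str) -> float:
    """Index size divided by the size of the sample it was built from."""
    sample = dir_size_bytes(sample_dir)
    if sample == 0:
        raise ValueError("sample directory is empty")
    return dir_size_bytes(index_dir) / sample
```

A call such as measured_multiplier("sample_docs", "sample_index") (both paths are placeholders) gives a number to plug into the formula in place of the instinct-based 2.5. How well it holds for the full corpus depends on how representative the sample is.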

This helps in capacity planning and in technical requirements specification:
– High Multiplier values should set off an alarm.
– Is the content binary data heavy?
– Is that binary data being tokenized by the analyzer?
– What about numeric content?
– Multiple languages?
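
Two of these questions can be roughed out on a sample before any indexing happens. The sketch below uses simple heuristics of my own choosing, not anything from the indexer: a NUL byte near the start of a file is treated as a sign of binary data, and any whitespace-separated token containing a digit counts as numeric. It does not address the multi-language question.

```python
def looks_binary(data: bytes, probe: int = 8192) -> bool:
    """Heuristic: a NUL byte in the first `probe` bytes suggests binary data."""
    return b"\x00" in data[:probe]


def numeric_token_share(text: str) -> float:
    """Fraction of whitespace-separated tokens that contain at least one digit."""
    tokens = text.split()
    if not tokens:
        return 0.0
    numeric = sum(1 for tok in tokens if any(ch.isdigit() for ch in tok))
    return numeric / len(tokens)
```

A high numeric share, or a lot of files flagged as binary, is a hint that the Multiplier will come in well above the usual guess.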

Take a look at the doc in the link below. What tokens would users want searchable and, conversely, what tokens would contribute to index expansion but not to the value of the search experience?

http://sandp.ecnext.com/coms2/summary_0260-235663_ITM

A recent integration project, requiring an index of half a million docs like this one, resulted in index expansion close to 10X. By removing numeric content in the Analyzer, it was possible to bring the content and index size back to earth. We ended up with only 2X index expansion and an overall storage savings of almost 300GB, with little to no impact on the users' perception of relevancy.

rs='[$%]{0,1}?\d*?[.,]{0,1}?\d[.,]{0,1}\d*[%]*'
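
To see what the pattern above catches, here is a minimal Python sketch that drops any whitespace-separated token the pattern matches in full. The pattern is copied as-is; applying it with fullmatch to whitespace tokens is an assumption for illustration and not necessarily how it was wired into the Lucene Analyzer.

```python
import re

# Pattern copied from the post. Using fullmatch on whitespace-split tokens
# is an assumption made for this sketch.
NUMERIC = re.compile(r"[$%]{0,1}?\d*?[.,]{0,1}?\d[.,]{0,1}\d*[%]*")


def strip_numeric_tokens(text: str) -> str:
    """Return text with every fully numeric token removed."""
    kept = [tok for tok in text.split() if not NUMERIC.fullmatch(tok)]
    return " ".join(kept)


print(strip_numeric_tokens("Revenue rose 45% to $12.50 in 2009"))
# prints "Revenue rose to in"
```

Used this way, the pattern allows at most two separators, and only one digit between them, so a token like 1,234.56 does not match as a whole and would survive the filter. Tokens that merely contain digits, such as Q3, are kept as well.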
