Solr StatsComponent

July 10, 2009

Solr’s StatsComponent calculates some basic stats across the values of numeric fields in the result set. To test drive this feature, I took a few real estate transactions …

Address              | List Price | Type     | Bedrooms | Bath | Sales Price | Days on Market
58 Hilighton Avenue  | $264,900   | Colonial | 3        | 1.1  | $266,000    | 173
168 Barbetett Avenue | $265,000   | Colonial | 3        | 2.1  | $240,000    | 290
28 Welseley Street   | $275,000   | Colonial | 3        | 1    | $287,000    | 11
35 Norwallk Avenue   | $475,000   | Colonial | 3        | 1.1  | $487,000    | 8
15 Franklins Place   | $479,000   | Colonial | 4        | 1.1  | $450,000    | 13

processed and sent them to Solr for indexing and executed this query.

http://localhost:8080/solr/select?q=*:*&stats=true&stats.field=list&stats.field=purchase&rows=0&indent=true

Easy. Here are the results.

May Home Sales
Min: 264900.0
Max: 735000.0
Sum: 6744900.0
Count: 13
Mean: 518838.4
StdDev: 162831.3
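For anyone scripting this rather than pasting URLs into a browser, here is a minimal Python sketch of the same request. It assumes the JSON response writer (`wt=json`) instead of the default XML, and the helper names `stats_url` and `field_stats` are made up for illustration. The sample dict reuses the figures printed above, filed under the `list` field purely as an example.

```python
from urllib.parse import urlencode


def stats_url(base, fields, query="*:*"):
    """Build a StatsComponent request; stats.field is repeated once per field."""
    params = [("q", query), ("stats", "true")]
    params += [("stats.field", name) for name in fields]
    params += [("rows", "0"), ("wt", "json")]
    # safe="*:" keeps the match-all query readable instead of percent-encoding it
    return base + "?" + urlencode(params, safe="*:")


def field_stats(response, field):
    """Pull one field's stats out of a parsed JSON response."""
    return response["stats"]["stats_fields"][field]


# Sample shaped like a StatsComponent JSON response (figures from this post;
# putting them under "list" is an assumption made for the example).
SAMPLE = {
    "stats": {
        "stats_fields": {
            "list": {"min": 264900.0, "max": 735000.0, "sum": 6744900.0, "count": 13}
        }
    }
}

print(stats_url("http://localhost:8080/solr/select", ["list", "purchase"]))
print(field_stats(SAMPLE, "list")["sum"])
```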

It’d be nice to be able to specify calculations in the query. For example, to find the average income per agent in the month of May (let’s say the town has 100 agents sharing a 6% commission):

http://localhost:8080/solr/select?q=*:*&stats=true&stats.field=purchase&stats.agentincome=(stats.sum*0.06/100)

Average Agent Income: $4046.0
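Until Solr can do that, the arithmetic has to happen on the client. A rough Python sketch, assuming the Sum printed earlier (6744900.0) stands in for the purchase total; the function name and defaults are mine, not anything Solr provides.

```python
def average_agent_income(total_sales, commission_rate=0.06, agents=100):
    """Commission pool split evenly: total_sales * commission_rate / agents."""
    return total_sales * commission_rate / agents


# 6744900.0 is the Sum from the stats output above (an assumption: the
# post does not say which field that sum belongs to).
income = average_agent_income(6744900.0)
print("Average Agent Income: $%.2f" % income)
```

With that sum the result comes out at $4046.94, in line with the figure quoted above.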

This is cool but kind of basic stuff. The real challenge to Business Intelligence will come when we can start answering questions like:

  • Q: Which types of homes had a median sales price > 500K in May? A: Tudors, Victorians, not Capes ..

set finding.
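Done client-side, that kind of question is a group-by plus a median. A small Python sketch, using only the five rows from the table at the top of this post (all Colonials, so nothing clears 500K); the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median

# (type, sales price) pairs taken from the table in this post
SALES = [
    ("Colonial", 266000),
    ("Colonial", 240000),
    ("Colonial", 287000),
    ("Colonial", 487000),
    ("Colonial", 450000),
]


def types_with_median_above(sales, threshold):
    """Return the home types whose median sales price exceeds the threshold."""
    by_type = defaultdict(list)
    for home_type, price in sales:
        by_type[home_type].append(price)
    return sorted(t for t, prices in by_type.items() if median(prices) > threshold)


print(types_with_median_above(SALES, 500000))
print(types_with_median_above(SALES, 250000))
```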


FRCP Compliance questions

July 9, 2009

“Is [ product X ] FRCP compliant?” …

a common question from a common architect. Unfortunately, I was stuck without an answer.

Here’s my draft response for future reference.

"""
Amendments to the Federal Rules of Civil Procedure, 2006, mandate the recovery and hold of electronically stored information during legal discovery. The amendments don’t, however, define a rule or feature set that can be used to establish a clear path for an industry certification program. In short, FRCP Certification does not exist. Any ediscovery product must be reviewed against the specific legal context in which it is employed, and parties should be able to demonstrate due diligence in choosing a particular information retrieval product from a vendor.
"""

In support, here are a few cases in ediscovery. This will be an open thread and mostly for my reference.

1. Bray & Gillespie v. Lexington

Past rulings can help rule out specific products. For example, ‘Extractiva’ was used to convert Word and PDF docs to TIFF files, losing metadata critical to the details of the case. Oops.

2. Victor Stanley, Inc. v. Creative Pipe, Inc.

This ruling helps, at least, to get your head around what you should be looking at ….

"""
[The] Defendants [ ... ] have failed to provide the court with information regarding:

  • the keywords used;
  • the rationale for their selection;
  • the qualifications of M. Pappas and his attorneys to design an effective and reliable search and information retrieval method;
  • whether the search was a simple keyword search, or a more sophisticated one, such as one employing Boolean proximity operators;
  • or whether they analyzed the results of the search to assess its reliability, appropriateness for the task, and the quality of its implementation.
"""

3. United States v. O’Keefe

You see, it’s, like, complicated ….

"""
Whether search terms or ‘keywords’ will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics . . . . Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread.
"""

Reference:
* What is FRCP Compliance
* Sedona Best Practices on Info Retrieval and Search


ediscovery: look a little closer

July 7, 2009

I was confused by Lucid’s announcement that they would begin reselling ISYS Document Converter Filters as part of the Lucid Works Platform. The open source alternative, Tika, passed my admittedly cursory testing and seemed to be a strong candidate for enterprise adoption.

This week, though, I’ve been dragged into the eDiscovery pool headfirst, coming face to face with a new set of requirements for search and information retrieval that sheds new light on Lucid’s choice to use ISYS.

Specifically, how do you prove that the gather and extract process is non-lossy, that is, all information available in Electronically Stored Information (ESI) is available to the analyst in an ediscovery project?

In Bray & Gillespie v. Lexington, lossy conversion resulted in millions of dollars lost in legal and settlement fees. In brief, ESI consisting of emails, Word docs and PDFs was converted to TIFF files before being sent to the data warehouse for the delivery phase. While most of us understand the problem here, it took counsel some 50 pages of explanation to communicate this to the parties involved.

While this case seems extreme, I could imagine a similar case in which Apache Tika was used to convert content into XML representations that were then uploaded into a master data warehouse. In litigation, it’d be difficult to prove the effectiveness of the Tika converters, while the ISYS or Stellent or IBM ones, with precedent established in prior ediscovery cases, could be easily defended.

Lesson One: If you’re planning to do eDiscovery with your search, spring for the good converters.


goldmine

July 1, 2009

I don’t know anything about Hathi Trust, but in general, research results of this quality aren’t freely available. Thanks. I can take this report, hand it over to an engineer and say “use this as an outline to benchmark this week’s Lucidworks release vs. Fast.”

Hathi Trust Solr Benchmarking


faceted search in reverse

June 19, 2009

Typically, faceted search applications allow the user to refine an initial query by selecting or filtering on a set of data facets, e.g. Best Buy, Forrester, Volkswagen. The entry point is almost exclusively a text search box or a list of facets and facet values.

With Supercook, the model is reversed. Facets are searchable while the entities themselves are only findable.


classification engines

June 16, 2009

I just took the Reverend Bayes Classifier for a test drive.

I decided to benchmark classification engines by measuring their ability to classify a set of speech transcripts collected from Real Clear Politics by their author: Obama or McCain.

1. I took 10 speeches from each author and saved text locally
2. I trained the Reverend Classifier
3. I ran the Classifier on an ‘unknown’ speech and checked the results.


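import os
import sys

# Reverend ships its classifier in the reverend.thomas module
from reverend.thomas import Bayes

guesser = Bayes()

def train_directory(classifier, directory, label):
    # Train the classifier on every file in the directory under the given label
    for filename in os.listdir(directory):
        path = os.path.join(directory, filename)
        print('training %s with %s' % (label, path))
        with open(path) as handle:
            classifier.train(label, handle.read())

train_directory(guesser, 'obama', 'obama')
train_directory(guesser, 'mccain', 'mccain')

# Classify the file named on the command line
with open(sys.argv[1]) as handle:
    print(guesser.guess(handle.read()))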

Unfortunately, 10 speeches per author doesn’t look like enough of a training set: the McCain speech below still scores higher for Obama.
Here are the results for my two ‘unknown’ docs.

$ python test.py unknown/obama_10_21_Compete
[('obama', 0.84418617054938561), ('mccain', 0.40675278974192031)]

$ python test.py unknown/mccain_10_24_2008_Denver
[('obama', 0.72855668654272199), ('mccain', 0.61095539776103958)]
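To make the comparison less eyeball-driven, here is a small Python sketch that takes the list of (label, score) pairs printed above and reports the winning label and its margin over the runner-up. The helper name `top_guess` is made up, and the scores are copied straight from the two runs.

```python
def top_guess(guesses):
    """Return (label, score, margin) for a list of (label, score) pairs."""
    ranked = sorted(guesses, key=lambda pair: pair[1], reverse=True)
    best_label, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_label, best_score, best_score - runner_up


# Output of the two runs above, keyed by the true author of each speech
RUNS = {
    "obama": [("obama", 0.84418617054938561), ("mccain", 0.40675278974192031)],
    "mccain": [("obama", 0.72855668654272199), ("mccain", 0.61095539776103958)],
}

for truth, guesses in RUNS.items():
    label, score, margin = top_guess(guesses)
    print("%s speech -> guessed %s (margin %.3f)" % (truth, label, margin))
```

The McCain speech comes back as Obama, and with a much thinner margin than the Obama speech.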

As a next step, I’ll expand my test case and give a few other candidates a try: Fast Classifier, IBM’s offering, etc. Any open source suggestions?


solr searchcomponent query processing

June 8, 2009

Here’s the ‘process’ method of a SearchComponent that effectively reverses the query string. It’s useless outside of demonstrating Solr’s extensibility in runtime query processing using a pipeline-like design pattern.


@Override
public void process(ResponseBuilder rb) throws IOException {
        SolrQueryRequest request = rb.req;
        // Request parameters, needed to look up the query parser type
        SolrParams params = request.getParams();
        String defType = params.get(QueryParsing.DEFTYPE);
        defType = defType == null ? QParserPlugin.DEFAULT_QTYPE : defType;
        try {
                // 'reverse' is presumably a helper field on this component (not shown here)
                // whose process() returns the query string reversed
                QParser parser = QParser.getParser(reverse.process(rb.getQueryString()), defType, request);
                rb.setQuery(parser.getQuery());
                rb.setQparser(parser);
        } catch (ParseException e) {
                throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, e);
        }
}

In solrconfig.xml, add this line under the top level config node


  <searchComponent name="reverse" class="com.solrhack.ReverseComponent" />

and modify the default request handler’s first-components


  <requestHandler name="standard" class="solr.StandardRequestHandler">
    <!-- default values for query parameters -->
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <!--
       <int name="rows">10</int>
       <str name="fl">*</str>
       <str name="version">2.1</str>
        -->
     </lst>
        <arr name="first-components">
        <str>reverse</str>
        </arr>
  </requestHandler>

All right, it works, but by a shoestring.
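The pipeline idea is easier to see outside Solr. Below is a toy Python sketch, not Solr code: the class names and the `ctx` dict are stand-ins I made up, and the only thing borrowed from Solr is the ordering, where first-components run ahead of the default components.

```python
class ReverseComponent:
    """Stand-in for the Java component: reverses the raw query string."""

    def process(self, ctx):
        ctx["q"] = ctx["q"][::-1]


class QueryComponent:
    """Stand-in for the default query step: records the query it was handed."""

    def process(self, ctx):
        ctx["parsed"] = "parsed(%s)" % ctx["q"]


def run_pipeline(query, first_components, components):
    """Run first_components, then the default components, over a shared context."""
    ctx = {"q": query}
    for component in list(first_components) + list(components):
        component.process(ctx)
    return ctx


result = run_pipeline("beach house", [ReverseComponent()], [QueryComponent()])
print(result["parsed"])
```

Each component only touches the shared context, which is what lets a new step be slotted in from configuration alone.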


The case for Real Time Search – “Exhibit A”

June 1, 2009

Twitter: 1
Google: 0

twitter versus google


the new enterprise search

June 1, 2009

My friends at Zero Hedge have been nice enough to publish not only an exemplary use case of enterprise search for discovery and disclosure, but an outline of their solution and the results of their findings. This is the new Enterprise Search: it’s business intelligence, lexical analysis, faceted navigation, semantics, extraction, correlation and reporting.

Look at the tools in use by the Zero Hedge solution,

  • ksh
  • a parser written in C
  • SQL
  • Lots of Excel
  • Mathematica or some variant

A large organization finds itself with thousands of use cases similar to this one. They simply cannot afford to throw smart money ( developers, analysts, librarians, quants) at any significant subset of the problems presented. We need new tools, new ideas.

While Microsoft, Autonomy, or Endeca technology may be useful in providing insight into this use case, I’m more interested in the promises of Attivio or maybe Mark Logic in this space. Wondering what Wolfram would do. Others? Still listening? Go build it.

It’s a new case that challenges what it means to find and sits nicely between old school BI and search.

Most statisticians would not call this a “find” as 95% confidence intervals are the gold standard for this sort of work.


google rich data snippets and solr

May 18, 2009

Traditionally, there is a one-to-one relationship between items in a document corpus and their corresponding ‘Solr Document’ in a Solr index. So, for example, an index built from this page,

would contain a single ‘Solr Document’ which would be findable with queries like:

  • thrill jockey
  • beach house
  • eno

each returning a single search result, specifically, a link to the resource at

This is simple and straightforward, but at the same time, the resulting search experience can be annoying to users. While the other 49 reviews may be interesting, I’m only really concerned with what Eno is doing.

If, however, Google’s Rich Data Snippets really takes off, and users start publishing content that is meaningful to machines ( RDFa, Microformats ), the game will be completely changed.

In Google’s vision, Pitchfork would publish the top 50 embedded with RDFa or Microformats, kind of like this.


<span xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
   <span property="v:itemReviewed">David Byrne and Brian Eno</span>
   <span property="v:reviewer">Douglas Wolk</span>
   <span property="v:rating">5.0</span>
   <span property="v:dtreviewed">1st April 2008</span>
   <span property="v:summary">David Byrne has spent most of his career banging against the clear but impermeable window that
separates him from normalcy</span>
</span>
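Pulling the properties back out of markup like this is not much work. Here is a rough Python sketch using only the standard library; it assumes the snippet is well-formed XML (real pages are HTML and would need a more forgiving parser), and `review_properties` is a name I made up.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the snippet above
SNIPPET = """<span xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
   <span property="v:itemReviewed">David Byrne and Brian Eno</span>
   <span property="v:reviewer">Douglas Wolk</span>
   <span property="v:rating">5.0</span>
</span>"""


def review_properties(markup):
    """Collect every v:-prefixed property in the markup into a dict."""
    root = ET.fromstring(markup)
    props = {}
    for node in root.iter("span"):
        name = node.get("property")
        if name and name.startswith("v:"):
            # Drop the "v:" prefix and collapse whitespace in the value
            props[name[2:]] = " ".join((node.text or "").split())
    return props


print(review_properties(SNIPPET))
```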

This is cool and the largest step yet towards TBL’s vision of a semantic web, but it presents a problem as I try to implement in a search index.

To take advantage of the inherent structure in this updated version of the record review page, I’d require either an index that supports one-to-many object relations, or an index with 50 unique Documents, one for each record review.

I’m not sure the former is even possible with the tools I have available. While Solr supports multi valued string fields, there’s no built-in way to maintain the relationships defined in the RDFa representation.

I could come up with a new Solr field, something like, rdf_field, that supports structured data and corresponding one-to-many relationships. For example,


<doc>
<str name="title">Top 50 2008</str>
...
<lst name="reviews">
  <lst name="review/one">
     <str name="itemReviewed">David Byrne and Brian Eno</str>
     <str name="reviewer">Douglas Wolk</str>
   </lst>
  <lst name="review/two">
     <str name="itemReviewed">Grizzly Bear</str>
     <str name="reviewer">Douglas Wolk</str>
   </lst>
</lst>
...
</doc>

This is similar to Fast scope fields, which we’ve used to implement some similar functionality.

An alternative is to index every review as a unique SolrDocument with its own unique contentid, something like …


<doc>
<str name="contentid">http://pitchfork.com/features/staff-lists/7573-the-50-best-albums-of-2008/#41</str>
<str name="url">http://pitchfork.com/features/staff-lists/7573-the-50-best-albums-of-2008/</str>
<str name="itemReviewed">David Byrne and Brian Eno</str>
<str name="reviewer">Douglas Wolk</str>
</doc>
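Generating those per-review documents is mechanical once the reviews are extracted. A Python sketch under a couple of assumptions: the contentid is the page URL plus a fragment made from the review’s position in the list (the example above uses #41, so the real scheme may differ), and `review_docs` is a hypothetical helper, not part of any Solr client.

```python
import xml.etree.ElementTree as ET

PAGE_URL = "http://pitchfork.com/features/staff-lists/7573-the-50-best-albums-of-2008/"


def review_docs(page_url, reviews):
    """Build one <doc> element per review, keyed by page URL plus a fragment."""
    docs = []
    for position, review in enumerate(reviews, start=1):
        doc = ET.Element("doc")
        # Assumed id scheme: page URL + "#" + position in the list
        fields = {"contentid": "%s#%d" % (page_url, position), "url": page_url}
        fields.update(review)
        for name, value in fields.items():
            field = ET.SubElement(doc, "str", name=name)
            field.text = value
        docs.append(doc)
    return docs


reviews = [
    {"itemReviewed": "David Byrne and Brian Eno", "reviewer": "Douglas Wolk"},
    {"itemReviewed": "Grizzly Bear", "reviewer": "Douglas Wolk"},
]
for doc in review_docs(PAGE_URL, reviews):
    print(ET.tostring(doc, encoding="unicode"))
```

The trade-off is that the page-level context (title, the other 49 reviews) has to be copied onto every document or given up.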

Who knows what Google is going to do? They risk further disenfranchising content authors (NYTIMES RDFa?) if they take an objectionable approach.

For me though, this is big, because an implementation for the enterprise would go a long way towards moving the responsibility for relevancy from the search engine technology ( a.k.a. this engine sucks, why can’t we use Google) back to the content authors.

I’m just now thinking about this and playing around a bit. I understand I’m in the weeds here. Taking requests.