Traditionally, there is a one-to-one relationship between the items in a document corpus and their corresponding ‘Solr Document’ in a Solr index. So, for example, an index built from this page would contain a single ‘Solr Document’, which would be findable with queries like:
- thrill jockey
- beach house
- eno
each returning a single search result: a link to that one page.
This is simple and straightforward, but the resulting search experience can be annoying to users. A search for eno returns the whole page, and while the other 49 reviews may be interesting, I’m only really concerned with what Eno is doing.
If, however, Google’s Rich Snippets really take off, and users start publishing content that is meaningful to machines (RDFa, Microformats), the game will be completely changed.
In Google’s vision, Pitchfork would publish the top 50 with embedded RDFa or Microformats, kind of like this:
<span xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
<span property="v:itemReviewed">David Byrne and Brian Eno</span>
<span property="v:reviewer">Douglas Wolk</span>
<span property="v:rating">5.0</span>
<span property="v:dtreviewed">1st April 2008</span>
<span property="v:summary">David Byrne has spent most of his career banging against the clear but impermeable window that
separates him from normalcy</span>
</span>
This is cool, and the largest step yet towards TBL’s vision of a semantic web, but it presents a problem when I try to implement it in a search index.
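To get a feel for what an indexer would actually pull out of that markup, here is a rough Python sketch. It assumes the snippet is well-formed XML and that the prefix is literally v:, which a real RDFa parser would not take for granted (real pages are HTML, and prefixes have to be resolved against their namespace). The function name extract_reviews is mine, not part of any library.

```python
import xml.etree.ElementTree as ET

# The review markup from above, trimmed to a few properties.
RDFA = """
<span xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
  <span property="v:itemReviewed">David Byrne and Brian Eno</span>
  <span property="v:reviewer">Douglas Wolk</span>
  <span property="v:rating">5.0</span>
</span>
"""

def extract_reviews(markup):
    """Return one dict per element marked typeof="v:Review" (hypothetical helper)."""
    root = ET.fromstring(markup)
    reviews = []
    for node in root.iter():
        if node.get("typeof") != "v:Review":
            continue
        review = {}
        for child in node.iter():
            prop = child.get("property")
            if prop and prop.startswith("v:"):
                # Drop the "v:" prefix and collapse whitespace in the value.
                review[prop[2:]] = " ".join((child.text or "").split())
        reviews.append(review)
    return reviews

reviews = extract_reviews(RDFA)
print(reviews)
```

Each review comes out as its own little record, which is exactly the structure a single flat Solr Document has trouble holding on to.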
To take advantage of the inherent structure in this updated version of the record review page, I’d need either an index that supports one-to-many object relations or an index with 50 unique Documents, one for each record review.
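I’m not sure the former is even possible with the tools I have available. While Solr supports multivalued string fields, there’s no built-in way to maintain the relationships defined in the RDFa representation: the itemReviewed values and the reviewer values would end up in two flat lists, with nothing tying a reviewer to a record.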
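The problem with flattening is easy to show without Solr at all. The sketch below is a plain Python model of multivalued fields, not Solr code, and the second reviewer name is made up so the two reviews differ. It shows a query matching a record/reviewer pair that never appears together on the page.

```python
# Two reviews from the page; "Jane Example" is an invented reviewer for illustration.
reviews = [
    {"itemReviewed": "David Byrne and Brian Eno", "reviewer": "Douglas Wolk"},
    {"itemReviewed": "Grizzly Bear", "reviewer": "Jane Example"},
]

def flatten(reviews):
    """Collapse all reviews into one document with multivalued fields."""
    doc = {}
    for review in reviews:
        for field, value in review.items():
            doc.setdefault(field, []).append(value)
    return doc

def matches_flat(doc, **criteria):
    """Each criterion only has to match some value of its own field."""
    return all(value in doc.get(field, []) for field, value in criteria.items())

def matches_structured(reviews, **criteria):
    """All criteria have to hold within the same review."""
    return any(all(r.get(f) == v for f, v in criteria.items()) for r in reviews)

page = flatten(reviews)
query = {"itemReviewed": "Grizzly Bear", "reviewer": "Douglas Wolk"}

print(matches_flat(page, **query))           # True: a false positive
print(matches_structured(reviews, **query))  # False: no single review matches
```

The flat document answers yes to "Grizzly Bear reviewed by Douglas Wolk" even though, in this toy data, no such review exists.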
I could come up with a new Solr field type, something like rdf_field, that supports structured data and the corresponding one-to-many relationships. For example:
<doc>
<str name="title">Top 50 2008</str>
...
<lst name="reviews">
<lst name="review/one">
<str name="itemReviewed">David Byrne and Brian Eno</str>
<str name="reviewer">Douglas Wolk</str>
</lst>
<lst name="review/two">
<str name="itemReviewed">Grizzly Bear</str>
<str name="reviewer">Douglas Wolk</str>
</lst>
</lst>
...
</doc>
This is similar to FAST scope fields, which we’ve used to implement similar functionality.
An alternative is to index every review as a unique SolrDocument with its own unique contentid, something like …
<doc>
<str name="contentid">http://pitchfork.com/features/staff-lists/7573-the-50-best-albums-of-2008/#41</str>
<str name="url">http://pitchfork.com/features/staff-lists/7573-the-50-best-albums-of-2008/</str>
<str name="itemReviewed">David Byrne and Brian Eno</str>
<str name="reviewer">Douglas Wolk</str>
</doc>
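Here is a minimal sketch of how that split might be produced, assuming the reviews have already been extracted as Python dicts. It emits Solr's XML update format (add/doc/field), which is what you post to the index, rather than the response-style markup shown above. The helper name reviews_to_add_xml and the url#position scheme for contentid are my own assumptions, and position 42 for the second review is invented.

```python
import xml.etree.ElementTree as ET

PAGE_URL = "http://pitchfork.com/features/staff-lists/7573-the-50-best-albums-of-2008/"

def reviews_to_add_xml(page_url, reviews):
    """Build a Solr <add> message with one <doc> per (position, review) pair."""
    add = ET.Element("add")
    for position, review in reviews:
        doc = ET.SubElement(add, "doc")
        # contentid is unique per review; url still points at the shared page.
        fields = {"contentid": f"{page_url}#{position}", "url": page_url, **review}
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = value
    return ET.tostring(add, encoding="unicode")

reviews = [
    (41, {"itemReviewed": "David Byrne and Brian Eno", "reviewer": "Douglas Wolk"}),
    (42, {"itemReviewed": "Grizzly Bear", "reviewer": "Douglas Wolk"}),  # position is invented
]

xml_message = reviews_to_add_xml(PAGE_URL, reviews)
print(xml_message)
```

The trade-off is that a query for the page as a whole now returns up to 50 hits sharing one url, so the search front end would probably want to group or collapse on that field.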
Who knows what Google is going to do. If they take an objectionable approach, they risk further disenfranchising content authors completely (NYTimes RDFa?).
For me though, this is big, because an implementation for the enterprise would go a long way towards moving the responsibility for relevancy from the search engine technology (a.k.a. “this engine sucks, why can’t we use Google?”) back to the content authors.
I’m just now thinking about this and playing around a bit. I understand I’m in the weeds here. Taking requests.