I’m getting closer to an early alpha release of an XML scanner for Solr. Xml Scanner is pure Python software that can be used to update and maintain the state of Solr indexes. I’m working through the class model now and building prototype implementations as I go.
Sending adds and deletes to Solr is simple enough. The Python code below will handle it.
    import sys
    from httplib import HTTPConnection


    class Solr:
        def __init__(self):
            self.server = 'localhost'
            self.port = str(8080)

        def add(self, doc):
            """
            Add a document to the index, where doc is the text
            representation of a Solr <add> message.
            """
            return self.send(doc)

        def commit(self):
            """
            Commit pending changes so they become searchable.
            """
            # Solr expects a <commit/> message here; an empty body commits nothing.
            post = '<commit/>'
            return self.send(post)

        def send(self, post):
            # POST the XML message to the update handler.
            con = HTTPConnection(self.server + ':' + self.port)
            con.putrequest('POST', '/solr/update/')
            con.putheader('content-length', str(len(post)))
            con.putheader('content-type', 'text/xml; charset=UTF-8')
            con.endheaders()
            con.send(post)
            r = con.getresponse()
            # Return 0 on HTTP 200, -1 on anything else.
            if str(r.status) == '200':
                print r.read()
                return 0
            else:
                print r.status
                print r.read()
                return -1


    def main():
        # Usage: pass the path of a file containing an <add> message.
        solr = Solr()
        solr.add(open(sys.argv[1]).read())
        solr.commit()


    if __name__ == "__main__":
        main()
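The class above has no delete method yet. Here is a minimal sketch of the message such a method could post, assuming Solr's standard XML delete-by-id format. `delete_by_id_xml` is a hypothetical helper name, not part of Xml Scanner.

```python
from xml.sax.saxutils import escape


def delete_by_id_xml(doc_id):
    """Return a Solr XML delete message for a single document id.

    Hypothetical helper for illustration only.
    """
    # escape() guards against &, < and > appearing in the id value.
    return '<delete><id>%s</id></delete>' % escape(str(doc_id))


print(delete_by_id_xml(2234))
```

The resulting string could be passed to `send()` and followed by `commit()`, the same way an add is.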
Unfortunately, I found some unexpected behavior when testing updates. Consider the requirement to update a single field of a document when only the primary key and the new field value are available at the time of execution. A common example is a product catalog index kept in sync with a relational database, in which catalog items should be pulled from the search index when the ‘instock’ field is set to 0 in the database table.
This requires the ability to update a single field in the index without overwriting the rest of the indexed data. I tried this out by first sending a GE oven using my Python class:
    <add><doc>
      <field name="id">2234</field>
      <field name="name">Monogram Oven</field>
      <field name="manu">General Electric</field>
      <field name="inStock">1</field>
    </doc></add>
and then trying to update the entry by sending the following:
    <add><doc>
      <field name="id">2234</field>
      <field name="inStock">0</field>
    </doc></add>
This doesn’t work at all. The second XML snippet clobbers the first, overwriting the product detail in the search index. A Google search turns up evidence that this has been addressed, and possibly fixed, in Solr 1.3 (SOLR-139 in JIRA). Hopefully so, but until then I’ll need the entire document contents to do an update.
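In the meantime, one workaround is to rebuild the whole document from the source record and resend it with the changed value merged in. Here is a rough sketch of that idea, written for Python 3 rather than the Python 2 of the class above. `build_add_xml` and `full_update_xml` are hypothetical names, and the example record stands in for a row that would really come from the database.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Serialize a dict of field name -> value into a Solr <add> message."""
    add = ET.Element('add')
    doc = ET.SubElement(add, 'doc')
    for name, value in fields.items():
        field = ET.SubElement(doc, 'field', {'name': name})
        field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding='unicode')


def full_update_xml(current_fields, changes):
    """Merge changed values into the complete record and rebuild the document."""
    merged = dict(current_fields)  # copy, so the caller's record is untouched
    merged.update(changes)
    return build_add_xml(merged)


# Example record (assumed); in practice it would be read from the database.
oven = {
    'id': '2234',
    'name': 'Monogram Oven',
    'manu': 'General Electric',
    'inStock': '1',
}
print(full_update_xml(oven, {'inStock': '0'}))
```

Because the message carries every field, the overwrite that Solr performs leaves the product detail intact. The cost is that the full record has to be available at update time.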