User:Zwitter/Search design

some thinking about the new search.

when i implemented lucene for Wikipedia, i used a custom protocol to communicate with the front end. Brion later replaced this with an HTTP server. i think this was a mistake and makes for unnecessary maintenance troubles; instead i will probably use Tomcat as a front-end server to the search.

 request:
  GET /search?wiki=anglish&term=some+search
 return:
  results=20
  result=anglish|some_page
  result=wikicities|some_other_page

(XML-RPC/SOAP? WSDL? i don't think there's any point, it's not an external service)
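the line-based reply is trivial to produce on the server side. a sketch (formatResponse and the wiki|title pairs are illustrative helpers, not actual code):

```java
import java.util.Arrays;
import java.util.List;

public class SearchResponse {
    // format hits as the plain-text protocol above;
    // each hit is a {wiki, page_title} pair
    static String formatResponse(List<String[]> hits) {
        StringBuilder sb = new StringBuilder();
        sb.append("results=").append(hits.size()).append('\n');
        for (String[] hit : hits) {
            sb.append("result=").append(hit[0]).append('|').append(hit[1]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(formatResponse(Arrays.asList(
            new String[]{"anglish", "some_page"},
            new String[]{"wikicities", "some_other_page"})));
    }
}
```

served from a Tomcat servlet this is just a write to the response stream, with no custom protocol code to maintain.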

need to decide how to order the results (main wiki first, then the rest?) and how to display them on the page. local pages need to be given some higher ranking. add an option to prevent other wikis from being searched. maybe split the page in two, but this will be annoying on small displays; could use JS to let the user hide one section or the other.
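the local-wiki boost could be as simple as multiplying scores before sorting. a sketch, assuming each hit carries its wiki name and a base score (the Hit layout and rerank are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class LocalBoost {
    static class Hit {
        final String wiki, page;
        double score;
        Hit(String wiki, String page, double score) {
            this.wiki = wiki; this.page = page; this.score = score;
        }
    }

    // multiply the score of hits from the local wiki by a boost
    // factor, then sort the whole list best-first
    static void rerank(List<Hit> hits, String localWiki, double boost) {
        for (Hit h : hits)
            if (h.wiki.equals(localWiki))
                h.score *= boost;
        hits.sort((a, b) -> Double.compare(b.score, a.score));
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("wikicities", "some_other_page", 1.0));
        hits.add(new Hit("anglish", "some_page", 0.8));
        rerank(hits, "anglish", 2.0);
        System.out.println(hits.get(0).wiki + "|" + hits.get(0).page);
        // the local page now sorts first
    }
}
```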

the wiki name can be stored as a lucene document attribute.
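with the lucene 1.x API that's a Keyword field (stored, indexed, not tokenized), so the exact wiki name can be matched in queries and returned with hits. field names here are a choice, not fixed:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PageDoc {
    // build a lucene document that carries its source wiki
    static Document makeDoc(String wiki, String title, String text) {
        Document doc = new Document();
        doc.add(Field.Keyword("wiki", wiki));   // untokenized, exact-match
        doc.add(Field.Text("title", title));
        doc.add(Field.Text("contents", text));
        return doc;
    }
}
```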

need some way to update the index without interrupting service: lucene doesn't let you add documents while the index is open.

could let the server know about new pages from the wiki, or else poll the database every few minutes. store the current timestamp _after_ update and look for pages after this: don't lose any updates from crashing.

updates
a method is needed to update pages that are already in the index.

lucene handles the locking between updating and searching automatically, i.e. you can do both at once and searches will block until the update is done.
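that behaviour can be modelled as a reader/writer lock around the index. a rough plain-java sketch of the semantics, not lucene's actual internals:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class IndexLock {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // any number of searches may run concurrently
    String search(String term) {
        lock.readLock().lock();
        try {
            return "results for " + term;   // placeholder for the lucene search
        } finally {
            lock.readLock().unlock();
        }
    }

    // an update takes the write lock, so searches arriving while it
    // runs block until the update has finished
    void update(String page) {
        lock.writeLock().lock();
        try {
            // placeholder for delete + re-add of the document
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```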


 * have a separate daemon which polls the database every minute looking for changed articles (cur_timestamp >= X). read the contents, write them to the index.
 * store the last checked timestamp for each db in last_updated table
 * possibly use HTTP for incremental indexing to allow parsing of templates etc? or else call some batch parser... (Zend is embeddable, may be possible to have a JNI-based wikitext parser)
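one pass of that daemon could be sketched like this: fetchChangedSince stands in for the cur_timestamp query, indexPage for the index write, and the returned value is what goes into last_updated (all names hypothetical, and the timestamp representation is simplified):

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class Updater {
    // one polling pass: index everything changed since `last`. the pass
    // timestamp is taken *before* the query and only written to
    // last_updated *after* the index writes succeed, so a crash mid-pass
    // just re-indexes the same pages next time instead of losing them.
    static long runPass(long last,
                        Function<Long, List<String>> fetchChangedSince,
                        Consumer<String> indexPage) {
        long passStart = System.currentTimeMillis();
        for (String page : fetchChangedSince.apply(last)) {
            indexPage.accept(page);
        }
        return passStart;  // caller stores this in last_updated
    }
}
```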

need a way to identify documents in the index so they can be removed. store namespace+title as a keyword field (the current index only stores them as tokenized text).
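e.g. give each document one untokenized key built from namespace and title; lucene 1.x can then remove it with IndexReader.delete(new Term("key", ...)). a sketch of the key itself (the exact format is a choice):

```java
public class DocKey {
    // unique key identifying a page within one wiki's index,
    // stored as an untokenized keyword field
    static String docKey(int namespace, String title) {
        return namespace + ":" + title.replace(' ', '_');
    }
}
```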

for deleted articles: use the logging table to find articles which have been removed and delete them. need to do this before the other updates so deleted and re-created articles don't get lost.
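a toy model of one cycle over an in-memory map shows why the deletes have to come first (runCycle and the structures are illustrative only):

```java
import java.util.List;
import java.util.Map;

public class Cycle {
    // one update cycle: deletions found in the logging table are
    // applied before the changed pages are re-indexed, so a page that
    // was deleted and then re-created ends up present, not lost
    static void runCycle(Map<String, String> index,
                         List<String> deletedTitles,
                         Map<String, String> changedPages) {
        for (String title : deletedTitles)
            index.remove(title);
        index.putAll(changedPages);   // re-created pages come back here
    }
}
```

with the order reversed, the delete would wipe out the freshly re-indexed copy of a re-created page.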