Posted via email from Rich’s posterous
It’s easy to forget what you’ve learned and which tools you used from project to project, so I thought it might be worthwhile to sum these things up, either weekly or per project. I had a lot of fun on a recent project and thought it would be a good place to start. I recently built what is described as a ‘tool for intelligently searching US patent application Image File Wrappers (IFWs).’
Technically, the system lets users upload PDF documents and have their content indexed and made searchable. The documents are fairly large, averaging 25 MB each with several hundred pages, so once they are uploaded to the server they are handed to delayed_job to be processed in the background. I’m using Collective Idea’s fork: Ryan Bates’ screencast on delayed_job points out that this fork has a few generators and rake tasks that aren’t part of the original.
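To give a feel for the hand-off to delayed_job, here is a minimal sketch of a job object. delayed_job will run any object that responds to `perform`; the `ProcessDocumentJob` and `DocumentProcessor` names are made up for illustration and are not the ones used in the project.

```ruby
# Hypothetical stand-in for the real OCR and indexing pipeline.
module DocumentProcessor
  def self.process(document_id)
    "processed document #{document_id}"
  end
end

# delayed_job runs any object that responds to #perform. A Struct keeps the
# job small, since it only carries the id of the uploaded document.
ProcessDocumentJob = Struct.new(:document_id) do
  def perform
    DocumentProcessor.process(document_id)
  end
end

# In the Rails app, the upload action would queue the job rather than run it:
#   Delayed::Job.enqueue ProcessDocumentJob.new(document.id)
# Outside Rails, the job can be exercised directly:
job = ProcessDocumentJob.new(42)
puts job.perform
```

Passing only the id (rather than the whole record) keeps the serialized job small and makes sure the worker loads fresh data when it finally runs.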
In order to index the document, the PDF needs to be examined by OCR (optical character recognition) software. But before the OCR software can do its OCR-ing, it needs to have an image to examine, so we need to convert the individual PDF pages into images. To accomplish that, I used Ghostscript. It’s really easy to use and fast. You can hand Ghostscript the document, and it will churn out an image of each PDF page, in the resolution of your choice. I’m using 300×300 dpi, which seems to be a nice balance between processing time, disk space, and readability/OCR results.
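For reference, a Ghostscript call for this kind of conversion can be sketched roughly as follows. The flags are standard Ghostscript options, but the `tiffgray` output device, the file naming, and the `ghostscript_command` helper are my assumptions here, not necessarily what the project used.

```ruby
# Builds the argument list for a Ghostscript run that renders every page of a
# PDF to its own image file. The output device and file naming are assumptions.
def ghostscript_command(pdf_path, output_dir, dpi = 300)
  output_pattern = File.join(output_dir, "page-%04d.tif")
  [
    "gs",
    "-dNOPAUSE", "-dBATCH", "-dSAFER",  # run non-interactively and exit when done
    "-sDEVICE=tiffgray",                # grayscale TIFF output (assumed device)
    "-r#{dpi}",                         # resolution in dpi, same for both axes
    "-sOutputFile=#{output_pattern}",   # Ghostscript fills %04d with the page number
    pdf_path
  ]
end

cmd = ghostscript_command("wrapper.pdf", "/tmp/pages")
puts cmd.join(" ")
# To actually run the conversion: system(*cmd)
```

Building the command as an array and passing it to `system(*cmd)` avoids shell quoting problems with odd file names.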
Once the document has been converted into images, the OCR software, tesseract-ocr, iterates through each image and produces a text file with the contents of the page. Now, with a directory full of text files, it’s time to store the contents of each page in the database and index them. That’s where Sphinx and Thinking Sphinx come into play. Sphinx is the full-text search engine and Thinking Sphinx is a ‘concise and easy-to-use Ruby library that connects ActiveRecord to the Sphinx search daemon, managing configuration and searching.’ I actually started the project with ferret/acts_as_ferret, but after reading so many good reviews of Sphinx, and having my own problems with ferret, I switched. The only downsides are that setup is a little trickier and that Thinking Sphinx doesn’t automatically update the index the way acts_as_ferret does, so there’s a cron job that handles that. The indexer is super fast though, so frequent indexing isn’t a problem.
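The step from OCR output to database rows can be sketched like this. Tesseract is invoked as `tesseract <image> <output base>` and writes `<output base>.txt`; the helper names, the `page-NNNN` file naming, and the hash layout are assumptions for illustration. In the app, each hash would become a row in a pages table that Thinking Sphinx indexes.

```ruby
require "tmpdir"

# Returns the argument list for running tesseract on one page image.
# Tesseract appends ".txt" to the output base on its own.
def tesseract_command(image_path)
  output_base = image_path.sub(/\.[^.\/]+\z/, "")
  ["tesseract", image_path, output_base]
end

# Reads the per-page text files in a directory and returns one hash per page,
# ordered by the page number embedded in the file name.
def collect_pages(dir)
  pages = Dir.glob(File.join(dir, "page-*.txt")).map do |path|
    number = File.basename(path, ".txt")[/\d+/].to_i
    { :page_number => number, :content => File.read(path).strip }
  end
  pages.sort_by { |page| page[:page_number] }
end

# Small demonstration with two fake OCR results.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "page-0002.txt"), "claims\n")
  File.write(File.join(dir, "page-0001.txt"), "abstract\n")
  collect_pages(dir).each { |page| puts "#{page[:page_number]}: #{page[:content]}" }
end
```

Storing one row per page, rather than one per document, is what lets a search result point at the specific page where the match occurred.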
The site also offers multiple file download, and I used the rubyzip library which makes it simple to zip up a bunch of docs into one.
As for design, we used a theme from themeforest.net. I was impressed by the quality of generic templates they have. They aren’t free, but are dirt cheap – $5 or $10 for most. I’ve used
It’s a Rails application, so as for gems/plugins, the usual suspects are there: acts_as_commentable, exception_notification, restful_authentication, role_requirement, mislav-will_paginate, attachment_fu, and a few others: slicehost, thinking-sphinx, delayed_job, and rubyzip.
The real jewel in the list is ‘slicehost’, which gives you a bunch of rake tasks for setting up your slice at Slicehost, which is my favorite hosting provider.
One other thing worth mentioning was an issue with the delayed_job process not stopping properly during deploys. The instance running at deploy time never stopped, so I kept ending up with multiple instances of delayed_job running. It was noted on GitHub (with the solution below) in the issues section, but I can’t find it now. Basically, the restart task looks like this:
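namespace :delayed_job do
  # The stop and start tasks called below live in this same namespace.
  desc "Restart the delayed_job process"
  task :restart, :roles => :app do
    stop
    # Don't start a new worker until the old one has really gone away.
    wait_for_process_to_end('delayed_job')
    start
  end
end

# Polls the process list on the server every two seconds until no process
# matching process_name is left.
def wait_for_process_to_end(process_name)
  run "COUNT=1; until [ $COUNT -eq 0 ]; do COUNT=`ps -ef | grep -v 'ps -ef' | grep -v 'grep' | grep -i '#{process_name}'|wc -l` ; echo 'waiting for #{process_name} to end' ; sleep 2 ; done"
end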
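One caveat with the loop above is that it never gives up if the process refuses to die, which would hang the deploy. Here is a minimal sketch of the same polling idea with a timeout, written in plain Ruby. The `wait_for_zero` helper is hypothetical and not part of delayed_job or Capistrano; in a recipe the block would have to obtain the count from the server, for example from the same `ps` pipeline.

```ruby
# Polls the given block until it returns 0 or the timeout (in seconds) expires.
# Returns true if the count reached zero, false if the wait timed out.
def wait_for_zero(timeout = 60, interval = 2)
  deadline = Time.now + timeout
  loop do
    return true if yield == 0
    return false if Time.now >= deadline
    sleep interval
  end
end

# Demonstration with a fake process count that drops to zero on the third check.
remaining = [2, 1, 0]
stopped = wait_for_zero(5, 0) { remaining.shift }
puts stopped ? "process ended" : "gave up waiting"
```

Returning false on timeout lets the caller decide whether to abort the deploy or kill the worker, instead of waiting forever.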
As a final note, all the software used in this project is open source. I’m constantly reminded of that and impressed by it. My thanks to all who have contributed to the software used in this project!