Posted via email from Rich’s posterous

Project review: ifwinsight.com

by Rich Brant on December 6, 2009

It’s easy to forget what you’ve learned and which tools you used from one project to the next. I thought it might be worthwhile to sum these things up, either weekly or per project. I had a lot of fun on a recent project and thought it would be a good place to start: I built what is described as a ‘tool for intelligently searching US patent application Image File Wrappers (IFWs).’

Technically, the system allows users to upload PDF documents and have their content indexed and made searchable. The documents are fairly large, averaging 25 MB each with several hundred pages, so once uploaded to the server they are handed to delayed_job to be processed in the background. I’m using Collective Idea’s fork; Ryan Bates’ screencast on delayed_job points out that this fork has a few generators and rake tasks that are not part of the original.

In order to index the document, the PDF needs to be examined by OCR (optical character recognition) software. But before the OCR software can do its OCR-ing, it needs an image to examine, so the individual PDF pages have to be converted into images. To accomplish that, I used Ghostscript. It’s really easy to use and fast. You hand Ghostscript the document, and it churns out an image of each PDF page at the resolution of your choice. I’m using 300×300 DPI, which seems to be a nice balance between processing time, space, and readability/OCR results.
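
As a rough sketch of that conversion step, the Ruby below builds a Ghostscript command line that renders each page of a PDF to its own PNG at 300 DPI. The `png16m` device, the `page-%04d.png` naming pattern, and the helper names `ghostscript_command` and `convert_pdf_to_images` are assumptions for illustration, not necessarily what the site itself uses.

```ruby
# Build the argument list for rendering every page of a PDF to a PNG.
# Assumptions: the png16m output device and the page-%04d.png naming scheme.
def ghostscript_command(pdf_path, output_dir, dpi = 300)
  [
    "gs",
    "-dNOPAUSE", "-dBATCH", "-dSAFER", # run non-interactively and exit when done
    "-sDEVICE=png16m",                 # 24-bit colour PNG output
    "-r#{dpi}",                        # resolution in DPI (300 means 300x300)
    "-sOutputFile=#{File.join(output_dir, 'page-%04d.png')}", # one file per page
    pdf_path
  ]
end

# Passing the arguments to system separately avoids shell quoting problems
# with file names that contain spaces.
def convert_pdf_to_images(pdf_path, output_dir)
  system(*ghostscript_command(pdf_path, output_dir))
end
```

Zero-padding the page number in the file name keeps the images in page order when they are sorted by name later on.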

Once the document has been converted into images, the OCR software, tesseract-ocr, iterates through each image and produces a text file with the contents of the page. Now, with a directory full of text files, it’s time to store the contents of each page in the database and make them searchable. That’s where Sphinx and Thinking Sphinx come into play. Sphinx is the full text search engine and Thinking Sphinx is a ‘concise and easy-to-use Ruby library that connects ActiveRecord to the Sphinx search daemon, managing configuration and searching.’ I actually started the project with ferret/acts_as_ferret, but after reading so many good reviews of Sphinx, and having my own problems with ferret, I switched. The only downside is that setup is a little trickier and Thinking Sphinx doesn’t automatically update the index the way acts_as_ferret does, so there’s a cron job that handles that. The indexer is super fast though, so frequent indexing isn’t a problem.
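
The OCR pass and the step that collects the resulting text can be sketched roughly as follows. Tesseract is called with an image and an output base name and writes `<base>.txt`. The `page-*.png` file pattern and the helper names `tesseract_command`, `ocr_images`, and `page_texts` are hypothetical, chosen only for this example.

```ruby
# Tesseract takes an image path and an output base name, and writes <base>.txt.
def tesseract_command(image_path)
  base = image_path.sub(/\.[^.\/]+\z/, "") # strip the file extension
  ["tesseract", image_path, base]
end

# Run Tesseract over every page image in a directory (file pattern is assumed).
def ocr_images(image_dir)
  Dir.glob(File.join(image_dir, "page-*.png")).sort.each do |image|
    system(*tesseract_command(image))
  end
end

# Read the per-page text files back as [page_number, text] pairs, in page order.
# Relies on zero-padded file names so that a plain sort matches page order.
def page_texts(text_dir)
  Dir.glob(File.join(text_dir, "page-*.txt")).sort.map do |path|
    number = File.basename(path, ".txt")[/\d+/].to_i
    [number, File.read(path)]
  end
end
```

Each pair returned by `page_texts` could then be saved as a row in whatever model the search index is built on.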

The site also offers multiple file download, and I used the rubyzip library which makes it simple to zip up a bunch of docs into one.

As for design, we used a theme from themeforest.net. I was impressed by the quality of the generic templates they have. They aren’t free, but they are dirt cheap: $5 or $10 for most. I’ve used

It’s a Rails application, so as for gems/plugins, the usual suspects are there: acts_as_commentable, exception_notification, restful_authentication, role_requirement, mislav-will_paginate, attachment_fu, and a few others: slicehost, thinking-sphinx, delayed_job, and rubyzip.

The real jewel in the list is ‘slicehost’, which gives you a bunch of rake tasks for setting up your slice at Slicehost, my favorite hosting provider.

One other thing worth mentioning is an issue with the delayed_job process not stopping properly during deploys: the instance running during the deploy never stopped, so I kept ending up with multiple instances of delayed_job running. The problem was noted in the issues section on GitHub (with the solution below), but I can’t find it now. Basically, the restart task looks like this:

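namespace :delayed_job do  # enclosing namespace assumed; the original snippet starts inside it
  desc "Restart the delayed_job process"
  task :restart, :roles => :app do
    stop                                    # ask the running worker to shut down
    wait_for_process_to_end('delayed_job')  # block until it has actually exited
    start                                   # only then launch a fresh worker
  end
end

# Poll the remote process list every 2 seconds until no matching process remains.
def wait_for_process_to_end(process_name)
  run "COUNT=1; until [ $COUNT -eq 0 ]; do COUNT=`ps -ef | grep -v 'ps -ef' | grep -v 'grep' | grep -i '#{process_name}'|wc -l` ; echo 'waiting for #{process_name} to end' ; sleep 2 ; done"
end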

As a final note, all the software used in this project is open source. I’m constantly reminded of that and impressed by it. My thanks to all who have contributed to the software used in this project!

upgrading to snow leopard – issues with mysql and gems

August 30, 2009

It was definitely worth upgrading from Leopard to Snow Leopard. My machine is noticeably quicker and I picked up quite a bit of storage space, but the upgrade was not without its issues on the dev side of things. The most affected area so far has been MySQL.
You need to get the 64-bit version:
http://dev.mysql.com/downloads/mysql/5.1.html#macosx-dmg
I know [...]

Read the full article →

Awesome talk at biz of software conf: Seth Godin on why marketing is too important to be left to the marketing department

July 31, 2009

Read the full article →

‘Maker’s Schedule, Manager’s Schedule’ by Paul Graham

July 24, 2009

http://www.paulgraham.com/makersschedule.html  so incredibly well said.

Read the full article →