Thursday, April 12, 2007

Faking cursors in ActiveRecord

Faking cursors in ActiveRecord
from the { buckblogs :here } - Home by Jamis

There are times (like, in a migration, or a cron job) where I want to operate on large numbers of rows in the database, such as for billing, where you want to select all accounts who are due for automatic renewal, or when adding a new column to a table that you need to prepopulate with computed data.

One way to do that is just to brute force it:
Account.find(:all).each do |account|
# ...
end
The drawback here is obvious: when you’re dealing with hundreds of thousands or even millions of rows, selecting them all into memory at once is brutal. And since ActiveRecord doesn’t support cursor-based operations, you can’t just ask ActiveRecord to return the rows as it reads them.

Here’s a trick I’ve been using recently to query large result sets while being friendly to the computer:


class < def each(limit=1000)
rows = find(:all, :conditions => ["id > ?", 0], :limit => limit)
while rows.any?
rows.each { |record| yield record }
rows = find(:all, :conditions => ["id > ?", rows.last.id], :limit => limit)
end
self
end
end

Account.each do |account|
# ...
end

Sadly, this won’t work on every DBMS, or with every query; it exploits several idiosyncrasies of MySQL which might not be present on other DBMSs:

* MySQL sorts indexes.
* The primary key is an index.
* Queries which MySQL determines can be best satisfied by the primary key, then, will be returned in sorted order.

This means that if you try to add additional conditions to the query, you’ll also need to add an :order clause to sort by the id…and this will more than likely cause the performance of the query to go down the tubes. But for those queries where you just want to select every row anyway, it works quite well. You could use OFFSET and LIMIT, but OFFSET begins to be really, really slow when the OFFSET is in the tens of thousands or higher because it has to count through that many rows before finding where to begin returning data. Basing the query on id, like this, has the advantage of speed, because the database can use indexes like it was meant to.

No comments :