AWSRecord: Adding SimpleDB to S3Record for scalable queries
tags: simpledb, s3, ruby, awsrecord
It's been a while since I wrote a technical blog post, so today I
thought I'd show some code I wrote last weekend to experiment with
Amazon's SimpleDB.
Background
Last summer, there was some discussion about using Amazon S3 for object persistence, rather
than file storage. One well-written approach is laid out in
this blog post. Unfortunately the nice code formatting and colors
seem to have disappeared, so use the Wayback Machine's version.
I implemented something very similar to what's described, and that's
what's serving as my database on several side projects, most notably
FeedSalon.
It's a great way to avoid worrying about a database if you're
running on EC2. Not only is all your data backed up automatically,
but you can also add more EC2 instances and they all just work
together (at least for my uses, which tolerate high latency and
eventual consistency).
The Problem
One of the main downsides of S3Record is that in order to do any
searching or sorting, one must iterate through every object, which is
completely unacceptable.
However, Amazon recently launched a new service called SimpleDB that
is designed to organize data using key-value pairs. So by extending
S3Record to store certain fields in a SimpleDB "domain", we can then
query those fields as desired.
Thus AWSRecord is born.
Implementation
I started with my version of S3Record, which is very similar to the
one presented in the blog entry linked above, so take a glance there
first.
All we're doing is picking a bucket, and for any given key, storing a
YAML representation of the object associated with that key. I think
the only significant interface change in my version is that the bucket
name is pulled into its own function for ease of changing:
def self.bucket
  return "#{ self.name.downcase }.caseybucket"
end
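To make the "YAML under a key" idea concrete, here's a minimal sketch with an in-memory hash standing in for S3. `SketchRecord`, `Person`, `FAKE_S3`, and `find` are hypothetical names for illustration, not the actual S3Record code, and the sketch serializes a plain hash of instance variables rather than the object itself.

```ruby
require 'yaml'

# Hypothetical in-memory stand-in for S3: bucket name => { key => YAML string }.
FAKE_S3 = Hash.new { |buckets, name| buckets[name] = {} }

class SketchRecord
  attr_accessor :key

  # Bucket name derived from the class name, as in the snippet above.
  def self.bucket
    "#{name.downcase}.sketchbucket"
  end

  def initialize(attrs = {})
    attrs.each { |field, value| send("#{field}=", value) }
  end

  # Dump the instance variables as YAML and store them under the key.
  def update
    data = {}
    instance_variables.each do |ivar|
      data[ivar.to_s.delete('@')] = instance_variable_get(ivar)
    end
    FAKE_S3[self.class.bucket][key] = YAML.dump(data)
    nil
  end
  alias create update

  # Load the YAML stored under the key and rebuild an object from it.
  def self.find(key)
    yaml = FAKE_S3[bucket][key]
    yaml && new(YAML.load(yaml))
  end
end

class Person < SketchRecord
  attr_accessor :name, :age
end

Person.new(:key => 'casey', :name => 'Casey', :age => 27).create
puts Person.find('casey').name   # prints "Casey"
```

Swapping the hash for real S3 calls only touches `update` and `find`, which is roughly why this style of persistence is so little work to scale out.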
So for example, if we were creating an S3-backed User class for
FeedSalon, we could write something like this:
require 's3record'
class User < S3Record
  attr_accessor :name, :age, :zipcode
  def self.bucket
    return "user.s3record"
  end
end
and then start creating, reading, and updating users on a whole bunch
of machines without much scaling effort at all.
Let's start working on AWSRecord, which will extend this to be
queryable using SimpleDB.
We'll pick a similarly named SimpleDB domain, which unlike an S3
bucket doesn't have to be globally unique.
require 's3record'
require 'aws_sdb'
require 'logger' # Logger is used when building the SimpleDB service below
class AwsRecord < S3Record
  def self.domain
    return "#{ self.name.downcase }.record"
  end
Let's leave the calculation of the queryable fields very flexible;
we're not trying to build ActiveRecord here. We can do this by just
providing a method that returns a hash, which the child classes can
override and calculate however they want.
  def query_attributes
    {}
  end
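As a quick illustration of how open-ended this hook is, here's a sketch of a child class that indexes derived values rather than raw fields. `HookBase` and `ArticleSketch` are hypothetical names, and the values are converted to strings on the assumption that everything stored in SimpleDB is compared as text.

```ruby
# Hypothetical base class exposing the same hook as above.
class HookBase
  # Default: nothing is indexed.
  def query_attributes
    {}
  end
end

class ArticleSketch < HookBase
  attr_accessor :title, :body

  def initialize(title, body)
    @title = title
    @body = body
  end

  # Index a normalized title and a computed word count instead of raw fields.
  def query_attributes
    {
      :title => title.downcase,
      :word_count => body.split.size.to_s
    }
  end
end

p ArticleSketch.new("Hello World", "one two three").query_attributes
```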
One of the issues with SimpleDB is the lack of libraries (it's still
in beta), so I picked the most mature-looking one (although I was
tempted to support the nytimes instead). Unfortunately, it doesn't
have quite the same interface as our S3 library, so we'll use a little
singleton pattern that we can swap out later if another library looks
better.
  # Lazily created SimpleDB client; substitute your own AWS credentials.
  @SDB_SERVICE = nil
  def self.sdb
    @SDB_SERVICE ||= AwsSdb::Service.new(Logger.new(nil), 'ACCESS_KEY', 'PRIVATE_KEY')
  end
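One Ruby detail worth knowing about this kind of memoization: an instance variable set inside a class method belongs to whichever class object received the call, so each subclass that calls the accessor lazily builds and keeps its own service object rather than sharing one. A minimal sketch, with hypothetical class names and a plain `Object` standing in for the service:

```ruby
class MemoBase
  # Memoize per receiving class: @service lives on the class object itself.
  def self.service
    @service ||= Object.new
  end
end

class MemoA < MemoBase; end
class MemoB < MemoBase; end

puts MemoA.service.equal?(MemoA.service)   # prints "true"
puts MemoA.service.equal?(MemoB.service)   # prints "false"
```

That's harmless for a handful of model classes, but if a single shared connection matters, the accessor would need to store the object in one fixed place instead.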
The meaty part is that on update or delete, we keep the SimpleDB
attributes in sync (create just uses the update method).
  def update
    super
    self.class.sdb.put_attributes(self.class.domain, @key, query_attributes)
  end
  def self.delete(key)
    super(key)
    self.sdb.delete_attributes(self.domain, key)
  end
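Here's the same write-through idea as a self-contained sketch, with two in-memory hashes standing in for the S3 bucket and the SimpleDB domain. `PlainStoreSketch`, `IndexedStoreSketch`, `OBJECTS`, and `INDEX` are hypothetical names. Note that the two writes are separate calls, so if the second one fails the stores can drift apart; the sketch doesn't try to handle that.

```ruby
# Hypothetical stand-in for the S3-backed parent class.
class PlainStoreSketch
  OBJECTS = {}
  attr_reader :key, :attrs

  def initialize(key, attrs)
    @key = key
    @attrs = attrs
  end

  def update
    OBJECTS[key] = attrs.dup
    nil
  end

  def self.delete(key)
    OBJECTS.delete(key)
    nil
  end
end

# Child class that mirrors selected fields into a second store on every write.
class IndexedStoreSketch < PlainStoreSketch
  INDEX = {}

  def query_attributes
    { :age => attrs[:age] }
  end

  def update
    super                                  # write the full object first
    INDEX[key] = query_attributes.dup      # then mirror the queryable fields
    nil
  end

  def self.delete(key)
    super(key)                             # remove the object
    INDEX.delete(key)                      # then remove its index entry
    nil
  end
end

IndexedStoreSketch.new('casey', :name => 'Casey', :age => 27).update
p IndexedStoreSketch::INDEX   # only the :age field is mirrored
```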
And finally the juicy part is that we can make queries which will
return keys that can then be fetched.
  def self.query_keys(query, max = nil, token = nil)
    self.sdb.query(self.domain, query, max, token)
  end
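As the transcripts further down show, the result comes back as a pair: a list of keys plus what looks like a continuation token (empty when there's nothing more). Assuming that reading is right, collecting every key across pages might look like the sketch below. `fake_query`, `PAGED_KEYS`, and `collect_all_keys` are hypothetical, and the fake simply uses an offset as its token; the real token format is whatever SimpleDB hands back.

```ruby
# Hypothetical data and fake query: returns [page_of_keys, next_token],
# where an empty token means there are no more pages.
PAGED_KEYS = %w[casey caseymrm nephew test1 test2]

def fake_query(max, token)
  start = token.to_s.empty? ? 0 : token.to_i
  page = PAGED_KEYS[start, max] || []
  next_start = start + page.size
  next_token = next_start < PAGED_KEYS.size ? next_start.to_s : ""
  [page, next_token]
end

# Keep asking for the next page until the token comes back empty.
def collect_all_keys(max = 2)
  keys = []
  token = nil
  loop do
    page, token = fake_query(max, token)
    keys.concat(page)
    break if token.to_s.empty?
  end
  keys
end

p collect_all_keys
```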
Let's test it by adding an extra method to our User class. We'll set
it up to query all three fields directly.
  def query_attributes
    {
      :name => name,
      :age => age,
      :zipcode => zipcode
    }
  end
Try creating a few:
$ irb
> require 'user'
=> true
> User.new(:key => 'caseymrm', :name => "Casey Muller", :age => 27, :zipcode => 93023).create
=> nil
> User.new(:key => 'casey', :name => "Casey the Great", :age => 27, :zipcode => 98102).create
> User.new(:key => 'nephew', :name => "Nephew", :age => 9, :zipcode => 93023).create
=> nil
> User.all
=> [#<User:0xb75cb1e4 @age=27, @key="casey", @created_at=Sat Mar 15 17:42:14 -0700 2008, @zipcode=98102, @name="Casey the Great">, #<User:0xb75c382c @age=27, @key="caseymrm", @created_at=Sat Mar 15 17:42:59 -0700 2008, @zipcode=93023, @name="Casey Muller">, #<User:0xb75b8314 @age=9, @key="nephew", @created_at=Sat Mar 15 17:44:58 -0700 2008, @zipcode=93023, @name="Nephew">]
Okay, how about a nice SimpleDB query?
> User.query_keys("['zipcode' = '93023']")
=> [["caseymrm", "nephew"], ""]
Looks like it works. Let's try something more complicated: say there's
a mature FeedSalon section, and we want users over 18:
> User.query_keys("['age' > '18']")
=> [["casey", "caseymrm", "nephew"], ""]
Uh oh, why did the 9 year old nephew come up? SimpleDB compares
everything lexicographically, as strings, so since the character 9
sorts after the leading 1 of 18, that record was returned. The
solution is to zero-pad numbers to a fixed width, so let's add a
couple of helpers.
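  # Zero-pad a number so that string comparison matches numeric order,
  # e.g. pad_num(9) => "0000000009".
  def self.pad_num(number, max_digits = 10)
    "%0#{max_digits}d" % number.to_i
  end
  # Note: with pad = true, every run of digits in the query is padded,
  # including digits in values that are stored unpadded.
  def self.query_keys(query, pad = true, max = nil, token = nil)
    query = query.gsub(/\d+/) {|n| self.pad_num(n)} if pad
    self.sdb.query(self.domain, query, max, token)
  end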
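To see the comparison problem and the fix in isolation, here's a small sketch using plain Ruby strings. `pad_sketch` is a hypothetical stand-in for the padding helper, and it only makes sense for non-negative integers that fit in the chosen width.

```ruby
# Zero-pad to a fixed width so that string order matches numeric order.
def pad_sketch(number, max_digits = 10)
  number.to_i.to_s.rjust(max_digits, '0')
end

# Unpadded strings compare character by character, so "9" beats "18".
puts '9' > '18'                         # prints "true"

# Padded to the same width, the comparison agrees with the numbers.
puts pad_sketch(9) > pad_sketch(18)     # prints "false"
puts pad_sketch(27) > pad_sketch(18)    # prints "true"

# The same effect shows up when sorting.
p %w[9 18 27].sort                      # prints ["18", "27", "9"]
p [9, 18, 27].map { |n| pad_sketch(n) }.sort
```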
And adjust our User:
  def query_attributes
    {
      :name => name,
      :age => self.class.pad_num(age),
      :zipcode => zipcode
    }
  end
We'll need to update each record so the padded numbers get written to
SimpleDB, then let's try the query again.
> User.all.each{|u| u.update}
=> [#<User:0xb74cf100 @age=27, @key="casey", @created_at=Sat Mar 15 17:42:14 -0700 2008, @zipcode=98102, @name="Casey the Great">, #<User:0xb74cc518 @age=27, @key="caseymrm", @created_at=Sat Mar 15 17:42:59 -0700 2008, @zipcode=93023, @name="Casey Muller">, #<User:0xb74c9930 @age=9, @key="nephew", @created_at=Sat Mar 15 17:44:58 -0700 2008, @zipcode=93023, @name="Nephew">]
> User.query_keys("['age' > '18']")
=> [["caseymrm", "casey"], ""]
So there you have it, S3Record with SimpleDB queries on selected
fields.
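One caveat with padding the whole query string: a blanket digit substitution also rewrites values for fields that are stored unpadded, such as the zipcode above, so a zipcode query would stop matching unless padding is switched off for that call. A hedged sketch of a more selective approach follows. `pad_everything` mimics the blanket substitution, while `pad_fields` is a hypothetical helper that only pads values for fields named as numeric, and it only understands the simple single-predicate form used in this post.

```ruby
# Blanket approach: every run of digits in the query gets padded.
def pad_everything(query, width = 10)
  query.gsub(/\d+/) { |n| n.to_i.to_s.rjust(width, '0') }
end

# Selective approach (hypothetical): pad only values of the listed fields.
# Handles predicates shaped like ['field' op 'digits'] and nothing fancier.
def pad_fields(query, numeric_fields, width = 10)
  query.gsub(/\['(\w+)'\s*([<>=!]+)\s*'(\d+)'\]/) do
    field, op, value = $1, $2, $3
    value = value.rjust(width, '0') if numeric_fields.include?(field)
    "['#{field}' #{op} '#{value}']"
  end
end

puts pad_everything("['zipcode' = '93023']")        # zipcode value gets padded too
puts pad_fields("['zipcode' = '93023']", ['age'])   # zipcode value left alone
puts pad_fields("['age' > '18']", ['age'])          # age value padded
```

The other way out is to pad every numeric field consistently when storing it, zipcode included, so the blanket substitution always agrees with what's in the domain.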
Conclusions
It's not really written the Ruby on Rails way; I think for that you'd
want to embed the SimpleDB information in the attributes directly. But
once the child class is written, I find it very easy to work with the
data in the actual application code.
Like I said, I whipped this up last weekend, but I'm not using it
anywhere, because of a couple of issues.
Price
Wherever I use S3Record, it's a very read-heavy application, and I use
a local cache on each machine. Make sure you read up on the S3
per-request charges as well as the SimpleDB machine hour and indexing
costs. The cost of this technique is predictable and grows with usage,
so if your revenue doesn't increase linearly with activity (or is
nonexistent), be careful.
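For a sense of how a per-machine cache cuts down on billable requests, here's a minimal read-through sketch. `ReadThroughCache`, its TTL parameter, and the `backend_reads` counter are hypothetical; the block passed in stands in for the actual S3 fetch, and this isn't a description of the cache used in practice.

```ruby
# Hypothetical read-through cache: serve from memory while an entry is fresh,
# and only hit the backend (and pay for a request) on a miss or after expiry.
class ReadThroughCache
  attr_reader :backend_reads

  def initialize(ttl_seconds, &fetcher)
    @ttl = ttl_seconds
    @fetcher = fetcher
    @entries = {}
    @backend_reads = 0
  end

  def fetch(key, now = Time.now)
    entry = @entries[key]
    if entry.nil? || now - entry[:at] > @ttl
      @backend_reads += 1
      entry = { :value => @fetcher.call(key), :at => now }
      @entries[key] = entry
    end
    entry[:value]
  end
end

cache = ReadThroughCache.new(60) { |key| "object for #{key}" }
3.times { cache.fetch('casey') }
puts cache.backend_reads   # prints "1"
```

The trade-off is staleness: other machines' writes won't be visible until the entry expires, which fits the eventual-consistency uses described earlier.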
Sorting
The biggest thing I wanted from SimpleDB was querying and sorting. It
turns out you don't get sorting... explicitly.
Actually, people on the forums have found a hacky workaround. Check it out:
> TestConsumer.new(:key => 'test1', :age => 15).create
=> nil
> TestConsumer.new(:key => 'test2', :age => 25).create
=> nil
> TestConsumer.new(:key => 'test3', :age => 35).create
=> nil
> TestConsumer.query_keys("")
=> [["test2", "nephew", "test3", "test1", "caseymrm", "casey"], ""]
> TestConsumer.query_keys("['age' > '0']")
=> [["nephew", "test1", "test2", "caseymrm", "casey", "test3"], ""]
Apparently if you add an intersection to the end with a test of
(field) > 0, the results reliably come back sorted. This is an
undocumented feature though, and requires a lot of work if you want to
also maintain a backwards index for reverse sorting, etc.
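One way a backwards index could work, assuming results really do come back in lexicographic order of the tested attribute, is to store a second attribute holding the complement of the padded number, so ascending string order on it means descending numeric order. A sketch with hypothetical helper names, using plain Ruby sorting in place of SimpleDB:

```ruby
SORT_WIDTH = 10

# Ascending key: plain zero-padding.
def pad_asc(number, width = SORT_WIDTH)
  number.to_i.to_s.rjust(width, '0')
end

# Descending key: pad the complement, so larger numbers sort first.
# Only valid for integers between 0 and 10**width - 1.
def pad_desc(number, width = SORT_WIDTH)
  (10**width - 1 - number.to_i).to_s.rjust(width, '0')
end

ages = { 'test1' => 15, 'test2' => 25, 'test3' => 35, 'nephew' => 9 }

ascending  = ages.keys.sort_by { |key| pad_asc(ages[key]) }
descending = ages.keys.sort_by { |key| pad_desc(ages[key]) }

p ascending    # prints ["nephew", "test1", "test2", "test3"]
p descending   # prints ["test3", "test2", "test1", "nephew"]
```

The cost is one extra attribute per sortable field, written on every update, on top of relying on behavior that isn't documented.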
If anybody else is actually using a technique like this in production,
I'd be very interested in hearing about it.