February 13th, 2009
I haven’t finished the code to implement this idea yet. I’m forced to show my hand since today is the last day in On Ruby’s CouchDB Contest .
A bloom filter is data structure with some very cool properties. It’s very compact and performant. It represents an approximation to set membership. That is, it represents a collection of elements without storing the elements. A oft quoted example is that of a dictionary—one can encode every word in the dictionary (taking up a lot of space) into a substantially smaller bloom filter. This comes with a cost, when a bloom filter says an element is not in the set, it’s not in the set. However, when a bloom filter says that an element is in the set, it could be wrong, but the expected probability of it lying can be ascertained from the size of the bloom filter and how many hashes it utilizes.
Properties of Bloom Filters
- Space efficient.
- Speed efficient (constant time adding elements, constant time query).
- Uses k hashing functions (where k is the user selected number of hashing functions to use, determines how accurate the filter is).
- Combinable. The union of two bloom filters of the same length and same k hashing functions is just the bit-wise OR of its bitfield.
Possible use cases
Examples: could encode every WikiWord into a bloomfilter providing a very fast client-side test to see if a word should be linkable.
ISBN’s in a library: I’ve been meaning to hack together a Amazon powered better book search for my university, but how to efficiently determine if a particular book in a search of all books is at my local library without tons of API requests? I’d like to try a bloom filter encoding all books locally available.
Trying to avoid a Denial-of-Service attack. Say you’ve generated some large key space to keep your users data private but also shareable, (like the large random URL’s Google Reader uses). Someone could start guessing, and they’ll be guessing for a long time. Yes you should rate-limit them, but each request could possibly result in a database query too—why not use a bloom filter to quickly determine if that long random key is likely in your database without the performance hit?
I need to look into the replication code in CouchDB, I don’t know how it’s being implemented—but one CouchDB instance could furnish another instance a bloom filter representing the set of all documents and their versions in the database.
Peter Cooper coded up a ruby bloom filter and the ruby bit field that the bloom filter depends on. The bit field is implement as an array of 32-bit numbers. To index a specific bit you first look up the associated number in the array, then the bit offset into that number.
Coming up with k hashing functions
If you’ve got a good hashing function, you can use that hashing function to create more hashing functions, as many as you need in fact. The way I’m currently planning on doing this is to take a hash of the first hash, then a hash of the second hash etc. until there is enough keying material to produce k indexes.
Where CouchDB comes in
CouchDB implements views or calculated columns using map/reduce.
The map function will take a document, specify the size of the bloom filter, and specify the number of hashes (k) to use in that bloom filter (or an estimate of how many elements you’ll add and the accuracy you’d like to attain). Then whatever property is desired out of the document can be iterated over and added to the bloom filter.
The map function will emit a JSON document with an array of numbers (representing the bit field), a property specifying which hashing function is being used (skein, md5, sha1 etc.) and the number of hashes calculated for each item.
November 14th, 2007
MountainWest RubyCamp 2007
Saturday, November 17th 2007
1:00 PM – 4:00 PM
Salt Lake City Library
Level 4 Meeting Room
The organizers have asked that if you’re coming to put your name on the wiki page
I’ve never been to a “Camp” before but I hear they are like “unconferences”. I go to the Internet Identity Workshop which is an unconference, and the results have been fantastic. Those who come are actively involved in the discussion, it’s quite refreshing.
I don’t know if I will present or not. I could lead a discussion and get people up to speed with the digital identity landscape—OpenID, CAS, InfoCard and some secret sauce :)
If you’re interested but a little put off that it is specifically about “Ruby” you should come anyway. Ruby is a good excuse to get together and rub elbows, it’s not an excuse to exclude interesting people or ideas.
October 8th, 2007
To learn about distributed version control systems (DVCS), I’ve been using bazaar on some small projects . I know these DVCSs are supposed to really shine in multi-developer environments spread out across the world but I have found that they offer an incredibly low barrier to entry for a single developer on his own box. This is the typical situation for a college student in computer science and many others as well.An example could help here. I’m taking a networking class where a bit of code is provided in a framework and we need to use the code for our programs.
cd lab1 bzr init bzr add bzr checkin -m "Initial import"
Done. This folder is now under version control.
When version control is this easy to setup and use, there’s no excuse for not using it.
October 2nd, 2007
Some time ago I went on a church mission to Brazil for two years. I didn’t know anything about Brazil or Portuguese. We have a Missionary Training Center where I was inundated with non-stop Portuguese lessons for two months. It moved so fast it was hard to remember everything but it was a good preparation. I went from no knowledge of Portuguese to two months later being dropped off in the small town of MaracajÃº, Mato Grosso do Sul, Brazil. The other missionary working with me was Brazilian leaving me three hours away from anyone who could understand me. I thought I was learning a lot in the classroom but have since found that the pace of learning doesn’t even compare to complete immersion. I don’t like to go hungry. I learned how to speak.
After all is said and done, more is said than done. —Aesop
Reading and talking about it aren’t enough. You’ve got to do it yourself.
I gave myself a self-imposed immersion project. I wanted something small enough that I could have some quick success (to entice myself to stick with it) but still be big enough to be interesting.
When the term Ajax first started gaining momentum Jamis Buck blogged a piece about his first Ajax app , a simple word game. The back story was that there was a technical discussion at his work about the merits of asynchronous XMLHTTPRequest versus synchronous XMLHTTPRequest. His little word game was a demonstration to his coworkers showing that asynchronous was the way to go, so that it would not lock up the browser. It was a fun little game that both me and my wife enjoyed playing now and again.
Here it is, Devlin’s Word Reorder Game.
Some technical notes:
This is implemented using the excellent jQuery library. The tasteful effects courtesy of the Interface library. The dictionary is just all the four letter words contained in the file
/usr/share/dict/words on my Mac. I know, a lot of words you think should be in there aren’t (especially the short pluralized ones like tips, huts etc.). If you can get me a better dictionary I’ll glady switch.