Jasim A Basheer  

@jasim_ab.   I build stuff for the web.

January 31, 2014

Getting started with AngularJS in Rails

I have found that emails are how I get started with writing anything substantial that people find useful. The following post originated as an email to the London Ruby Usergroup when someone asked for pointers in integrating AngularJS with Rails. It was written in a hurry and is presented without any modification. Read on to listen to what I have to say about building an AngularJS project within Rails. I welcome your thoughts.

General

  • Use Coffeescript.

  • Start with Egghead.io. Read AngularJS by Brad Green, Shyam Seshadri. Spend a few hours and skim it cover-to-cover at least once. Angular documentation on the web is quite fragmented; a book is the best way to begin learning Angular.

  • If you aren't familiar with Underscore.js already, go through it. Add it to the project as soon as you begin. You'll need it.

  • Drop gem 'ng-rails-csrf' into your Gemfile to make Angular AJAX requests play well with Rails CSRF protection.

Directives

  • Some material on the web might lead you to think that a deep understanding of Directives is crucial to ‘get’ Angular. But for a beginner, the important thing is to just have fun doing things the Angular way (it is loads of fun, let me assure you). The basic directives Angular provides (eg: ng-repeat and ng-show) can go a surprisingly long way. Worry about transclusion, $compile, $digest and other arcane incantations only after you get comfortable with the basics.

  • One reason for you to write a directive is when you need to use existing widgets. A lot of popular widgets already have Angular wrappers. For example, we use the select2 wrapper for multi-select.

    There is also AngularStrap with a bunch of nice components that wraps over Twitter Bootstrap. The v1 (http://mgcrea.github.io/angular-strap/1.0/) version has Datepicker and Timepicker which isn't available in the new version yet.

  • You would hit a point when off-the-shelf directives won’t be enough and you'd want to do something that requires DOM manipulation. That would be a good time to start digging into Directives. You can also start exploring Directives when you start seeing patterns in your controllers that can be extracted out into reusable components.

Serialization

  • We use ActiveModel::Serializer. It is a safe choice.

  • Pay close attention to what you include in each serializer. Here is the obvious thing: you can put a bunch of has_many in your User model, and if you keep doing that for other objects, a render json: User.all will end up sending your entire db graph to the client. 

We implement a base serializer for most entities with some basic fields and no associations. When we need more data/associations, we sub-class them for that specific use case. Discourse follows a similar style - https://github.com/discourse/discourse/tree/master/app/serializers

  • Try not to do any decoration on the server side; it is tempting with all the view helper goodness Rails provides, but then you are splitting logic between client and server and it causes all sorts of headaches. You can use Angular filters as client-side helpers on some occasions. eg: {{ start_time | date }} prints out nicely formatted dates. 

Forms, validations and POJO form objects

  • Angular gives you validation attributes like ng-required, ng-minlength, ng-maxlength etc. on all form INPUTs. For simple use cases, this is the best way to go. But for a complex form, we ended up building an ActiveModel-like POJO Form Object to hold validation information, and for doing some slight transformations on the data before submission. We bound this object to the scope and removed a lot of clutter from our controllers that had begun to get bloated. We have to see how this pans out in the long run; I can give code snippets if you wish.

  • When I introduced Angular to our Rails project that was already under development, I tried to minimize the amount of change required by trying to keep form POSTs as they were.

    <input type='text' ng-model='post.title' name='post[title]'>
    <input type='submit' />
    

It worked for the most part, but we ran into some edge cases that wasted a lot of time. That was when I learned to let go and fully embrace the Angular way. We switched over to something like this:

<input type='text' ng-model='post.title' name='post[title]'>
<button ng-click='submitForm()'>

And the controller:

angular.module('myApp').controller 'postCtrl', ['$scope', '$http', '$state', 'flash', ($scope, $http, $state, flash) ->
  $scope.submitForm = ->
    $http.post(
      Routes.posts_path,
      $scope.post           # this is the ng-model hash we bound the input boxes to.
    ).success((data, status) ->
      $state.go("posts.show", {post_id: data.post.id})
    ).error((data, status) ->
      if data['message']
        flash.error = data['message']
      else
        flash.error = "Could not create Post. #{data}"
    )

# `flash`  is from https://github.com/wmluke/angular-flash.
# $state.go is ui-router. https://github.com/angular-ui/ui-router. More on that later.
# we use JSRoutes to access Rails routes from JS. Quite useful. https://github.com/railsware/js-routes/

The above code isn't ideal. You will be better off putting the $http.post inside an Angular service. This looks like a good post describing that: http://sravi-kiran.blogspot.in/2013/03/MovingAjaxCallsToACustomServiceInAngularJS.html.

Rails+Angular

  • Go all the way in. Don’t try to keep a mix of Angular and non-Angular pages. Except for Devise, all our pages are rendered client-side. We don't have to worry about SEO since this is an internal app.

  • Use ui-router, not the one that comes with Angular. The ui-router README could be confusing, but like most of Angular documentaiton, it becomes obvious in hindsight if you persevere.

  • We keep everything together in the same Rails project. Here is a rough structure:

    routes.rb:

    root ‘base#angular’
    resources :posts, only: [:index]
    

    app/views/base/angular.html.erb:

    <div ng-app='myApp'>
      <div ui-view></div>
    </div>
    

    posts_controller.rb:

      class PostsController < ApplicationController
        respond_to :json
    
        def index
          render json: Post.all, root: false
        end
      end
    

    app/assets/javascripts/routes.coffee.erb:

    angular.module('myApp').run ($rootScope, $state, $stateParams) ->
      $rootScope.$state = $state
      $rootScope.$stateParams = $stateParams
    
    angular.module('myApp').config ($stateProvider, $urlRouterProvider) ->
    
      # The default route
      $urlRouterProvider.when("", "/posts/list")
    
      $stateProvider
        .state("posts",
          url: "/posts"
          controller: "postsRootCtrl"
          template: '<div ui-view></div>'
          abstract: true
        ).state("posts.index",
          url: "/list"
          controller: "postsIndexCtrl"
          templateUrl: "<%= asset_path('posts/index.html.slim') %>"
          resolve:
            postsPromise: ($http) ->
              $http.get(Routes.posts_path())
        )
    

    app/assets/javascripts/posts.js.coffee:

    angular.module('myApp').controller 'postsCtrl', ['postsPromise', '$scope', '$http', (postsPromise, $scope, $http) ->
      $scope.posts = postsPromise.data
    

    app/assets/templates/index.html.slim

    div ng-repeat='post in posts'
      h1
        | {{post.title}}
      p
        | {{post.body}}
    

Quirks and bugs

  • It is worth keeping in mind that in JS {} == {} is false, since comparisons are by object identity, not by object value. You can do explict comparison using Angular.equals when you need it. But in some cases Angular does an implicit comparison, for which it doesn't use Angular.equals. For example, if you try loading a page with a bunch of pre-selected radio-buttons, if their values are JS Objects (Hash), Angular won't pre-select them since the comparison will fail. This is not an issue if your values are strings/numbers.

  • JS treats the keys of all objects as strings. a={1: "Hello"}; _.each(a, (value, key) -> console.log(typeof key)). If your server sends Hashes whose keys are numeric ids, they'll end up in JS with string ids.

  • A quick and dirty way to inspect an Angular model is to drop this in your Angular view template:

    pre
      | {{ post | json }}
    
  • There is also Batarang, a Chrome plugin which lets you inspect your scope interactively. But I've found it to cause my pages to misbehave in certain cases. I these days use angular-chrome-debug, which is quite light-weight. I load this script in my app in development env so that I don't have to go thru Chrome Snippets everytime.

  • Angular has Angular.copy for deep copy when you need it - https://github.com/angular/angular.js/blob/master/src/Angular.js#L725.

Well, that's mostly it. I am generally liking building rich UIs with Angular and wouldn't go back to building for the web without using data-binding.

Please keep in mind that the structure I described is working well for our app, which is not meant for public consumption. You might want to look at keeping your Angular and Rails project separately and utilize JS build tools like Bower, Yeoman and Grunt if your app needs that.

August 16, 2013

Ruby printf debugging made easier

Here is a debug method that can save you some keystrokes during a printf debug session:

    # Prints a set of variable names and their values. 
    #
    # Example:
    #     name = "Jack"; age = 10
    #     debug binding, %w(name age) # => "name: Jack, age: 10"
    #
    def debug(original_binding, vars)
      puts vars.map {|var| "#{var}: #{original_binding.eval(var)}"}.join(", ")
    end

May 30, 2012

15 Things for a Ruby Beginner

The following is a post I had recently sent the Bangalore Ruby User Group. It has been slightly modified to address a larger audience.

There were many Ruby beginners in last week's meetup, and the common question we heard was 'after the very basics, what next?'

The best way to learn Ruby best practices is to pair with an experienced dev; the way I learned was by inheriting a reasonably small, but well-written codebase from an amazing colleague. In the absence of either, here is a checklist of 15 things (since 'N things that you need to know about X' is the in-thing these days!) that I'd recommend a Ruby beginner to consider:

1. The very basics

Our very own rubymonk.com has a Ruby primer which was written for exactly this purpose; we open our inbox everyday to gushing feedback from people who've found it to be a great way to learn Ruby. Try it and let us know how it goes!

tryruby.org also has a basic introduction to Ruby, and has been around longer. Edgecase's Ruby Koan is an interesting concept, and covers the language both in breadth and depth, and is a very strong recommendation. It should take you anywhere between 5-10 hours to finish all of the Koans. Do try it!

I have heard good things about Learn Ruby the Hardway, but haven't tried it out myself. Okay, I just skimmed through portions of it and I'm not really happy - LRTH seems to be mostly a line-to-line translation of Python code to Ruby. It uses  'while' loop in places where equivalent Ruby idioms (Enumerables) would have made more sense. Also there is no mention of blocks, metaprogramming and duck-typing, which pretty much is a deal-breaker for me. But to be fair, the target audience for LRTH seem to be non-programmers for whom the concept of loops and objects would be new, and for them it does the job very well.

Wait, have you read Why's Poignant Guide to Ruby? If this is the first time you're hearing about why the lucky stiff, read this amazing piece on _why by the Smashing Magazine. And definitely read The Poignant Guide: http://mislav.uniqpath.com/poignant-guide/. It is full of cats, foxes, chunky bacon, cartoons that doesn't always make much sense, space travel and what not. This was one of my first introductions to the Ruby community, and the guide lent the language and the community a fun, quirky and happy aura. You may or may not take away much Ruby knowledge from the guide - I couldn't when I read it for the first time. However you'll definitely understand some of the quirkiness and philosophies that influence the Ruby community. I am a huge fan of _why, and here is my favourite quote: 

when you don't create things, you become defined by your tastes rather than ability. your tastes only narrow & exclude people. so create.

2. The ecosystem - RVM/rbenv, RubyGems, Bundler, Git and Github

I think all of these tools are mandatory for being a productive Ruby programmer. You'll encounter them soon enough:

- RVM/rbenv: these are tools used to manage multiple Ruby versions on the same machine. I've been using RVM without complaint for quite a while, though there are people who will go up in arms against RVM for various reasons. As a beginner, you should be comfortable with either.

- RubyGems: a gem for anything, a gem for everything. If you are using RVM, it will install RubyGems by default for you. http://docs.rubygems.org/read/chapter/4

- Bundler: You'll learn it easy enough if you are using Rails. But even for non-Rails projects, Bundler is now the de-facto tool to handle gems and dependencies. It is one of those tools that when you see for the first time you would wonder how you ever lived without it.

- Git: You are a git if you don't use git yet. If you are not even using any version control at all, good for you - there aren't bad practices that you need to unlearn. If you are on SVN, or God forbid CVS, jump now. http://git-scm.com/book/en/Getting-Started-About-Version-Control

- Github: You have a Github handle, right? 'nuff said.

3. Editor

I don't care. Pick one, use it well. If you're on Vim and is on Insert mode all the time, then use Notepad instead. It would be more productive. Learn your editor.

Here is a list of editors/IDEs people generally use for Ruby development:

- Sublime Text
- Textmate
- RubyMine (My favourite, but needs a lot of memory and CPU)
- Vim
- emacs
- Aptana RadRails (recently saw a couple of people using it. don't know how good it is)
- Redcar (I've used it very occassionaly, am yet to see someone using it as a primary editor)

If you are using Sublime Text, install and use its corresponding Ruby package. Ditto for Textmate.

If you are on Vim, using the right set of plugins is a requisite to be productive. There is the popular https://github.com/carlhuda/janus and Srushti's https://github.com/srushti/vim-get which I use when I work with Vim. Even if you don't go for these plugin distributions, spend enough time to find the right plugins for Ruby development.

I don't know about the best plugins for emacs, but there are people who use emacs to develop in Ruby. Even Matz uses emacs; search and you shall find.

4. Ditch 'while', 'for' and array accumulation

Read this: http://martinfowler.com/bliki/CollectionClosureMethod.html

An apparent sign of a programmer who does not know Ruby well enough is code that uses 'for' and 'while' loops for iteration and accumulation. I am hardpressed to remember occasions where I had to use them instead of the Enumerable methods #each, #map, #select, #inject, #reject and #detect. Learn these methods, chew on them, and use it everywhere!

(infinite loops are almost always written using the loop do..end construct though. but how often do you write infinite loops anyway?)

5. Hash

At the time when I started writing Ruby, the languages that I had written in for a reasonable period of time before were CA-Clipper, Borland Turbo C and some VB 6.

The first two did not have a hash, associative array or dictionary - whichever you prefer to call it. As to VB6, the only thing I can remember is DataGrid and ADODB. Ah, the failed promises of drag and drop programming!

So Hash was a revelation and I started using it anywhere and everywhere. Do you want to build a CRUD app to manage customer info? Forget databases, I'll build a Hash and serialize to and deserialize from a YAML file. There were even more crimes committed using Hash that I dare not mention here. You would have gone through enough exercises that uses Hash when working through RubyMonk or Ruby Koans. But if you haven't, make sure you understand Hashes well enough. Specifically:

- iterating over a hash
- assigning default values for undefined keys in a hash
- Hash#keys and Hash#values for extracting just the keys and values
- In Ruby 1.8 Hashes are un-ordered: ie, you can't rely on the ordering of the hash to be same as the order in which you added elements. In Ruby 1.9, a Hash is sorted on the basis of order of insertion.

6. JSON and YAML

These are not Ruby specific concepts, but find great use in the ecosystem. Know them well, they'll come in handy.

7. Understand Immutability and how Ruby passes object references around

This has slightly got to do with the above point - all the Enumerable methods are immutable, and it is a good introduction to how functional Ruby veer towards immutability.  Immutability is more of a good programming practice than a Ruby specific idea - it helps you write clean predictable code, leave aside concurrent programming and race conditions. A method that mutates its parameter is a vile creature, don't bring it forth into existence.

If you come from a C programming background, building new objects willy-nilly would be a little hard to digest. So much memory put to waste! I remember reading somewhere that programmers who use high level languages leave a higher carbon footprint because their code is inefficient. I leave you to ponder over it.

For understanding some quirks around Ruby's immutability and interesting effects of how Ruby passes object references around, figure out where Array#clone is used. I remember wasting many a debugger breakpoint during my early days of Ruby because I didn't realize this. Don't let that happen to you! Understand the difference between a shallow clone and a deep clone. Even better, go write your own deep_clone routine! (limit yourselves to objects that can have strings, numbers, arrays and hashes)

Also read: http://ruby-doc.org/docs/Newcomers/ruby.html#objects

8. Ruby's Object Hierarchy

# All objects are instances of the class Object. 
"a string".is_a? BasicObject         # true

# All classes are instances of the class Class.
String.is_a? Class                   # true

# Class is a subclass of BasicObject.
Class.is_a? BasicObject              # true

# Class is not an instance of BasicObject
Class.instance_of? BasicObject       # false

# BasicObject is an instance and a sub-class of Class
BasicObject.is_a? Class              # true
BasicObject.instance_of? Class       # true

Okay, I lost it. It is pretty crazy: http://stackoverflow.com/questions/4967556/ruby-craziness-class-vs-object. As a beginner, you wouldn't need to understand the nitty-gritties. I've been programming in Ruby for about three years now, and it still confuses the heck out of me.

For now it is safe to understand that BasicObject is usually the root object of all objects in Ruby. And everything in Ruby is an object. This has a very useful side-effect (try this in IRB):

"some random string".methods - Object.methods
Also,
Array.new.methods - Object.methods

The above commands will show you methods that are specific to just strings and arrays, excluding all methods that are always present in any Ruby object (inherited from Object -like instance_of, is_a? etc.).

You might have noticed that the '-' operator gives you the difference between two arrays. Whenever you need a general purpose method and wonder whether Ruby comes with it, just try some plausible syntax in IRB. You might be surprised at what you find.

Even though Ruby lets you Object Oriented and procedural code, the language leans toward OO. Ruby treats even methods as objects:

"some string".method(:length) # gives you an object of the Method class.
The method object can be asked to run by invoking the 'call' method on it.

9. Creating your own Objects

Did you notice that the title wasn't 'Creating Classes'. That was one of the most useful advices I've ever received: Always think in terms of objects - not classes. Thinking in terms of Classes can subtly make you evolve your design upfront. Don't. Let your objects guide you in how your class definition should look. As a rough analogy, when building a home, the blueprint is valuable only as a reference for building the actual home. You imagine what your home should look like and draw a blueprint accordingly, not the other way round.

Start with sparse classes, add methods and attributes as your objects demand it. Srushti puts it better: Imagine you're an instance, and think about what you want to do and how you want to do it. You don't want to give up your secrets (encapsulation). You don't ask other people for information so you can do their work for them, you just tell them to do stuff for you (tell, don't ask)

Ruby has a very simple syntax for defining classes and building objects. If you come from a Java/C# background, it'd be the first thing you look for. But even if you are a die-hard procedural ninja, trust me, thinking in terms of objects will help you write better programs, tackle complexity and be a more capable programmer.

So, what are the things that are specific to Ruby that you need to be aware of?

- Message Passing. "abcd".length is in fact "abcd".send(:length).
- Module vs Classes (hint: they're very similar, but you can't instantiate a Module)
- Mixins (Ruby's answer to multiple inheritance and the greatest thing _before_ sliced bread)
- attr_reader, attr_writer, attr_accessor.
- instance methods and class methods
- instance variables and class variables.

And we all know that you don't use class variables unless you have a very good reason. Class methods aren't that bad, but are usually a smell. Whenever you find yourselves writing a class method, take a step back and make sure it can't be rephrased as an instance method, perhaps in a child object?

There is a lot more to OO, some less specific to Ruby. As you go deep into the rabbit hole, ponder over these blanket statements:

- Primitives (Hash, Array etc.) are evil! Build objects.
- Inheritance is evil! Use Composition.
- Conditions (if..else, switch..case) are evil! Use Polymorphism.

10. Ruby is interpreted. It is malleable. Use that to your advantage.

Interpreted programs are almost always slower than native code (which includes JIT). By choosing to use such a language, you are accepting a compromise in the speed/efficiency of your programs. But this gives you a great advantage: the flexibility to change your code at runtime. Though we can't claim 'code is data, data is code' like those hipster LISPers do, there is tremendous power in the dynamism (no reference to type systems) of Ruby. Learn it, use it, change the world!

I had briefly mentioned the 'send' method that is available for every object in Ruby:

   "abcd".send(:length) 
is same as
   "abcd".length

That means you can do things like this:

puts "Hi, which method do you like to invoke on a string today?"
method_name = gets.strip
puts "a random string".send(method_name)

Did you see that? Unlike fully compiled languages like C/C++, Ruby lets you call arbitrary methods during runtime! (you can pass arguments to the method as parameters to the 'send' method)

Leave aside calling arbitrary methods, running arbitrary code during runtime is a breeze:

puts "What code do you want to run today, dear sir?"
arbitrary_code = STDIN.read         # press ctrl+d to stop input
eval(arbitrary_code)

Try it, type in some short valid Ruby code and see it in action.

Now that you know 'eval' exists, forget about it. It is too dangerous to be almost ever used. It is unsafe and unscoped, but there are better things to achieve similar and useful results. The point of this exercise though was to see Ruby's dynamic nature in action. Since Ruby is interpreted, there is no limitation on what can be done during runtime. This can be used to great good as we will see in Metaprogramming.

11. Metaprogramming

Metaprogramming in Ruby more or less gives you ways to create/remove/redefine methods at runtime. If you have used Rails, you would have seen that you would write something like

class User < ActiveRecord::Base
end

and magically, the User class gives you methods like user.name, user.find_by_name, user.find_by_name_and_id. Depending on the fields in the database, Rails defines methods for you to use. This uses Metaprogramming where Rails defines the methods at runtime after consulting the table schema.

(talking about 'magic', usually when someone complain about 'magic' in Ruby code, she is most probably referring to some sort of metaprogramming in the code)

Metaprogramming is one of Ruby's most powerful concepts (anything borrowed from FP is yummy!), but it is open to use and abuse. They say that someone who knows metaprogramming well enough, but not enough to know where not to use it, is a danger to himself and society. The internet is rife with discussions around it and you'll find no shortage of flame wars, opinions and thankfully, documentation.

These are the methods you would want to look up to get a decent overview of metaprogramming in Ruby:

- define_method
- method_missing
- instance_eval
- class_eval

I would also recommend Yehuda Katz's excellent explanation of Metaprogramming by relating it to the context of 'self': http://yehudakatz.com/2009/11/15/metaprogramming-in-ruby-its-all-about-the-self/

Metaprogramming is a bit advanced, and if you don't understand all or any of it the first time, don't worry. Come back and take a look again later. Rinse and repeat. It is an acquired taste, give it time!

12. Closures (Blocks, Lambdas et al.)

Blocks are my favourite. They move mountains. Rather, they let you write beautiful DSLs when coupled with the right dose of metaprogramming. Have you seen factory_girl's syntax? It is an unholy mix of method_missing and 'yield'.

Factory :user, aliases: [:author, :commenter] do
	first_name		 "John"
	last_name		 "Doe"
	date_of_birth { 18.years.ago }
end

It is not really hard to build a DSL that reads like this, and there is no better resource to learn all of this than http://rubysource.com/functional-programming-techniques-with-ruby-part-ii/. The first part of that series looks into the functional and immutable aspects of Ruby, and is also a recommended read: http://rubysource.com/functional-programming-techniques-with-ruby-part-i/

13. Styleguides

Whenever you are in doubt, or the self becomes too much with you, go read the Ruby style guides.

Github's simpler style guide: https://github.com/styleguide/ruby

The comprehensive one: https://github.com/bbatsov/ruby-style-guide

14. Simplicity is virtue

Knowing what constructs to use where is a matter of knowledge and experience. Every approach has trade-offs in terms of readability, maintainability and efficiency. The battle between these have been the recurring theme in the battles programmers fight in their heads for years. Knowing the the trade-offs will help you make more informed decisions, but it might not always be enough. Some things need to be tried, tested and failed, and that is fine.

But be vary of Premature Optimization. When you have a choice between clever, short and maybe faster code Vs longer but readable code, go for readability.  Ruby makes it easy to write really bad code that people would fear to touch with a long pole. It also lets you write  beautiful and concise code. When you contemplate between the two, remember the joke about the psychopath who'll inherit your codebase, knows where you live, and pings you from your local network! The choice is yours.

15. None of this matters

If you are overwhelmed by this document or any links referenced from here, just ignore it. Remember the 10,000 hours rule. Happily go about writing code the way you know best! And write a bit more code. Try to pair with someone who knows things a bit more. Go read some well-written Ruby code from Github. Then come back and see what you've learned.

None of this is rocket science, but it takes time and practice for concepts to sink in, and that is just fine.


Have further questions? There are tons of resource on the internet to answer your questions!

Join one of your local Ruby Usergroups. The Ruby community is extremely helpful and accomodating towards newbies. Check this page to locate a usergroup near you: How To Find Ruby User Groups

Participate in the usergroups, ask your questions. Also hop on to Ruby's IRC channel #ruby-lang on Freenode. Irrespective of the forum, just make sure that you give enough context about your question to help others understand your problem. If you haven't read ESR's "Ask questions the smart way" yet, *this* is the time. Go read it now and get enlightened on the ways of the interwebs! 

And remember to have fun! In Matz's own words:

"For me the purpose of life is partly to have joy. Programmers often feel joy when they can concentrate on the creative side of programming, So Ruby is designed to make programmers happy."
Happy hacking!

October 26, 2010

B-Trees, large volumes of data and branching factor.

B-trees can be slow.

I've noticed that after a few million keys, almost all b-tree based databases starts to get slower. This applies to both Relational/ACID-compliant as well as the NoSQL DBs. I tried Harbour's BTree implementation, TokyoCabinet, SQLite and MySQL. All of them started to slow down at various stages and the maximum I could go without an unacceptable decrease in performance was close to 40 million keys with TokyoCabinet in my 2GB RAM laptop.

Large branching factor = Better b-tree performance.

A binary tree is a tree which has two children per node, and has a time complexity of O (log2 N) for any search operation. B-tree is a variation of binary tree in which there are M number of children per node. It thus achieves a time complexity of O (logM N) for each search operation. M is called fanout or Branching factor. A large branching factor is the reason why the b-tree is a fast data structure.

When the number of keys (N) crosses a certain bound, the height of the tree will increase and every search/insert might need to traverse one more level. If this level is not in memory, the operation will need a disk access. Since disk access is expensive, this leads to bad performance. Thus,

Efficiency of a B-tree ∝ Available memory / F(N).

F(N) is a function on the number of keys and gives the height of the b-tree.

This leads us to the branching factor - given enough memory, increasing the branching factor will make each level of the tree hold more nodes and thus reduce the b-tree's height. Fewer levels means fewer disk access (if any) and better performance.

Beyond a certain level, increasing the branching factor = decreased performance

In an ideal world, to cope with larger values of N, we could simply increase the branching factor and individual node capacity to maintain the performance of the B-tree.

But the b-tree structure becomes a burden if the branching factor is so big that the first few levels cannot be completely held in memory. At this stage, accessing each level can trigger a page-fault causing disk swapping and very poor performance.

Two major concerns here are :

  • Limited Memory: The b-tree parameters are limited by the available memory. Since all search starts from the top, we need the first level of nodes to be completely in memory. More the number of nodes thats fits in main memory, the lesser the need for disk seeks. But scaling physical memory is an expensive proposition and infeasible beyond a point.

      Solution: Sharding

      Beyond splitting db load, there is a huge advantage with sharding - sharding a b-tree storage enables us to limit the number of keys that a single b-tree need to hold. Thus each b-tree will produce optimal performance.

      Sharding the database along with its b-tree indices into different physical machine is one of the most effective scaling solutions for growing data volumes.

      MongoDB uses this approach, and seems to be scaling very well judging from a few instances of MongoDB in production use.

  • Databases need constant tuning: Changing the database system parameters (branching factor, shards, indexes) etc. usually require downtimes. Having scaling related down-times is a major pain and we are still yet to see a db that tunes itself to increasing data volumes.

      Solution: Cache-oblivious structures and algorithms

      The implementation of most algorithms that deal with large volumes of data makes advantage of the caching characteristics of the target machine. For example in B-trees, the branching factor should be such that a complete level of node should completely fit in the main memory. Cache-oblivious algorithms remove the dependency on the cache characteristics of the platform and implement data structures like b-tree to perform well in diverse memory architectures.

      Prof. Charles E. Leiserson of MIT heads the Supercomputing technologies group which is doing pioneering research into cache-oblivious algorithms. There are quite a few papers on this topic at the MIT SuperTech website.

      The primary authors of the cache-oblivious B-tree papers at CSAIL, Michael A. Bender, Martin Farach-Colton and Bradley Kuszmaul have founded Tokutek, which built a product based on cache-oblivious b-trees - TokuDB an add-on database engine to MySQL that implements fractal tree which is claimed to perform orders better than either InnoDB/MyISAM at high data volumes. Though shipped with MySQL, the implementation is not open-source.

Conclusion

B-trees are one of the most used data-structures in computing. However they are not a silver-bullet for every kind of data storage and retrieval problem. Developers still need to go down a few levels of abstraction and tune the database to gain the most from the DB. Sharding is a viable solution but it requires the application to be tailored to shard the dataset according to properly identified key fields. Cache-oblivious b-trees are yet to become widely available in mainstream databases.

It has been about 30 years since the first commercial RDBMS (Ingres) by Michael Stonebraker became popular, and the relational model of database with b-tree as its preferred storage structure chugged along. However in recent years, the demands of realtime web-scale applications have paved way for more research and changes in the database landscape.

September 29, 2010

C dynamic memory allocation gotcha

This is a common beginners mistake I've seen: functions returning pointers to automatic variables

Before we dig in, consider this:

char result[10];  // Is this same as the next one?
char *result = (char *) (malloc(sizeof(char)*10));  

The first declaration creates an automatic variable. Its scope is restricted to just the function where you declare it. You are not supposed to use the variable or any pointer to the variable after the function terminates.

The second is a pointer to a character variable. The pointer now holds the address of the memory allocated using the malloc function.

Consider this function:

char *answer_to_everything() {
   char result[3];
   strcpy(result, "42");
   return &result;
}

For those who haven't yet gone through dynamic memory allocation hell with C, this would look innocuous. It even works sometime! But consider the scope of the variable result: it is declared inside answer_to_everything() and as soon as it ends, C considers any memory statically allocated for any variable inside the function including result to be disposable. It does not care whether you've hoarded a pointer to its memory address for later use.

What if you need to operate on that memory even after its scope is over? That is why we've malloc.

The right way to do this would be to explicitly allocate memory and free it yourself.

char *answer_to_everything() {
   char *result=(char *) malloc(sizeof(char)*3));
   strcpy(result, "42");
   return result;
}

Here since you've explicitly asked C to allocate memory for you, it won't meddle with your affairs. Your memory is your own and you've the responsibility to free it when you no longer need it.

If you were using statically allocated memory, C would have freed it for you after its scope ends. Here however we have not 'free'd the allocated memory. What happens then? If you don't explicitly call free on allocated memory, you create a memory leak.

Have you noticed that the standard C string functions (strcpy, strcat etc.) require a destination variable where the result will be stored? None of these functions allocate memory themselves. They ask the caller to provide them with a buffer with enough memory allocated. Why couldn't those functions allocate the required memory and save us the pain?

Here is why:

void meaning_of_life() {

        printf("The ultimate answer to life, the universe "
                "and everything is %s", answer_to_everything());
        // answer_to_everything would allocate the required 
        // memory and give us a pointer to the result.
        
        return(void);
}

We know that answer_to_everything() allocated some memory explicitly. But we don't have even the address to the allocated memory. Who is going to free it? Instances like this, my friend, is the reason why we don't let functions to allocate memory but evade the responsibility of freeing it.

The rule of thumb is: any function that dynamically allocates memory has to free them itself. If your function needs access to buffer memory (like strcpy, strcat et al.), you ask the caller to allocate them and give it to you.

So finally we have the ultimate function that gives you the answer to the life, universe and everything else:

char *answer_to_everything(char *result_holder) {
   strcpy(result_holder, "42");
}

void meaning_of_life() {

        char *result=(char *) malloc(sizeof(char)*3));
        answer_to_everything(result);
        
        printf("The ultimate answer to life, the universe "
                "and everything is %s", result);
                
        free(result);
        return(void);
}

August 27, 2010

A Quine in Ruby

A very short Quine in Ruby.

A Quine is a program that prints itself. Though a little tricky at first, it is a rather simple puzzle.

q="q=#; puts q.sub('#',q.inspect)"; puts q.sub('#',q.inspect)

August 07, 2010

jQuery Multi-Number plugin

Here is a simple jQuery plugin that accepts multiple numbers in a single textbox and delimits each one with a delimiter of your choice.

Test the plugin in this box. Type a few numbers and press space or comma to delimit them.


The code is hosted at GitHub