How we scaled our Rails backend

In today’s post, I will cover what we did to optimize our backend system, a Ruby on Rails application.

Background

AdWyze (now Clarisights) uses Ruby on Rails, and as a data integration platform we talk to multiple external sources. In fact, we keep fetching data from sources like Facebook, Google Analytics, Comscore DAX, AppsFlyer etc. every 15 minutes. We have 65 unique background jobs that together run 2000 times per day to fetch and process data from external sources and/or process data stored in our databases: Postgres and MongoDB. With a Postgres DB of 100+ tables holding 100M rows in all, currently 100GB in size and growing at 0.5GB per day, and the requirement to keep data fresh with a maximum lag of 15 minutes, most of our jobs need to finish quickly. This presented a huge challenge in scaling our Rails app, both in terms of memory usage and speed. We have done a bunch of optimizations over the past 2 years to build a system that has scaled beautifully for us so far.

Optimizations

Reducing memory footprint

  1. Piping: have one thread fetch results from external sources and have another thread process them. This is much better than first fetching all results from external sources and only then processing them. We used the ruby-thread gem for this, which has a very simple way of creating a pipe (see the sketch after this list).
  2. Batch read: instead of loading a lot of data from the DB via the Rails find method, we use find_each. The latter reads data from the DB in batches and dramatically reduces the memory footprint (see the example after this list).
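
For the piping idea, here is a minimal sketch using only Ruby’s standard library (Thread and SizedQueue) rather than the ruby-thread gem’s own pipe API; fetch_report_pages and process_page are hypothetical stand-ins for an external API client and our processing step.

```ruby
# Bounded queue so the fetching thread can never race too far ahead of the
# processing thread; this is what keeps the memory footprint flat.
queue = SizedQueue.new(100)

# Producer: pushes pages of results onto the queue as they arrive.
fetcher = Thread.new do
  fetch_report_pages do |page|   # hypothetical: yields one page of API results at a time
    queue << page
  end
  queue << :done                 # sentinel: no more pages
end

# Consumer: processes each page while the next one is still being fetched.
processor = Thread.new do
  while (page = queue.pop) != :done
    process_page(page)           # hypothetical processing step
  end
end

[fetcher, processor].each(&:join)
```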
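
And a quick illustration of the batch-read point; the Ad model, account and process are hypothetical, but find_each and its batch_size option are standard ActiveRecord.

```ruby
# Loads every matching row into memory at once:
Ad.where(account_id: account.id).each { |ad| process(ad) }

# Streams the same rows in batches (1000 by default), keeping memory usage flat:
Ad.where(account_id: account.id).find_each(batch_size: 1000) { |ad| process(ad) }
```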

Speed

  1. Threading: most of our jobs are I/O bound (they either fetch data from external sources or read from the DB), so one can easily parallelize by processing different accounts’ data in parallel (see the sketch after this list).
  2. Be lazy: do only the work that is absolutely required. For example, if you know an object has not changed, why process it again? Empirical evidence showed us that 85% of our data does not change at any given time, so we avoided most of the hard work by not processing unchanged entities (see the fingerprint sketch below).
  3. Right indexes: this is a no-brainer, but many folks miss it, especially when you have a lot of tables and run many different queries on them. We used the RubyProf profiler for profiling, along with Postgres’ EXPLAIN ANALYZE and its equivalent in Mongo, to create the correct indexes. Postgres itself instruments index usage and scan data for each table in the pg_stat_user_indexes view, and that was also very helpful in finding tables that were missing indexes (see the query below). Finally, the pghero gem has been a hero for us: we found many other queries that were missing indexes.
  4. Parallel writes: we noticed that once we had optimized our reads and our processing was happening in parallel, writes to the DB became a bottleneck. We considered two options here: 1) batch the writes to the DB; 2) where batching was hard to do, introduce threads again and do multiple writes in parallel. In most cases we settled for the second option, as it was the easier one to do (see the last sketch below).
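
A minimal sketch of the per-account threading; accounts and sync_account are hypothetical names for our account list and per-account fetch-and-process step.

```ruby
# One thread per account; each thread does its own I/O-bound fetch + process,
# so a slow API response for one account does not block the others.
threads = accounts.map do |account|
  Thread.new { sync_account(account) }   # hypothetical fetch-and-process step
end
threads.each(&:join)
```

In practice you would cap the number of threads (for example with each_slice or a small pool) so you do not blow through API rate limits or DB connections.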
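
For the “be lazy” point, one way to detect unchanged entities is to keep a fingerprint of the last payload you processed. This is only an illustrative sketch; the fingerprint column and helper method are hypothetical.

```ruby
require "digest"
require "json"

# Skip the expensive processing when the fetched payload is identical to what
# we processed last time (roughly 85% of entities take this early exit for us).
def process_entity(entity, payload)
  fingerprint = Digest::MD5.hexdigest(payload.to_json)
  return if entity.last_payload_fingerprint == fingerprint   # hypothetical column

  do_expensive_processing(entity, payload)                    # hypothetical
  entity.update!(last_payload_fingerprint: fingerprint)
end
```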
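
To give an idea of how the Postgres index statistics help, here is the kind of query involved. pg_stat_user_indexes is a built-in statistics view; the Ruby wrapper around it is just for illustration.

```ruby
# Indexes with the fewest scans: an idx_scan of 0 on a large table is a hint
# that the index is unused, or that queries are falling back to sequential
# scans (which pg_stat_user_tables will confirm).
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT relname AS table_name, indexrelname AS index_name, idx_scan
  FROM pg_stat_user_indexes
  ORDER BY idx_scan ASC
  LIMIT 20
SQL
rows.each { |r| puts "#{r['table_name']}.#{r['index_name']}: #{r['idx_scan']} scans" }
```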
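
And a sketch of the parallel-write option, splitting the records to be saved across a handful of threads; records here is a hypothetical collection of dirty ActiveRecord objects.

```ruby
# Write slices of records concurrently so one slow INSERT/UPDATE round trip
# does not serialize all the others. Each thread checks out its own connection
# from ActiveRecord's pool (see the next section on connection handling).
records.each_slice(500).map do |slice|
  Thread.new { slice.each(&:save!) }
end.each(&:join)
```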

Too many open connections

Threading introduces the problem of too many open connections to the DB. To solve this, we did a few things:

  • We gave up the ActiveRecord connection as soon as we were done with a thread (by calling ActiveRecord::Base.clear_active_connections!); see the sketch below.
  • We added PgBouncer for our Postgres server.
  • We didn’t do anything explicitly here, but ActiveRecord internally uses a connection pool to manage connections to the DB.
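
Here is roughly what the first point looks like in code: the same per-account threading from the Speed section, but each thread now returns its connection to the pool as soon as it is done. clear_active_connections! is what the post describes; accounts and sync_account remain hypothetical names.

```ruby
threads = accounts.map do |account|
  Thread.new do
    begin
      sync_account(account)                          # hypothetical per-account work
    ensure
      # Return this thread's connection to ActiveRecord's pool right away,
      # instead of holding it until the thread is garbage collected.
      ActiveRecord::Base.clear_active_connections!
    end
  end
end
threads.each(&:join)
```

An alternative with the same effect is ActiveRecord::Base.connection_pool.with_connection, which checks a connection out for the duration of a block and returns it automatically when the block exits.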

With all these optimizations in place, we have been able to make our background jobs super fast while making sure we do not need an inordinate amount of memory (all our VMs have less than 8 GB of RAM).

PS: We have not been able to find a good memory profiler for Rails, or Ruby in general. RubyProf’s memory profiling does not work out of the box (it requires a patched MRI), and other solutions like Sam Saffron’s memory_profiler have not been convenient for finding memory hoggers either. If you know of a good one, please do share!

PS2: Yes, of course, there is the famous New Relic profiling tool, but it is super expensive. Hence, apart from their free trial, we didn’t bother to continue using them. Besides, we are not aware of whether it can find memory bloat in the code. We do use Datadog, where we track all timing metrics as well as report our own custom metrics. Datadog is 1/10th the cost of New Relic and is a good alternative as long as you are willing to put in some effort to exploit it to your advantage!

PS3: Caching expensive stats queries didn’t work for us. We were using Redis to cache our stats queries. We serve real-time stats plus the live status of ads, including their budgets and bids, and allow our customers to change the values in real time. When we started caching, cache invalidation became a nightmare, and customers were confused that our platform was showing old data even though they had changed it from our platform itself! Hence, we ditched caching completely.