MemSQL ships 2.0. Scales in-memory database across hundreds of nodes, thousands of cores.

When we started MemSQL a little over two years ago, our goal was to deliver the fastest OLTP database ever. Inspired by the scale and architectures we saw at Facebook, we hoped to help every enterprise leverage in-memory technologies similar to those that leading web companies use.

As we worked with our early customers, we saw that an in-memory solution could provide the greatest value by enabling users to analyze real-time and recent historical data together. Customers like Zynga and Morgan Stanley not only wanted to quickly commit transactions to the database, they also wanted instant answers to questions about how their real-time data compared to historical data. This inspired us to build something new – a solution that supports highly concurrent transactional and analytical workloads at Big Data scale.

That brings us to today. We’re proud to announce that MemSQL’s real-time analytics platform is available for download. This is the first generally available version of MemSQL that scales horizontally on commodity hardware. It provides the blazing fast performance for which MemSQL is known, and now does it at Big Data scale. Customers have deployed MemSQL across hundreds of nodes and dozens of terabytes of data, and we’ve tested at even greater volumes and velocities. (Check out our calculator to get an idea of the number of reads and writes you can perform depending on the size of your cluster.)

This is also the first version to include MemSQL Watch, a visual web-based interface for monitoring and managing your cluster. We expect this to be the beginning of our foray into real-time visualizations as many of our customers look to operationalize their analytics.

Deploying a database can be difficult, so we’ve made it as simple as possible. Download MemSQL for free on our site and take it for a spin. You’ll definitely be impressed by the performance, but you’ll also be impressed by what’s missing:

  • Batched loading – Don’t wait until the middle of the night to refresh your reports.
  • Complicated programming languages (and a limited talent pool) – Use SQL for real-time analytics.
  • An expensive, proprietary box (and a plan to rip and replace it in a few years) – Scale incrementally on commodity hardware.
  • A lengthy implementation cycle – Launch your first MemSQL instance in minutes in the cloud.

We’re proud of the progress our engineering team has made building out an enterprise-grade software solution. Stay tuned to this blog to learn more about the real-time analytics challenges we are helping customers conquer. More to come.

Practical Techniques to Achieve Quality in Large Software Projects

High quality is hard to achieve and very expensive, but it’s worth every penny and must be taken extremely seriously. There is no silver bullet – just lines of defense. The good news is with the proper lines of defense, quality becomes incremental. It only goes up with every release. With enough test coverage and quality tools you can substantially increase the quality of your product and protect it from regressions.

Common Pitfalls in Writing Lock-Free Algorithms

Formally, a multi-threaded algorithm is considered to be lock-free if there is an upper bound on the total number of steps it must perform between successive completions of operations.
The statement is simple, but its implications are deep – at every stage, a lock-free algorithm guarantees forward progress in some finite number of operations.
Deadlock is impossible.

Where Should TopCoders Work. One Year Later.

A little over a year ago we published a blog post, “Where Should Top Coders Work?”. Within a few weeks, topcoder.com approached us and offered us the chance to sponsor TopCoder Open. We hired two amazing folks after that event, and two more joined as interns. With a handful of red TopCoders working on the engine, we’ve had a great relationship with the TopCoder community.

When I published the original blog post, I had absolute faith in the TopCoder brand, but we had nothing to really back it up. Since a year has passed, I thought it’d be worthwhile to share what our Top Coders have been working on.

What have TopCoders built at MemSQL?

Here at MemSQL, engineers have a lot of freedom in deciding what to work on. Instead of saying “we have a great team solving hard problems,” I’d like to share what each TopCoder has worked on since joining the company.

Pieguy. David fell in love with concurrency and lock-free data structures. He has built/improved our entire lock-free story.  David is also great at finding tough race conditions by studying the code and investigating various concurrency scenarios including finding a bug in a well-known, lock-free paper (http://www.research.ibm.com/people/m/michael/podc-1996.pdf). TopCoder trains people to read and understand code very well.

Nika. Nika tackled the challenges in implementing joins and subqueries, which included building an updated optimizer, an artificial intelligence component that decides what indexes and what join order to use in a complex query.

Momchil. As an intern, Momchil worked on various components of the codebase, but he really got hooked working on distributed query execution. By the end of his internship MemSQL was able to run queries like the one below in a distributed environment over a 100 node cluster (actual use case):

select x.i, x.a, x.b, o.c, o.x from (select * from t) as x left join (select c, count(*) as x from ob group by c) o on x.a = o.c

Check out Momchil’s blog in which he writes about his experience at MemSQL.

Decowboy. While interning at MemSQL, Jelle worked on replication. He helped with the robust design and implemented major components of the feature. This is a highly technical project that involved transaction log shipping and recovery on the slave, failover, and resiliency to all kinds of DDL (data definition language) such as create/drop table and create/drop database.

SkidanovAlex. Alex is the first MemSQL TopCoder and he has contributed to all the major components of MemSQL. He has been a crucial contributor to MemSQL’s distributed product.

The list could go on; these are just the highlights of what the TopCoder crew has built.

What can TopCoders learn at MemSQL?

Databases are complex at every level: storage engines, optimizers, parsers, distributed systems – things most regular developers don’t work with on a day-to-day basis.

In a famous blog post (http://herbsutter.com/welcome-to-the-jungle/), Herb Sutter claims:

  • Applications will need to be at least massively parallel, and ideally able to use non-local cores and heterogeneous cores.
  • Efficiency and performance optimization will get more, not less, important.
  • Programming languages and systems will increasingly be forced to deal with heterogeneous distributed parallelism.

MemSQL is at the forefront of building massively parallel software. For engineers who possess a solid foundation in problem solving with algorithms and data structures, MemSQL is a great place to build these skills.

Why do TopCoders succeed at MemSQL?

Apart from the obvious such as an incredibly high IQ, excellent problem solving skills, and proficiency in algorithms, there is one trait that we find very attractive: TopCoders love to code. If you code a lot every day it generates a compound effect. You are getting much better and more productive.

Startup vs Big Company

Growth and opportunity are crucial for a successful engineering culture. Unfortunately, big company environments often hinder growth potential because of bureaucracy and a lack of emphasis on engineering culture. We are lucky to have very hard problems in spades and have no bureaucratic roadblocks for a TopCoder to grow and advance his or her career.

Are we going to hire more TopCoders?

We are happy to announce that we are sponsoring TopCoder Open once again and keeping our doors open for more amazing engineers coming out of this unique community.

Being a TopCoder is not a prerequisite for working at MemSQL, but being a great engineer is. Please find us at careers@memsql.com

MemSQL: My First *Real* Startup Experience

This is reposted from Momchil Tomov’s blog. Momchil was part of the first summer batch of MemSQL Interns.

After seemingly stumbling into their office by accident, getting interviewed on a lark, and receiving an offer as my Christmas gift, I kept an open mind for what to expect from MemSQL. The one-year-old YC alum was set to build the world’s fastest database, leaving competitors like MySQL and MongoDB in the dust. Very ambitious indeed.

From day one, I was thrown in the fire pit. Someone had to finish the Workload Simulator, an important developer tool, before the launch. Everyone was super busy fixing bugs and polishing the product, so I had to do a quick Javascript/Flask/Python crash course and jump on it. After eight long nights, a thousand lines of code, and a huge amount of support from Ankur, my mentor, the Workload Simulator was finished to coincide with launch.

I spent the next month working on code generation: writing code that compiles SQL to C++, the core strength of MemSQL. Once I got acquainted with the codebase, Ankur decided to step up my game with data compression. I had to find a way to compress 10 TB of random strings down to at most 5 TB without compromising the speed of the transactions, and I had two weeks to do so. After some research, experimentation, and lots of dumb luck, I managed to bring it down to 2 TB. It required all the MemSQL knowledge I had obtained thus far, including things I learned during my interview.

Just as I finished the first prototype, I joined the team working on sharding. Sharding is what enables MemSQL to get distributed across multiple machines, an important feature of every scalable database and also an important feature for our biggest clients. I was onboarded on Sunday, began work on Monday, and shipped my first code review on Tuesday. One of the dev leads, Ankur, gauged my progress, and kept assigning me more and more sharding tasks, slowly walking me through the implementation of Distributed MemSQL. Just as a father silently lets go of the back seat of his kid’s bicycle, he slowly backed away and moved on to testing and bug fixing. Before realizing it, I found myself working on the team with no safety wheels, full speed ahead, pushing the forefront of MemSQL. I was no longer an intern. I was a full-time MemSQL engineer.

They say Princetonians work hard and play really really hard. If that’s true, then MemSQL felt even more like home. From the extraordinarily delicious meals prepared by our in-house chef Daniel, to the post-launch Las Vegas Celebration Trip, to the relaxing Saturdays on Vasili’s boat underneath the Golden Gate, the MemSQL team knows how to have a good time. As MemSQL is growing, I sincerely hope they maintain their great culture of ridiculous engineering and chill free time. I had an amazing summer many thanks to the incredible people at 380 10th Street. I wish Eric, Nikita, Marko, Vasili, Alex, Masha, Nika, Pieguy, Adam, Ankur, Daniel, et al. all the best and hope they build a company 300x better than the rest!

Go MemSQL!

Loading half a billion records in 40 minutes

Adam Derewecki wrote a cool post about his experience loading half a billion records into MySQL. MemSQL is a MySQL-compatible drop-in replacement that is built from the ground up to run really fast in memory.

Disclaimer: this isn’t an apples-to-apples comparison with derwiki, since we don’t have his dataset and need a much beefier machine to load everything into memory.

Schema

Here is the schema for the table. Note that the table has three indexes. On top of these, MemSQL automatically generates a primary index under the covers. We won’t be disabling keys.

drop table if exists store_sales_fact;
create table store_sales_fact( 
       date_key                smallint not null,
       pos_transaction_number  integer not null,
       sales_quantity          smallint not null,
       sales_dollar_amount     smallint not null,
       cost_dollar_amount      smallint not null,
       gross_profit_dollar_amount smallint not null,
       transaction_type        varchar(16) not null,
       transaction_time        time not null,
       tender_type             char(6) not null,
       product_description     varchar(128) not null,
       sku_number              char(12) not null,
       store_name              char(10) not null,
       store_number            smallint not null,
       store_city              varchar(64) not null,
       store_state             char(2) not null,
       store_region            varchar(64) not null,
       key date_key(date_key),
       key store_region(store_region),
       key store_state(store_state)
);

 

Hardware

We have a very cool machine, with 64 cores and 512 GB of RAM, from Peak Hosting. You can rent one for yourself for a little under two grand a month. They were kind enough to give it to us to use for free. Here are the specs for one core.

vendor_id       : AuthenticAMD
cpu family      : 21
model           : 1
model name      : AMD Opteron(TM) Processor 6276
stepping        : 2
cpu MHz         : 2300.254
cache size      : 2048 KB

You read that correctly: this machine has sixty-four 2.3 GHz cores and 512 GB of RAM, almost 8 times the largest memory footprint available in the cloud today, all on dedicated hardware with no virtualization overhead and no resource contention with unknown third parties.

Loading efficiently

Loading data efficiently is actually not that trivial. The best way of doing it with MemSQL is to use as much CPU as you can get. Here are a few tricks that can be applied.

1. Multi-inserts

We can batch inserts into 100 row multi-inserts. This will reduce the number of network roundtrips. Each roundtrip now accounts for 100 rows instead of one. Here is what multi-inserts look like.

insert into store_sales_fact values('1','1719','4','280','97','183','purchase','14:09:10','Cash','Brand #6 chicken noodle soup','SKU-#6','Store71','71','Lancaster','CA','West'),
('1','1719','4','280','97','183','purchase','14:09:10','Cash','Brand #5 golf clubs','SKU-#5','Store71','71','Lancaster','CA','West'),
('1','1719','4','280','97','183','purchase','14:09:10','Cash','Brand #4 brandy','SKU-#4','Store71','71','Lancaster','CA','West'),
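As a rough sketch of how such statements could be generated, the Python below groups rows into multi-row inserts. The helper names sql_literal and multi_inserts are hypothetical, and the quoting is for illustration only; for untrusted data, rely on the database driver's parameter binding instead.

```python
def sql_literal(value):
    """Quote a value as a SQL string literal (illustration only)."""
    text = str(value).replace("\\", "\\\\").replace("'", "''")
    return "'" + text + "'"


def multi_inserts(table, rows, batch_size=100):
    """Yield one multi-row INSERT statement per batch of batch_size rows."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One parenthesized tuple per row, separated by commas.
        tuples = ",\n".join(
            "(" + ",".join(sql_literal(v) for v in row) + ")" for row in batch
        )
        yield "insert into %s values%s;" % (table, tuples)
```

With batch_size=100, a list of 250 rows produces three statements: two with 100 rows each and one with the remaining 50.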

2. Load in parallel.

Our customer has a sample file of 510,593,334 records. We can use the command line mysql client to pipe this file into MemSQL, but this would not leverage all the cores available in the system. So ideally we should split the file into at least as many chunks as there are CPUs in the system.
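One way to do the splitting is sketched below. It assumes the source file holds one complete SQL statement per line; the function name split_file and the data%d.sql naming are illustrative choices, not the tool actually used for this experiment.

```python
import itertools
import os


def split_file(source_path, out_dir, lines_per_chunk):
    """Write consecutive runs of lines_per_chunk lines to data0.sql, data1.sql, ...

    Assumes each line of source_path is one complete SQL statement.
    Returns the list of chunk paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with open(source_path) as source:
        while True:
            # Read the next run of lines; an empty list means end of file.
            lines = list(itertools.islice(source, lines_per_chunk))
            if not lines:
                break
            path = os.path.join(out_dir, "data%d.sql" % len(paths))
            with open(path, "w") as chunk:
                chunk.writelines(lines)
            paths.append(path)
    return paths
```

Only one chunk file is open at a time, so the number of chunks is not limited by the per-process file descriptor limit.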

3. Increase granularity

Splitting the file into 64 big chunks will introduce data skew: the total load time is determined by the slowest thread. To address this problem, we split the file into thousands of chunks, and every time a thread frees up it starts loading another chunk. We split the file into 2000 chunks.

-rw-r--r--  1  35662844 2012-06-06 11:42 data10.sql
-rw-r--r--  1  35651723 2012-06-06 11:42 data11.sql
-rw-r--r--  1  35658433 2012-06-06 11:42 data12.sql
-rw-r--r--  1  35663665 2012-06-06 11:42 data13.sql
-rw-r--r--  1  35667480 2012-06-06 11:42 data14.sql
-rw-r--r--  1  35659549 2012-06-06 11:42 data15.sql
-rw-r--r--  1  35661617 2012-06-06 11:42 data16.sql
-rw-r--r--  1  35650414 2012-06-06 11:42 data17.sql
-rw-r--r--  1  35661625 2012-06-06 11:42 data18.sql
-rw-r--r--  1  35667634 2012-06-06 11:42 data19.sql
-rw-r--r--  1  35662989 2012-06-06 11:42 data1.sql
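To see why finer granularity helps, here is a small simulation, offered as a sketch rather than a measurement. The random factor in [0.5, 1.5] is a made-up stand-in for skew, and load_time is a hypothetical helper. Chunks are handed, in order, to whichever worker frees up first.

```python
import heapq
import random


def load_time(num_chunks, num_workers, total_work=1.0, seed=0):
    """Simulated wall-clock time to load total_work split into num_chunks.

    Each chunk takes its equal share of the work times a random factor in
    [0.5, 1.5]. The next chunk always goes to the worker that frees up first.
    """
    rng = random.Random(seed)
    durations = [
        total_work / num_chunks * rng.uniform(0.5, 1.5) for _ in range(num_chunks)
    ]
    free_at = [0.0] * num_workers  # min-heap of the times each worker is free
    for duration in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + duration)
    return max(free_at)


coarse = load_time(num_chunks=64, num_workers=64)
fine = load_time(num_chunks=2000, num_workers=64)
print("64 chunks: %.4f   2000 chunks: %.4f" % (coarse, fine))
```

With one chunk per worker, the run lasts as long as the single slowest chunk. With many small chunks the slow ones are spread out, so the total time stays close to the average.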

4. Load script

The load script uses the Python multiprocessing library to load data efficiently.

import multiprocessing
import optparse
import os
import sys

parser = optparse.OptionParser()
parser.add_option("-D", "--database", help="database name")
parser.add_option("-P", "--port", help="port to connect. use 3306 to connect to memsql and 3307 to connect to mysql", type="int")
(options, args) = parser.parse_args()

if not options.database or not options.port:
    parser.print_help()
    sys.exit(1)

total_files = 2000

def load_file(filename):
    # Pipe one chunk into the server through the command line mysql client.
    try:
        print("loading from pid: %d" % os.getpid())
        query = 'mysql -h 127.0.0.1 -D %s -u root -P %d < %s' % (options.database, options.port, filename)
        print(query)
        os.system(query)
        print("done loading from pid: %d" % os.getpid())
    except Exception as e:
        print(e)

if __name__ == "__main__":
    # Empty the target table before loading.
    os.system('echo "delete from store_sales_fact" | mysql -h 127.0.0.1 -D %s -u root -P %d' % (options.database, options.port))
    # Run two worker processes per CPU; each picks up the next chunk as soon as it is free.
    p = multiprocessing.Pool(processes=2 * multiprocessing.cpu_count())
    for j in range(0, total_files):
        p.apply_async(load_file, ['data/data%d.sql' % j])

    p.close()
    p.join()

5. Running it

I started loading by issuing the following command.

time python load.py -D test -P 3306

After we do this, let’s start htop to check the processor saturation. It looks pretty busy :)

MemSQL uses lock-free data structures that eliminate a lot of contention.

It took a bit of time, but the data has been loaded.

 real    39m21.465s
 user    33m53.210s
 sys     5m24.470s

The result is more than 200K inserts a second (about 216K) for a table with 4 indexes. The memory footprint of the memsql process is 267 GB.

Running the same test against mysql

It would not be fair to skip a comparison with MySQL, particularly since it’s so easy to do: MemSQL uses the MySQL wire protocol and supports MySQL syntax. I used the my.cnf settings from http://derwiki.tumblr.com/post/24490758395/loading-half-a-billion-rows-into-mysql and fired up the same command.

time python load.py -D test -P 3307
real    536m0.376s
user    27m17.130s
sys     5m36.780s

I did not disable indexes for this run to make it fair compared to MemSQL.
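For reference, the throughput behind these timings can be worked out from the record count and the two "real" times reported by the time command. The variable names below are just for illustration.

```python
RECORDS = 510593334  # rows in the sample file


def seconds(minutes, secs):
    """Convert the 'NNmSS.SSSs' output of time into seconds."""
    return minutes * 60 + secs


memsql_seconds = seconds(39, 21.465)   # real 39m21.465s
mysql_seconds = seconds(536, 0.376)    # real 536m0.376s

memsql_rate = RECORDS / memsql_seconds  # about 216,000 rows per second
mysql_rate = RECORDS / mysql_seconds    # about 15,900 rows per second
speedup = mysql_seconds / memsql_seconds  # between 13x and 14x

print("MemSQL: %.0f rows/s" % memsql_rate)
print("MySQL:  %.0f rows/s" % mysql_rate)
print("Ratio:  %.1fx" % speedup)
```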

Conclusion

With the Peak Hosting offering of 512 GB of RAM and memory-optimized software like MemSQL, you can save full business days in data loading and get immediate, valuable insight into your data.

Reading Between the Benchmarks: How MemSQL Designs for Speed and Durability

Foreword

There’s been a lot of discussion recently about how MemSQL compares to MySQL and other databases. Prior to the recent controversy, we were planning to publish the post below, where we jump in and describe a workload that we ran against MemSQL, MySQL, and MongoDB, all configured for asynchronous durability, on a few different hardware scenarios. Before that, though, we’d like to address some of the comments.

First, we’ve built MemSQL with a sensible set of trade-offs around index performance and durability in mind. The customers that we’ve worked with and optimized for have indeed seen on average a 30 times improvement for their workloads. Of course, just the fact that their data now lives in memory is a major part of the boost. But an important point is that MemSQL’s ease-of-use is what gets them there in the first place. MemSQL helps to relieve the cost of engineering caching or NoSQL infrastructure that syncs back to a database like MySQL.

Of course, we must consider MySQL’s in-memory performance. Jelle, one of our interns, explores this below by tuning InnoDB to behave like MemSQL. But the short answer is that MemSQL is a lot faster. And that’s because the database has been engineered from the ground up to run in memory. Everything from the index data structures (lock-free skip lists and hash tables) to every detail of our durability (log and snapshot format, transaction buffer, etc.) was designed and built for the in-memory, archive-to-disk use case.

Even though MemSQL runs in memory, it logs transactions to disk as fast as the disk will write. MemSQL uses an in-memory transaction buffer that is flushed to disk in a separate log flusher thread. By the time a transaction commits in memory, it has been written to the in-memory buffer. If the amount of data in the buffer exceeds transaction-buffer MB, then writes block until there is space available. Both MongoDB and MySQL are often run with similar configurations.

We expose the performance of this component through the show status query as Transaction_buffer_wait_time. This value measures the cumulative amount of time transaction threads have blocked trying to insert into the transaction buffer. The default size of the transaction buffer, 128 MB, is a heuristic: most disks can easily sequentially write 128 MB/s, so you shouldn’t lose more than a second of data. Visit the durability page in our documentation for more information.
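The general pattern can be sketched in a few lines of Python. This is only an illustration of a bounded buffer drained by a background flusher thread, not MemSQL's implementation. The class name TransactionBuffer, the sink callback, and the wait_time attribute (loosely analogous to Transaction_buffer_wait_time) are all hypothetical.

```python
import threading
import time


class TransactionBuffer:
    """Bounded in-memory log buffer drained by a background flusher thread."""

    def __init__(self, capacity_bytes, sink):
        self.capacity = capacity_bytes
        self.sink = sink            # callable standing in for the disk write
        self.pending = []           # records not yet handed to the flusher
        self.used = 0               # bytes buffered and not yet written
        self.wait_time = 0.0        # cumulative seconds writers spent blocked
        self.closed = False
        self.cond = threading.Condition()
        self.flusher = threading.Thread(target=self._flush_loop)
        self.flusher.start()

    def append(self, record):
        """Called at commit time; blocks only while the buffer is full."""
        if len(record) > self.capacity:
            raise ValueError("record is larger than the whole buffer")
        with self.cond:
            while self.used + len(record) > self.capacity:
                start = time.monotonic()
                self.cond.wait()
                self.wait_time += time.monotonic() - start
            self.pending.append(record)
            self.used += len(record)
            self.cond.notify_all()

    def _flush_loop(self):
        while True:
            with self.cond:
                while not self.pending and not self.closed:
                    self.cond.wait()
                if not self.pending:
                    return          # closed and fully drained
                batch, self.pending = self.pending, []
            self.sink(b"".join(batch))   # slow I/O happens outside the lock
            with self.cond:
                self.used -= sum(len(r) for r in batch)
                self.cond.notify_all()   # wake writers waiting for space

    def close(self):
        """Flush everything that is still buffered and stop the flusher."""
        with self.cond:
            self.closed = True
            self.cond.notify_all()
        self.flusher.join()
```

Writers return as soon as their record is in memory, so commit latency is decoupled from disk latency until the buffer fills. The capacity bounds how much committed data can be lost in a crash, which is the reasoning behind sizing the buffer to about one second of sequential disk writes.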

We haven’t yet encountered a client who has asked us to optimize MemSQL for synchronous durability. Nevertheless, we know how to improve it in our current design and can discuss this in a future post.

The post below is meant to be a seed for you to run your own experiments with MemSQL. Since we can’t cover every workload, hardware configuration, and database in a benchmark, we’ve built something simple that runs MemSQL, MySQL, and MongoDB and we hope it will enable the community to continue running its own experiments. The more data points we have, the more we can improve the product.

- MemSQL Engineering


When I joined MemSQL at the beginning of this summer, I decided to write a small benchmark to see for myself how fast MemSQL can be. One of MemSQL’s main strengths is its highly optimized lock-free skiplist implementation. This means that MemSQL should scale almost linearly on machines with many cores. To explore this, I wrote a benchmark comparing the performance of MemSQL, MongoDB, and MySQL. My benchmark simulates a simplified online multiplayer game, with the database responsible for tracking players, games and events. The code for the benchmark is available at github.com/memsql/bench and can easily be extended to other databases.

The databases

The two databases I compared against MemSQL are MySQL and MongoDB. MemSQL is wire compatible with MySQL, so a comparison is natural and very easy. MemSQL and MongoDB have radically different interfaces as MongoDB is a NoSQL database. However, MongoDB is a popular choice for social applications so considering its performance is worthwhile as well.

MemSQL is an in-memory database that stores all the contents of the database in RAM but backs up to disk. MongoDB and MySQL store their data on disk, though they can be configured to cache the contents of the disk in RAM and asynchronously write changes back to disk. This fundamental difference influences exactly how MemSQL, MongoDB and MySQL store their data structures: MemSQL uses lock-free skip lists and hash tables to store its data, whereas MongoDB and MySQL use disk-optimized B-trees.

Description of the benchmark

The benchmark tests performance of the database when used as the backend for an online turn based multiplayer game. It was inspired by mobile phone versions of games such as multiplayer chess. For the core game logic, the database stores players and games. To track statistics, the database also stores a log of all past game actions and average game lengths.

The database stores four kinds of objects:

  • Players. Every player has an ID, and two statistics: the number of games started and the number of games won.
  • Games. Each game has an ID, two players and a turn.
  • Game length statistics. We store a histogram of finished game lengths, composed of a game length and an integer counting the number of games.
  • Events. Finally, every action performed is stored as an up to 32-character description, together with a timestamp and a player ID.
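To make the four object kinds concrete, they can be modelled roughly as follows. This is a sketch based only on the descriptions above: the class and field names are hypothetical and are not taken from the benchmark's actual schema, and treating the turn as an integer counter is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Player:
    player_id: int
    games_started: int = 0
    games_won: int = 0


@dataclass
class Game:
    game_id: int
    player1_id: int
    player2_id: int
    turn: int = 0  # assumed to be a turn counter


@dataclass
class GameLengthBucket:
    """One bar of the histogram of finished game lengths."""
    game_length: int
    num_games: int = 0


@dataclass
class Event:
    description: str
    timestamp: float
    player_id: int

    def __post_init__(self):
        # Descriptions are limited to 32 characters.
        if len(self.description) > 32:
            raise ValueError("description is limited to 32 characters")
```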

The benchmark simulates many players simultaneously playing games. Several Python worker processes each simulate hundreds of players. Each worker continually picks a player and then simulates one player action. All actions are logged in the events table. This logic is implemented in benchmark.py, and executes according to the following diagram:

To support this simulation, the database needs to support several operations: find all games a player is currently playing, create a game, store statistics, etc. In the benchmark, these operations are described in base_database.py. Since MySQL and MemSQL are protocol compatible, sql_database.py implements the interface for both MySQL and MemSQL. The MongoDB implementation can be found in mongo_database.py.

To support efficient queries, I added indices on all player ID fields. To allow efficient exploration of past events, I also added an index to the timestamp column in the events table. In MongoDB, the timestamp column is integrated in the object ID and is thus simpler. Furthermore, in MongoDB, embedded documents can be used as an alternative to separate queries with indices. In this benchmark, I could have stored the list of active games in each player object. However, since each game is used by two players, I decided against this approach as it would move all writes to the player table. That would be an undesirable situation as MongoDB has a per collection write lock.

Settings used

By default, MySQL flushes after every transaction. To make MySQL behave like MongoDB and MemSQL, I set up MySQL to use a transaction buffer. The settings that differ from the default Ubuntu settings are the following:

default-storage-engine=INNODB
innodb_buffer_pool_size = 4GB
innodb_log_file_size = 128M
innodb_log_buffer_size = 4M
innodb_flush_log_at_trx_commit=2
innodb_thread_concurrency=0
innodb_flush_method=O_DIRECT
innodb_file_per_table

MemSQL and MongoDB are running with their default configurations, which means both are durable and use a transaction buffer. This means that all databases are configured to behave the same way. I used the latest stable version of each database (MongoDB 2.0.6, MySQL 5.5.25, MemSQL 1b).

Results

I ran the benchmark for 10 minutes on several machines. I first ran it on my old MacBook Pro (running Ubuntu natively), which has a dual core Intel(R) Core(TM) 2 Duo T9400. After that, I tried a server with 8 Intel(R) Xeon(R) cores, each running at 2.4GHz. Finally, I tried a 24 core machine with AMD(R) Opteron(TM) processors, each core clocking in at 1.9GHz. For the two and eight core machines, the benchmark client was running on the same server as the database. For the 24 core machine the benchmark was running on a different server, directly connected over gigabit ethernet. The benchmark is configured to use 140 worker processes and 4000 players per worker process. This can be changed in config.py. I achieved the following number of actions (as described above) processed per second:

In all of these configurations, MemSQL outperforms both MongoDB and MySQL. Going from 2 to 8 cores, MemSQL improves its performance by a factor of 5. MySQL is almost four times as fast, and MongoDB is almost three times as fast as in the 2 core scenario. When going from 8 to 24 cores, MemSQL performs more than 3 times faster. However, MongoDB and MySQL perform slower than before. To verify this effect, I limited MongoDB and MySQL to run on 8 of the 24 cores, and they performed about as well as they did running on all 24 cores.

In this benchmark, MemSQL is significantly faster on all fronts. On two cores, MemSQL benefits from its efficient query parsing and pre-compiled code per query. When scaling from two to eight cores, all three databases are able to take advantage of the faster hardware. However, MemSQL scales better thanks to its lock-free data-structures which allow all 8 cores to interact efficiently with the database at once. The 24 core machine has individual cores slower than the individual cores of the 8 core machine. MongoDB and MySQL are not capable of taking advantage of these extra cores; instead, their performance degrades due to the slower cores. MemSQL on the other hand has a speed increase, as it is able to take advantage of all the available cores.

As a final note, the performance of all databases does not change significantly over time. I sampled the number of actions per second every second over the 10 minute run on the 24 core machine, and performance doesn’t change:

 

Conclusion

In the end, this simulation is just that: a simulation. But it demonstrates that MemSQL is very fast and scales well for write-heavy workloads like this game. We want this code to serve as a template for benchmarking MemSQL and other databases.

We made a strong effort to configure MySQL and MongoDB fairly. These tests can also be run on a variety of different hardware scenarios, network configurations, and modified workloads. There are even tune-able parameters within the provided benchmark (number of worker processes, number of players). We are as interested as you are to see your findings with MemSQL against other databases and in other benchmarks.

To explore MemSQL on your own, visit the MemSQL download page and get the developer edition. If you choose to set up a MemSQL AMI on Amazon EC2, you can get the database up and running on an 8 core machine in less than 5 minutes. The code for this benchmark is available on github. Happy Benchmarking!

Where Should Top Coders Work?

My career as a software engineer really began when I won a medal at the ACM ICPC programming contest in 2001. To place in the tournament, I had spent 24 hours traveling from Russia to Vancouver and back, just to spend 5 hours on the actual competition.

The rules are simple: you have 5 hours to solve up to 12 problems. For each problem you need to implement a small program in Java or C++ and send it to the jury. They compile it and run it through an extremely intensive set of tests. Only if it passes every test will the jury count your submission.

Even though it only marked the beginning of my career, it had taken 4 years of intense preparation to place at the tournament – 4 years of training, learning to think fast, practicing on weekends, and other sorts of mental gymnastics. ACM was a good school for me. I loved it, but I wasn’t alone.

Is it really hard to become a Top Coder?

Becoming an ACM ICPC medalist in 2011 is about 20 times harder than it was in 2001. People who can do it now are coding machines and algorithm junkies. They practice every day on TopCoder, Google Code Jam, and Codeforces to stay sharp. They memorize whole books filled with algorithms and equations in case they need a tool to solve a problem quicker than their opponent. It should be no surprise that this programming subculture has caught the attention of many great companies.

So where do programming contest winners go and work? It used to be Microsoft (where I started my career), then Google. Now it’s Facebook. These companies know how valuable top coders are. Microsoft, Google and Facebook wouldn’t be where they are without exceptional engineers.

How do Top Coders fare after graduation?

They fare very well and tend to build their careers at big companies. Many are attracted to the high salary right out of school. However, according to topcoder.com, many ultimately find work at large companies to be tedious and less challenging than the mental jujitsu of programming contests.

Large teams have the luxury of compartmentalizing a problem to reduce complexity. This creates an unfortunate side effect: these smaller problems just aren’t that interesting.

In addition, many ICPC champions soon miss the dynamic of small teams, the kind they experienced while training and competing during university.

As it so happens, there is a time in a company’s history when it’s perfect for a top coder to join – when the company’s just getting started.

Why do some Top Coders found/join an early-stage startup?

For the challenge, of course.

Just take a look at these real-life examples:

  1. Adam D’Angelo was a finalist in the international Topcoder Collegiate Challenge in 2005. Later he was VP and CTO of Facebook and then left to found Quora.
  2. Nikolai Durov is Employee #1 at the biggest Russian social network vkontakte. It beats Facebook on the Russian market. He could’ve gone to Google or stayed in academia, but he made a small bet that had lots of upside.
  3. One of my team members Leonid Volkov went to an early stage company and built the “TurboTax of Russia”. He enjoyed an incredibly successful exit, and he’s since gone into politics.
  4. Prasanna Sankaranarayanan is the founder of LikeALittle and was the highest ranked Top Coder in India. It took him one year at Microsoft to realize that the perfect job for a Top Coder is at an early stage startup.

Small startups offer Top Coders lots of responsibility and with it the trust and autonomy to solve tough problems.

Top Coders at MemSQL

Today, MemSQL has four Top Coders. I won a bronze medal in ACM ICPC in 2001, and Alex Skidanov was a Top Coder #13 in 2008 and ACM ICPC #3 Champion. A Top Coder would feel right at home at MemSQL. Here, we work on hard algorithmic and systems-level problems, distributed systems, and cloud infrastructure. We also now have Top Coders who ranked #4 and #8 in algorithms. Sometimes it’s scary to leave a big corporation, but the truth is that apart from fun, market pay, and the potential of a huge upside, a good early stage startup gives the kind of experience that makes a Top Coder extremely relevant in today’s tech industry.

If you’d like to learn more, shoot us an email at topcoders@memsql.com.

Welcome to the MemSQL Developer Blog!

We’ve been hard at work building the world’s fastest database, and now that we’re shipping MemSQL, we’re looking forward to having a bit more time to blog about some of the fundamentals around the MemSQL technology.

In the coming weeks, we’ll be publishing posts that cover an array of topics, including benchmarking, stress testing, database theories, algorithms, and more.

In the meantime, we encourage you to download MemSQL and have some fun doing your own benchmarking. We’ve also published a workload simulator on Github to help get you started in the right direction.

Thanks for dropping by and we look forward to a lot of great posts and discussions.