I am building a URL shortener and use InnoDB as the storage engine for link data. I will start with a single VPS running both the application and the MySQL database on the same virtual machine. I want the system to be easily scalable: when needed, I want to be able to add new nodes and improve performance with as little effort as possible.
I have been reading about MySQL Cluster and MySQL Replication, but I haven't been able to decide which one to use. As you can imagine, a (popular) URL shortener will be both write- and read-intensive. What structure would you use in such a case? Would you go for Cluster or Replication?
Then, based on the choice of Cluster or Replication, what infrastructure/configuration should I have in order to be able to expand from a single InnoDB database to a cluster or replication structure? I want to start correctly; I don't want to be stuck in a situation where I cannot expand or improve the database system when I need to.
Thanks a lot.
MySQL Cluster is possibly a good fit for you because of your key:value lookup traffic pattern. NDB is traditionally poor with joins and complex queries. I would say, however, that it is not the same as MySQL and has caveats that you need to understand before jumping into bed with it. Its shared-nothing architecture means that you should operate it at scale with decent (and redundant) network infrastructure. It is perhaps worth a proof-of-concept project to fit it out and test it. The guys at Severalnines have configuration tools to get you up and running.
Mongo isn't a bad fit either, and as you state you cannot give up AUTO_INCREMENT: why not? Do values like (1902jdd, "/page/id") really need an auto-increment? Mongo documents are schemaless, so you can add extra data to a document wherever you see fit, and you can query in a key:value style too. Statistics, moving forward, would put you into the map-reduce world, which could be fun.
Keeping with a traditional RDBMS and requiring performance, you should look at memcached and the MySQL memcached API for rapid key:value access to the data; that would retain the relational aspect for aggregation queries (select [count|max|avg] … group by … order by blah, blah …).
MongoDB – it has automatic sharding, so you will be able to add nodes one after another. Otherwise, to scale writes you will need to shard MySQL one way or another, and Cluster will not help you there.
(I see that this is a terribly old question, but I think I have an innovative MySQL answer.)
The single table is something like this:
CREATE TABLE x (
    url   BLOB NOT NULL,
    md5   BINARY(16) NOT NULL,   -- of the url, UNHEX'd into BINARY
    short VARCHAR(12) CHARACTER SET ascii NOT NULL
);
The servers will not have identical tables; see below.
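Client-side, the md5 column would be computed from the URL before the INSERT. A minimal Python sketch (the function name is my own, not from the answer):

```python
import hashlib

def url_digest(url: str) -> bytes:
    # 16-byte MD5 digest of the URL -- the same value as
    # UNHEX(MD5(url)) in MySQL, so it fits BINARY(16) exactly.
    return hashlib.md5(url.encode("utf-8")).digest()

digest = url_digest("http://example.com/some/long/path")
print(len(digest))  # 16
```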
Set 1: Primaries — Sharded on md5 — write-once servers — Just for creating new links (or finding that it already exists):
PRIMARY KEY(md5) ENGINE=InnoDB
Set 2: A single server in the middle (no sharding) — Just for facilitating many-to-many replication:
Set 3: Replicas — read-only servers — Just for looking up short-urls:
PRIMARY KEY (short) ENGINE=InnoDB
“2” is the Replica of set 1 and the Primary for set 3. “2” uses “multi-source replication” to simultaneously accept data from the Primaries (set 1).
short is a sequential number mangled into letters and digits. You should decide whether to use hex, base64, or something custom. A custom mangling would let you avoid confusables like 2/Z, etc. And maybe you want to avoid mixing upper and lower case.
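A sketch of one possible custom mangling in Python; the alphabet below is my own assumption (digits and lowercase letters with the usual confusables removed), not something prescribed by the answer:

```python
# Assumed 28-character alphabet: no 0/o, 1/l, 2/z, 5/s.
ALPHABET = "346789abcdefghijkmnpqrtuvwxy"

def encode(n: int) -> str:
    """Mangle a sequential id into a short string (plain base-28)."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, len(ALPHABET))
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def decode(s: str) -> int:
    """Inverse of encode(), for mapping a short code back to its id."""
    n = 0
    for ch in s:
        n = n * len(ALPHABET) + ALPHABET.index(ch)
    return n
```

If the "insert at the end" property matters, a fixed-length (zero-padded) variant keeps the encoded strings in the same order as the ids.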
I say “sequential” because I want set 3 to be efficient by only inserting at the “end” of the table. Once the data gets to “3”, and because inserts land at the ‘end’ of short, each server will have a “hot spot” at the “end” of its copy of the table. I assume most requests will be for recently constructed short-urls, hence near the end of the table. You could use small servers (or VMs) since you won’t need much room for the buffer_pool.
Set 3 is not sharded; each server has the complete dataset. The modulo technique is purely for load balancing, and because most queries will be for ‘large’ (recent) values of short.
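That load balancing could be as simple as a stable hash modulo the replica count; a minimal sketch (the host names are hypothetical):

```python
import zlib

# Hypothetical set-3 replicas; each one holds the full table.
REPLICAS = ["replica-0:3306", "replica-1:3306", "replica-2:3306"]

def pick_replica(short: str) -> str:
    # Any stable hash works, since every replica can answer every lookup;
    # the modulo merely spreads the read traffic evenly.
    return REPLICAS[zlib.crc32(short.encode("ascii")) % len(REPLICAS)]
```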
Meanwhile, new entries are written to set 1 based on some bits in the md5. The problem with md5 (or any other one-way hash) is the randomness. That leads to jumping around a lot in the table. If you have enough servers in set 1 so that all the data can reside in the buffer_pools, most of the I/O is eliminated.
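A sketch of that write routing, assuming four set-1 shards (the shard count is arbitrary):

```python
import hashlib

N_SHARDS = 4  # assumed number of set-1 servers

def shard_for(url: str) -> int:
    # Take the leading byte of the md5 ("some bits in the md5")
    # and reduce it to a shard index.
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return digest[0] % N_SHARDS
```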
So, I have minimized I/O on both the writes and the reads. The guy in the middle (2) can probably be ENGINE=BLACKHOLE, since all it does is pass rows from each of set 1 to each of set 3. Blackhole would be more efficient than InnoDB.
Blackhole does not write the data at all (aside from replication logging in and out).
There are two generic problems I have not discussed:
A big challenge in sharding is how to change the sharding when you add a new server.
Another task not discussed is how to route queries (the one type of write and the one type of read) to the appropriate shard.