Database choices for big data

Posted on

Question :

I have many text files, their total size is about 300GB ~ 400GB. They are all in this format

key1 value_a
key1 value_b
key1 value_c
key2 value_d
key3 value_e

each line is composed by a key and a value. I want to create a database which can let me query all value of a key. For example, when I query key1, value_a, value_b and value_c are returned.

First of all, inserting all these files into the database is a big problem. I try to insert a few GBs size chunk to MySQL MyISAM table with LOAD DATA INFILE syntax. But it appears MySQL can’t utilize multicores for inserting data. It’s as slow as hell. So, I think MySQL is not a good choice here for so many records.

Also, I need to update or recreate the database periodically, weekly, or even daily if possible, therefore, insertion speed is important for me.

It’s not possible for a single node to do the computing and insertion efficiently, to be efficient, I think it’s better to perform the insertion in different nodes parallely.

For example,

node1 -> compute and store 0-99999.txt
node2 -> compute and store 10000-199999.txt
node3 -> compute and store 20000-299999.txt

So, here comes the first criteria.

Criteria 1. Fast insertion speed in distributed batch manner.

Then, as you can see in the text file example, it’s better to provide multiple same key to different values. Just like key1 maps to value_a/value_b/value_c in the example.

Criteria 2. Multiple keys are allowed

Then, I will need to query keys in the database. No relational or complex join query is required, all I need is simple key/value querying. The important part is that multiple key to same value

Criteria 3. Simple and fast key value querying.

I know there are HBase/Cassandra/MongoDB/Redis…. and so on, but I’m not familiar with all of them, not sure which one fits my needs. So, the question is – what database to use? If none of them fits my needs, I even plan to build my own, but it takes efforts :/


Answer :

I have been in a similar situation and highly recommend looking into MongoDB or HBase for your needs. The others (Cassandra/Redis) are good too, but HBase and MongoDB have been battle-tested in large environments, have a large community, good documentation (MongoDB’s is better) and generally fulfill the three requirements you are looking for.

  • How big are the keys?
  • How big are the values?
  • Will the data arrive “continually”, or is this a one-time load?
  • Let’s see all the representative queries. You have mentioned only

SELECT val FROM tbl WHERE key = 'constant';

Are there others ?

In general key-value stores suck big time. But maybe we can make your special case work not-too-badly.

Leave a Reply

Your email address will not be published. Required fields are marked *