Which database technologies to consider for a data search application?

Question:

I’m designing an application that only runs read queries against some database tables (it’s a kind of search application). The queried data is updated at night by the system, so write speed matters less (though it still matters), while application users only run fulltext searches over the stored data.
Retrieval performance is therefore a key requirement of the application.

Can you suggest links, search terms, or other material I can read to properly design the parts of the database involved in user searches?
I already know the basics of indexes and of checking queries with EXPLAIN, and I’m reading up on table partitioning to see whether it can help.

The application will run on MySQL 5.6.
So far its built-in FULLTEXT search is enough for our requirements; I’m looking more for hints on how to structure the database so that I don’t add unnecessary slowdowns (by leveraging the fact that no user query updates the data).
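For reference, a minimal sketch of MySQL 5.6’s built-in fulltext search (the table and column names here are made up for illustration; InnoDB supports FULLTEXT indexes as of 5.6):

```sql
-- Hypothetical articles table with a fulltext index on two columns.
CREATE TABLE articles (
  id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(200),
  body  TEXT,
  FULLTEXT KEY ft_title_body (title, body)
) ENGINE=InnoDB;

-- Natural-language fulltext search over the indexed columns,
-- ranked by relevance score.
SELECT id, title,
       MATCH(title, body) AGAINST('database performance') AS score
FROM articles
WHERE MATCH(title, body) AGAINST('database performance')
ORDER BY score DESC
LIMIT 20;
```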

What other database technologies should I look into?

Answer:

Fulltext search is a specialized technology that complements a traditional SQL database. MySQL has a simple full-text indexing feature built in, but you will probably find alternatives that are more fully featured and perform better.

I did a presentation comparing them here: Full Text Search Throwdown.

The short answer: Sphinx Search.

Re your comment:

It seems your question is not about a specific solution but just asking for categories of performance optimizations.

  • Indexing. My presentation How to Design Indexes, Really, which I’ve given several times. Here’s a recording of me presenting a version of this talk: Tools and Techniques for Index Design.
  • Partitioning. Partitions Performance with MySQL 5.1 and 5.5 by Giuseppe Maxia. This was the one that made partitioning click for me. Partitioning is very powerful when it works, but it applies only in a small number of cases.
  • Caching. What’s the fastest SQL query? The one you don’t have to run at all, because you can fetch a copy of the data from a high-speed cache. This requires more case-by-case application coding to decide on the right data to cache, what format to store it in, and how to refresh it. Lots of work, but can speed up certain data access by orders of magnitude.
  • Denormalization. Although denormalization creates opportunities for data anomalies, sometimes it’s the best choice to optimize for a specific query. Beware though, because denormalization degrades performance for all other queries against the same data.
  • Sharding. Useful when your data or query traffic grows so large that a single database instance can no longer process enough queries per second. Split your data horizontally, storing a subset on each shard instance. Add code to your application to decide how to send your queries to the appropriate shard. Sharding is used by MySQL Cluster, MySQL Fabric (which is still pretty alpha as of this writing), and Shard-Query.
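To make the partitioning point concrete, here is a sketch of RANGE partitioning by date, which fits a nightly-load workload; the table and column names are illustrative:

```sql
-- Hypothetical table partitioned by year. Queries that filter on
-- created_at can then be pruned to the relevant partitions
-- (verify with EXPLAIN PARTITIONS on MySQL 5.6).
CREATE TABLE search_items (
  id         INT UNSIGNED NOT NULL,
  created_at DATE NOT NULL,
  payload    TEXT,
  -- the partitioning column must be part of every unique key
  PRIMARY KEY (id, created_at)
) ENGINE=InnoDB
PARTITION BY RANGE (YEAR(created_at)) (
  PARTITION p2013 VALUES LESS THAN (2014),
  PARTITION p2014 VALUES LESS THAN (2015),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```

One caveat relevant to this question: MySQL does not support FULLTEXT indexes on partitioned tables, so partitioning can only be applied to the parts of the schema that users don’t search with fulltext queries.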
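The caching and denormalization points often combine well in a read-only workload like this one: precompute expensive aggregates during the nightly load instead of on every user query. A sketch, assuming a hypothetical items table with a category column:

```sql
-- Illustrative denormalized summary table, rebuilt during the
-- nightly load so user queries never run the GROUP BY themselves.
CREATE TABLE category_counts (
  category   VARCHAR(50) PRIMARY KEY,
  item_count INT UNSIGNED NOT NULL
);

-- Refresh step in the nightly batch: replace each category's count.
REPLACE INTO category_counts (category, item_count)
SELECT category, COUNT(*)
FROM items
GROUP BY category;
```

Because your users never write, the summary can only go stale between nightly loads, which sidesteps the usual cache-invalidation problem.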

Finally, the advice I often give about optimizing is that every method listed above optimizes for one query at the expense of others. For example, sharding is great if you read and write individual rows, but sharding makes things worse if you query for ranges of rows that span multiple shards.
