How are data lakes and polyglot persistence different?

Question:

I just came across a new technology called polyglot persistence. However, it sounds very much like a data lake. So, how are these two things different?

Answer:

The term “polyglot” (meaning “speaking or using several different languages”) was first applied to ICT by Neal Ford* in a 2006 blog post in which he coined the term “polyglot programming”. His point was that nowadays it’s not good enough to be an expert in (for example) C and C only – a competent software engineer must be able to move between several languages.

Interestingly, he also mentions the JVM (and/or .NET) as a kind of universal machine (which it sort of is), on which one would use different JVM languages for different tasks and then combine the bytecode into a single executable. We see this now, for example (an interest of mine), with Clojure and Java – you can combine the functional aspects of Clojure with the massive Java ecosystem (think of the huge number of libraries, for example) to get the best of both worlds. Scala and Kotlin are other languages with which this can be achieved.

This concept of platform diversity was extended to databases by Martin Fowler in around 2011, when he coined (on what he calls his “bliki” – blog/wiki) the term “polyglot persistence”. His point was that different functionalities within a system can be best served by different databases – as per this image:

[Image: Martin Fowler’s polyglot persistence diagram – a single application using different database types for different services]

With the current move away from software monoliths and the trend towards microservices – where each area of a system is a separate programme, possibly living in its own container and/or server – this model is becoming more and more pertinent. JSON and RESTful web services now provide a universal data interchange layer, so the impedance mismatch between different database systems is mitigated and not as important as it perhaps once was.

Now, as for data lakes, an excellent definition is available on Wikipedia (an excellent article, BTW – well worth reading in its entirety):

A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).

So, the relationship between polyglot persistence and data lakes is clear, but they are definitely not one and the same thing!

To my mind, one can think of a data lake as an instance of polyglot persistence (in much the same way as an object is an instance of a class). Polyglot persistence is a concept; a data lake is a materialisation of that concept in action.

Unsurprisingly, traditional vendors are fighting back against this trend by consolidating these database functionalities into their own products – Oracle, for example, has embraced JSON document storage in its flagship product (from here):

Oracle as a Document Store

Oracle Database 18c fully supports schemaless application development using the JSON data model. This allows for a hybrid development approach: all of the schema flexibility and speedy application development of NoSQL document stores, combined with all of the enterprise-ready features in Oracle Database 18c.
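
To make that concrete, here is a minimal sketch of the hybrid approach (the table and column names are my own invention for illustration; the IS JSON check constraint and the JSON_VALUE function are the relevant Oracle features, available since 12c):

    -- Store JSON documents in an ordinary Oracle table.
    -- The check constraint ensures only well-formed JSON is accepted.
    CREATE TABLE purchase_orders (
        id   NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        doc  CLOB CONSTRAINT po_doc_is_json CHECK (doc IS JSON)
    );

    -- No schema to declare for the document itself...
    INSERT INTO purchase_orders (doc)
    VALUES ('{"customer": "Smith", "total": 42.50}');

    -- ...but it remains queryable with plain SQL.
    SELECT JSON_VALUE(doc, '$.customer') AS customer
    FROM   purchase_orders;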

Oracle also has a separate Key-Value store – from here:

Oracle NoSQL Database is a solution for applications with the following characteristics:

- Produce and consume data at high volume and velocity
- Require instantaneous response time to match user expectations
- Developed with continuously evolving data models
- Scale on-demand based on the dynamic workloads

It is worth noting that Oracle’s flagship database and its NoSQL database are separate products. Interestingly, the most “holistic” approach to polyglot persistence appears to be taken by PostgreSQL: within a single database you can have a Key-Value store (hstore), JSON documents (jsonb) and, of course, normal relational tables – and you can perform SQL between and within these different storage types, as the sketch below shows.
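
By way of illustration, here is a minimal sketch (the table and its columns are invented for this answer) with all three storage styles side by side, queried in a single SQL statement:

    -- hstore ships with PostgreSQL but must be enabled per database
    CREATE EXTENSION IF NOT EXISTS hstore;

    CREATE TABLE product (
        id         serial PRIMARY KEY,   -- relational
        name       text NOT NULL,        -- relational
        attributes hstore,               -- key-value
        manifest   jsonb                 -- document
    );

    INSERT INTO product (name, attributes, manifest)
    VALUES ('widget',
            'colour => red, size => large',
            '{"origin": "DE", "parts": [{"sku": "A1", "qty": 2}]}');

    -- One query spanning all three storage types:
    SELECT name,                                         -- relational column
           attributes -> 'colour'             AS colour, -- hstore lookup
           manifest -> 'parts' -> 0 ->> 'sku' AS sku     -- jsonb path
    FROM   product
    WHERE  manifest ->> 'origin' = 'DE';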

Amazingly, PostgreSQL’s performance as a document store with JSON documents is approx. 2.5 times better than MongoDB’s, according to this benchmark. Admittedly, EnterpriseDB could be seen as biased, since they are PostgreSQL specialists, but they put the benchmark up on GitHub for anyone to test and I haven’t been able to find a refutation by MongoDB.

Finally, just a word of warning – while you may be able to do it, this might not necessarily be a good idea! I’ll leave the final word to 2ndQuadrant (PostgreSQL experts):

PostgreSQL has json support – but you shouldn’t use it for the great majority of what you’re doing. This goes for hstore too, and the new jsonb type. These types are useful tools where they’re needed, but should not be your first choice when modelling your data in PostgreSQL, as it’ll make querying and manipulating it harder.
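
To see what “harder” means in practice, compare an amount stored in a proper column with the same amount buried inside a document (both tables here are hypothetical):

    -- Relational: the type is declared once; comparisons and plain indexes just work.
    SELECT * FROM orders WHERE total > 100;

    -- Document: jsonb values come back as text, so every query must cast,
    -- and an ordinary index on the column will not help with this predicate –
    -- you would need an expression index such as the one below.
    SELECT * FROM orders_doc WHERE (doc ->> 'total')::numeric > 100;

    CREATE INDEX ON orders_doc (((doc ->> 'total')::numeric));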

One thing’s for sure, the database world will keep evolving and keep all of us in jobs for a long time to come! 🙂

p.s. +1 for an interesting question and welcome to the forum!

* Interestingly, he calls himself a “meme-wrangler” – I like it, kinda quirky
