I am developing an application that has to handle tables that do not necessarily share a uniform structure (e.g. different/extra columns). These unstructured files also need to be grouped (say five tables together make a group, each table with different content). And each file in a group needs to be versioned, preferably in a simple and efficient way (e.g. no need to copy the entire file when only a row changes).
Current solution I thought of:
- Define a schema with some standard fields that the application needs, such as: name of the file group, ..., timestamps, path_to_repository, hash_of_repository_content (to invalidate stale checkouts), status (e.g. being_updated, for concurrency).
- For path_to_repository, I’ll have a git repository and keep the files in CSV format (creating a submodule for each file, so that every file in a file group can be versioned independently).
- In order to access those files, I’ll check out the repository, and HEAD will be the current version of the files. This makes versioning easy, lets me keep tables of unlimited size without any fixed structure, and avoids MongoDB’s maximum document size. If there is any further information to be versioned, I can simply add it to the repository.
- To handle concurrency, I’ll update a flag in MongoDB and let it take care of the locking for me. If two applications try to write to a repository simultaneously, each must set a flag/lock first; if the flag has already been set by someone else, it has to wait.
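To make the flag idea concrete, here is a minimal Python sketch of the metadata document and the lock acquisition. Only the field names from the question (path_to_repository, hash_of_repository_content, status) are taken from the post; group_name, created_at/updated_at, the "idle" status value, and the helper names are my own assumptions. The key point is that the check-and-set must be a single atomic server-side operation (e.g. pymongo's find_one_and_update), not a separate read followed by a write:

```python
from datetime import datetime, timezone

def make_group_metadata(group_name, repo_path, content_hash):
    """Build the metadata document sketched above. Field names beyond
    those in the question (group_name, created_at, updated_at) are
    illustrative assumptions."""
    now = datetime.now(timezone.utc)
    return {
        "group_name": group_name,
        "path_to_repository": repo_path,
        "hash_of_repository_content": content_hash,
        "status": "idle",  # e.g. "idle" | "being_updated"
        "created_at": now,
        "updated_at": now,
    }

def acquire_lock_query(group_name):
    """Filter/update pair for an atomic lock. The filter only matches
    when no other writer holds the lock, so the compare-and-set happens
    server-side in one operation and two writers cannot both succeed."""
    query = {"group_name": group_name,
             "status": {"$ne": "being_updated"}}
    update = {"$set": {"status": "being_updated",
                       "updated_at": datetime.now(timezone.utc)}}
    return query, update

# With pymongo this would be used roughly as (sketch, not executed here):
#   doc = coll.find_one_and_update(*acquire_lock_query("group-a"))
#   if doc is None:  # another writer holds the lock -> wait and retry
#       ...
```

Note that a plain flag gives you no crash recovery: if a writer dies while holding the lock, the flag stays set forever, so in practice you would also want a lock timestamp and a timeout.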
I am not sure whether this is good practice (am I re-inventing the wheel? will it perform well? can it scale? are there problems I don’t see?). Any advice/experience/criticism would be great.
Probably irrelevant extra information:
this is going to be used by nodejs/mongoose and also by a python script.
The decision about what to put in the document is pretty much determined by how the data is used by the application.
Data that is used together, as in a user document, is a good candidate to be pre-joined or embedded.
One of the limitations of this approach is document size: a document can be at most 16 MB.
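A minimal sketch of what embedding looks like (the collection and field names are illustrative, not from your application): the child records live inside the parent document, so one read fetches everything, at the cost of the 16 MB cap on the whole document.

```python
import json

# A "pre-joined" user document: the orders are embedded in the user
# instead of living in a separate orders collection.
user_doc = {
    "_id": "u1",
    "name": "Alice",
    "orders": [  # embedded array; grows with the user's history
        {"order_id": "o1", "total": 19.99},
        {"order_id": "o2", "total": 5.00},
    ],
}

def doc_too_large(doc, limit_bytes=16 * 1024 * 1024):
    """Rough size guard. BSON sizes differ from JSON sizes, but this
    illustrates why an unbounded embedded array eventually hits the
    16 MB document limit."""
    return len(json.dumps(doc, default=str).encode()) > limit_bytes
```

This is why embedding suits bounded, tightly-coupled data; an array that grows without limit (like a full revision history) is exactly the case where it breaks down.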
Another approach is to split data between multiple collections.
One of the limitations of this approach is that MongoDB has no constraints, so there are no foreign-key constraints either. The database does not guarantee consistency of the data; it is up to you as a programmer to make sure your data has no orphans.
Data from multiple collections can be joined with the $lookup aggregation operator. But each collection is a separate file on disk, so joining across collections means seeking across multiple files, and that is, as you can probably guess, slow.
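For reference, such a join is written as an aggregation pipeline. The $lookup stage with from/localField/foreignField/as is standard MongoDB syntax; the collection and field names below are illustrative assumptions. The pipeline is shown here as a plain Python structure, as a pymongo script would build it:

```python
# Aggregation pipeline joining each user with its orders via $lookup.
# "users"/"orders" and the field names are hypothetical; the $lookup
# operator shape itself is standard MongoDB.
pipeline = [
    {"$lookup": {
        "from": "orders",           # the other collection (a separate file on disk)
        "localField": "_id",        # join key on the users side
        "foreignField": "user_id",  # join key on the orders side
        "as": "orders",             # matched documents land in this array
    }},
    {"$match": {"orders": {"$ne": []}}},  # keep only users that have orders
]

# With pymongo this would run as (sketch, not executed here):
#   results = db.users.aggregate(pipeline)
```

Each document coming out of the $lookup stage carries an embedded "orders" array, which is exactly the pre-joined shape you would get for free by embedding in the first place.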
Generally speaking, embedded data is the preferable approach.