Today we decided that it would be a good idea to be ambitious and go for my original "plan A" front end API. This is the one that would dynamically read the schema as the user builds queries using a GUI to guide them along and create any MongoDB query possible. Investigating this route is a good idea because it would provide with a tool that would stand the test of time. It's a much better long term solution than a tool that only has a handful of queries we've created today based on the current database schema.

The main challenge with creating this front end API is learning what MongoDB code would map to the things the user would click on and the things the user would type in. MongoDB calls this their "Aggregation Framework". Because MongoDB lacks many of the features that relational databases have (like joins), they have provided their own ways of doing complex analytical queries. For example:

  • SQL has "SELECT" to restrict what you get back from what you match. MongoDB has "$project" for this.
  • SQL has "WHERE" to restrict what you're matching. MongoDB has "$match" for this.
  • SQL has "INNER JOIN" to allow to relate a dataset with another in a normalized system ("show me the users who visited at least 5 times today"). MongoDB can embed the hits into the users in this example with "$lookup".

From above, it's clear that it's kind of simple how things line up... but the complexity comes from arranging the MongoDB query the right way. It's arranged differently compared to the SQL queries we're used to writing. And even though we can use $lookup in place of some joins, memory usage is a concern. MongoDB has a "working set" which is the data it's looking at at a particular moment during the execution of a query. In the above example, finding hits and users and relating them, all of the hits and all of the users might have to be held in memory. With the kind of scale we have, this may be a problem. Additionally, you can't shard your database when you use $lookup, which is key to use scaling our solution so that it suits long term.

We have more research to do to make sure what we do will be a good, long term solution, and that we even have time to implement this complex querying and data analysis tool.