CDOT - New Database Schema, New Queries, Documentation Finished, Marketing the Project
April 05, 2017
I apologize to anyone who follows this blog for the lack of updates. I had technical problems posting on WordPress but it seems to be working now.
A lot has been done in the last few weeks. There are a few highlights of our work that I think are worth talking about.
New Database Schema
Previously, we were storing the information we collected in multiple "collections". MongoDB collections are very similar in concept to SQL tables. They're a logical way of separating documents that have clearly different roles. We had collections for:
- Hits - when someone visits a page
- Sessions - when someone's browser begins a session for a bunch of visits
- Resources - the data associated with the web pages the site running Rutilus has (ex. articles)
All of these collections have now been merged into one called "Visits". So where before we had one resource document associated with many session documents, and one session document associated with many hit documents, it's all flat now. For every single "visit", the data for its session and the data for its resource is repeated.
There were advantages to what we were doing before. The style made a lot of sense in the world of relational databases, where you want to be efficient with storage space. If a chunk of data needs 1 MB to store it, why repeat that a million times? Just have a reference to it in those million things that are associated with it. In fact, MongoDB has the $lookup aggregation operator, to let you mimic doing SQL joins and fetch that referenced document. Unfortunately, we can't even use $lookup, because it isn't supported when you shard your database (scale it horizontally), and we don't know if MongoDB without sharding will be enough for every Rutilus user in the world.
We ran into serious problems with our approach breaking it down into multiple collections. The problems began when we had to do joins. Since it can't be part of the query, which would let MongoDB's process do all the heavy lifting and math and just send us the result, we had to do multiple MongoDB queries, fetching each associated document, and working with that data in the Rutilus processes to prepare it to the final query result. This simply used too much memory. If the database stores more than 1 GB of data, and they want to look at it all, that means storing more than 1 GB in RAM during that query. Only using a 1 GB RAM AWS instance? Boom. It crashes. So storing the data in a way that doesn't require joins in the Rutilus processes lets us be way more efficient with our RAM usage. The MongoDB process handles it, and MongoDB's team are experts at that. We should take advantage of that expertise. At the end of the day, repeating our data in each Visit document is okay, because we don't need to go back and update it, and storage space, compared to RAM, is cheap. We're okay with 10x the storage space being used if it means this thing actually runs.
This was a particularly exciting development. Thanks to working with the industry partner, Engineering.com, over the past few months to see how they liked what we were making, we got feedback over time, and worked closer and closer to a set of queries the Dashboard tool can have to be truly useful. The queries have really settled into a well-defined form that should prove to be simple but customizable enough to provide any view needed into the data collected.
Here's how they work:
Resources are important, and visitors are important. People using Rutilus want to know how well a particular resource performs, and how engaged a particular visitor is. So there are two queries for this:
- Resource orientation - Displays calculated values for each property of a Visit we collect for each resource. Ex. the total number of Facebook shares it has among all its visits, the average number of Facebook shares it has per visit, how long people were on the page, etc.
- Visitor orientation - Displays calculated values for each property of a Visit we collect for each visitor. Ex. the total number of Facebook shares that visitor performed across all their visits, how many visits they performed, etc.
This is focused and useful for analyzing the data. It becomes flexible when you add in the "filters" we created for our older queries which allow you to add a "$match" operator (like an SQL WHERE clause) to our MongoDB queries and restrict the data examined. For example, they can only consider logged in visitors (which we call users). That way they can compare the engagement of visitors vs users. They can add a filter for date, restricting to a certain 3 month period, to allow them to break these queries down into quarters or any other time period, turning it into a powerful reporting tool.
The queries are working and documentation will be created soon to reflect these changes to the Dashboard tool.
We kept the original custom queries (like displaying timelines for visit actions, and graphs and charts for scrolling down a page, etc), and the export queries for when they just want to look at the raw data. The raw data is streamed out of the database, so that it doesn't have to all sit in the Rutilus Analytics module process and cause out of memory issues.
At this point, the documentation is in a finished state. Every topic is covered. I'm still taking complete responsibility for the documentation aspect of the project while my team mate focuses on coding these last minute features and searching for bugs, so my work will continue over the last few weeks to improve the documentation with more examples, proofread it, and make any changes needed thanks to the new features.
Marketing the Project
The time has come to share Rutilus with the world. After speaking with other CDOT teams to get some advice about their own open sourcing efforts, I created a MailChimp mailing list for the software, so that we can collect subscribers and mail out project updates, for example for big releases. I also created a Google Group for the software to act as a discussion forum. It should serve as a good area for us to help new users sift through any technical issues they have trying it out.
I then carefully read the rules around self-promotion on Reddit and posted in a few relevant subreddits (webdev, analytics, etc) about the software to spread the word. I threw Google Analytics onto our documentation site (ironic, I know) and I do see activity as people come check it out, so time will tell how this does in getting people interested.
Note: This was originally posted on the blog I used for my co-op term while at Seneca College (mswelke.wordpress.com) before being imported here.