– Big Data Engineer at Customlytics
In the third and last part of the Customlytics data warehouse series, I like to finally talk about the daily routines of managing a data warehouse. Previously to this article, we encouraged you why to invest in a data warehouse, and we investigated the architecture of a data warehouse. In case you haven’t, we recommend to take the time and read the previous articles before continuing.
General aspects to consider when managing a data warehouse
Once you have set up the data warehouse and the data is coming in steadily, what are the aspects you should take care of? Here is a list of essential details that I think are crucial, and which will be put into perspective in the following.
- Monitoring – Enables you to inspect system performance and quickly spot if things are breaking down.
- Business Intelligence (BI) tool – make the collected data accessible and visible for everyone.
- Documentation – As part of good data governance, we suggest that you document your data warehousing.
- Integration of new data sources – think about how to integrate new data sources with minimal effort.
- Data discovery – with the increasing number of data sources and tables it becomes harder for everyone to find the right dataset for the right purpose so having the right tools in place is important.
Now, let’s put these aspects in order and perspective by digging deeper into the means of maintaining your data warehouse.
Daily routines of managing a BI tool
Often, the decision which BI tool to use in order to visualize data, dashboards, and reporting has already been taken alongside the decision to actually build a data warehouse. Most definitely someone on the team has gained experience with some of the BI software that is out there in the market and can recommend a tool according to features and price model.
So what aspects do you have to take into consideration on a daily basis for a BI tool?
First of all, user and user rights management. This means granting users access to the BI tool and making sure they have the correct rights according to their role in the company. Stakeholders in your company want to get the data and insights according to their needs. For instance, finance needs to see different dashboards as opposed to the sales department.
Secondly, depending on if you decided to go with an on-premise solution, regular updates have to be performed. Choosing a cloud solution for your Business Intelligence, however, saves you a lot of time and hassle because you don’t have to take care of updates.
Documentation is the foundation
The documentation is something you should start along with the creation of your data warehouse and should encompass design decisions, a technical description, database schemas and a description of all data processing pipelines. The document should include description and imagery at varying levels of detail for different types of teams and stakeholders. The main goal of good documentation is to ensure that everyone is headed in the same direction. Detailed documentation is necessary for colleagues working on or with the data warehouse. Whereas a more abstract form of documentation can be provided for other stakeholders in the company and presentations. But documentation is worthless if it isn’t kept up to date. So yes do it, it will save you a lot of trouble and time.
The next point I like to mention is monitoring. It needs to be done or otherwise, you realize at some point no data has been ingested. This can be very crucial for your business and money will be lost. That’s why keeping a close eye is important. This brings me to the point of what should be monitored:
- all data imports (batch or real-time ingest) (i.e. API or server errors)
- data processing job(s) (i.e. failed scheduled queries)
- velocity and volume of the data.
But sometimes not an error indicates that something is going wrong. Sometimes a drastic change in the amount of data you receive is already indicating a problem and requires you to take action. Consider the following example: a change was applied to an API you use and afterwards the amount of data drops, which might be a sign that you don’t receive the full data. So make sure the amount of ingested data is also monitored.
A situation you will encounter and should plan ahead is to integrate an additional data source. So what could be done to make the process easier?
Make a list of all the tables you plan to create or change. This list will serve you as a checklist for the Q&A part and updating the documentation. So don’t miss anything.
Also, spend some time to think about how to orchestrate the new data import to the existing one. There could be some constraints (timewise) from the tables and their data sources you plan to change.
I would recommend making a staged deployment. First, deploy the new data import and make sure it works. Afterwards, start to change the data pipeline. At last, don’t forget to keep the documentation up to date.
Last but not least I want to talk about data discovery, which will become an issue the longer the data warehouse exists and the more first and third-party data sources are added.
Examples for first-party data are Facebook or Google Adwords and a third party example is the MaxMinds IP2Geo database, which maps IPs to geolocation.
If you have a well-documented data warehouse you are halfway through it. The information provided in your database just needs to be accessible to everyone in a way that they can find the right dataset for the question, they are looking to answer.
Let’s recap at the end and summaries it. With having built a data warehouse the work isn’t done. The biggest part is about to come, maintaining the data warehouse. Documentation is important and shouldn’t be neglected because it serves as a starting point for other tasks (like data discovery or integrating additional data sources).
Making the move to a data warehouse strategy? We here at Customlytics cover all app marketing and analytics topics. We’re here to help if you need actionable tips for your data warehousing strategy. Drop us a line via email [email protected]