Introduction
Bud allows you to extract your data from our data warehouse into your own organisational data warehouse. This is known as an ETL (Extract-Transform-Load) Integration. ETL tools include Azure Data Factory, SQL Server Integration Services and Talend; in fact, Bud supports any tool that can connect to an Azure SQL Server.
How to: Set Up an ETL Integration
In supporting such integrations, it’s important that they run as seamlessly and with as little friction as possible, so we have developed the following processes and recommendations:
Set Up
The first step is to set up a ‘system’ login for your developer to use. This is distinct from the named-person logins we would normally assign, so your ETL doesn’t immediately stop working if the developer leaves your organisation. Depending on your password management protocol, it would still be best practice to request a reset of the system password should a key person leave.
- Send an email to datawarehouseaccess@bud.co.uk asking specifically for ‘ETL’ logins. You will likely want at least two logins – one for your test tenancy to develop against, and one for your live tenancy. Please remember to request both, or alternatively ask for the test login initially and the live login when you’re ready. The usernames will typically be in the format ETL_TrainingProviderName_(Test).
- Don’t forget to include in your request the IP address(es) of your server and/or development machine so that we can add these to our whitelist.
Documentation
The Data Warehouse is based around a Kimball model and, as such, comprises various Fact and Dimension tables (Fact tables are suffixed with ‘Links’).
We have not been too strict in our Kimball interpretation; where appropriate, we have added some tables by snowflaking off one of the dimensions rather than creating a whole new Fact table.
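To make that shape concrete, a typical extract joins a ‘Links’ fact table to a dimension, and from there to a snowflaked table. The sketch below is illustrative only: every table and column name in it is hypothetical, and the real names should be taken from the model documentation and Data Dictionary described below.

```sql
-- Hypothetical names throughout; substitute the real tables and keys
-- from the model documentation and the Data Dictionary.
SELECT
    f.EnrolmentKey,
    d.LearnerName,
    s.PostCode                                  -- from a snowflaked table
FROM Presentation.EnrolmentLinks AS f           -- fact table ('Links' suffix)
JOIN Presentation.DimLearner AS d
    ON d.LearnerKey = f.LearnerKey              -- fact-to-dimension join
JOIN Presentation.DimLearnerAddress AS s
    ON s.LearnerKey = d.LearnerKey;             -- dimension-to-snowflake join
```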
- The full model of the Data Warehouse is documented in a Power BI Template file which is available from support; please request a copy of Supermodel.pbit.
Furthermore, once you have access, the Data Dictionary can be found in the view Presentation.DWH_DataDictionary. This view is self-documenting: it is updated automatically as new fields are added.
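For example, you can browse the dictionary from any SQL client connected with your ETL login. A SELECT * is acceptable here because this is an ad-hoc query rather than ETL code:

```sql
-- Ad-hoc exploration of the self-documenting data dictionary.
-- SELECT * is fine here; this is a one-off query, not part of a
-- scheduled ETL process.
SELECT *
FROM Presentation.DWH_DataDictionary;
```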
Refresh Policies
As the data warehouse is a shared resource, we have a few recommendations that will help with balancing the load on the server throughout the day.
- Schedule your extract process to pull data during the second half of the hour. Our data warehouse refreshes from the front end during the first half of the hour (e.g. between 08:00 and 08:30). Pulling data between 08:30 and 09:00 avoids contention with the refresh itself, whilst keeping latency to a minimum for changes in the front end to replicate through to your own system.
- Avoid ‘round’ times. Depending on your ETL orchestrator, if possible, schedule the extract to start at an arbitrary time in the second half of the hour (e.g. 08:34 or 08:41); see the sketch after this list for one way to do this. It will run more slowly if everyone hits it bang on 08:30!
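How you achieve this depends entirely on your orchestrator. As one sketch, if your extract happens to run as a SQL Server Agent job on your own SQL Server (the Agent is not available on Azure SQL Database itself), an hourly schedule anchored at an off-round time might look like the following; the job and schedule names are placeholders:

```sql
USE msdb;
GO
-- Sketch only: assumes a pre-existing Agent job called 'Bud ETL Extract'.
EXEC dbo.sp_add_schedule
    @schedule_name        = N'BudExtractOffRound',
    @freq_type            = 4,       -- daily
    @freq_interval        = 1,       -- every day
    @freq_subday_type     = 8,       -- repeat on an hourly basis
    @freq_subday_interval = 1,       -- every 1 hour
    @active_start_time    = 083400;  -- first run at 08:34, then 09:34, 10:34...
GO
EXEC dbo.sp_attach_schedule
    @job_name      = N'Bud ETL Extract',
    @schedule_name = N'BudExtractOffRound';
GO
```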
General Principles
We have a little more guidance and advice to help manage changes to the data warehouse.
- Avoid using the SELECT * construct in your ETL queries. Whilst fine for ad-hoc queries, please avoid this anti-pattern in your ETL code: the shape of your extract changes the moment we add a new field to the respective view, which will typically break the load into your own fixed-schema tables. This means you would have to react immediately to any change we make, or else somehow align your development work precisely with Bud’s release cycle.
Best practice is to explicitly specify the list of fields in the SELECT statement, as in the sketch below. In this way, your developers can incorporate new fields into your dataset at any point after our release.
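For instance, with a hypothetical Presentation.EnrolmentLinks view (names for illustration only; take the real ones from the Data Dictionary):

```sql
-- Brittle: the extract's shape changes the moment Bud adds a field,
-- typically breaking the load into a fixed-schema staging table.
SELECT * FROM Presentation.EnrolmentLinks;

-- Robust: new fields are simply ignored until you choose to add them.
SELECT
    EnrolmentKey,
    LearnerKey,
    StartDate,
    ExpectedEndDate
FROM Presentation.EnrolmentLinks;
```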