Data Warehouse - Why the Notes Fields are Capped

Introduction

We occasionally receive queries from clients asking why we limit the length of text fields in the Data Warehouse rather than bringing in the full text of, say, the notes fields that are available in the front end.  This article explains the reasoning behind this.

What the Data Warehouse is, and what it isn’t

Bud’s Data Warehouse is an Online Analytical Processing (OLAP) system, designed to perform multi-dimensional analysis at high speeds on large volumes of data.  An OLAP database is intended for data mining, business intelligence and reporting, and for conducting complex data analysis to inform smarter decision-making.

To this end, the purpose of a Data Warehouse is to provide a set of quantitative measures that can be aggregated and visualised, as opposed to qualitative narrative text on which little meaningful analysis can be performed.

Technical Rationale

The Bud Data Warehouse uses a multi-dimensional model based on the Kimball methodology.

Were we Kimball purists, we would not include any notes fields in the Data Warehouse at all.  However, we recognise that some clients find it useful to have notes fields available there, so we have reached a compromise: notes fields are included, but with a cap on their length.  This cap is typically 250 or 500 characters.
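
As a purely illustrative sketch (not Bud's actual pipeline code), the snippet below shows how such a cap might be applied as a notes field is loaded; the cap_notes helper and the 500-character limit shown here are hypothetical examples:

    # Illustrative sketch only - not Bud's actual ETL code.
    # Caps a notes field at a fixed number of characters before loading.

    NOTES_CAP = 500  # hypothetical; the article states the cap is typically 250 or 500

    def cap_notes(text, limit=NOTES_CAP):
        """Return the notes text truncated to at most `limit` characters."""
        if text is None:
            return None
        return text[:limit]

    print(len(cap_notes("x" * 2000)))  # 500 - anything beyond the cap is dropped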

The cap is in place to ensure acceptable performance, both in terms of the refresh cycle time and how long it takes clients to download data from the Data Warehouse.  The most powerful argument for a cap, however, lies in the Power BI tool itself, and an explanation of this is given below.

Power BI and VertiPaq

Under the hood, Power BI uses an in-memory, column-store technology known as the VertiPaq storage engine.  This engine compresses data column by column, as opposed to row by row in a traditional row-store database.  The compression this achieves is high enough that an entire model can be held in memory, even on a moderately specified laptop.
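
To make the row-store versus column-store distinction concrete, here is a small conceptual sketch in plain Python; it is an analogy for the layout only, not how the VertiPaq engine actually stores data:

    # Row-store: each record is kept together, so column values are interleaved.
    row_store = [
        {"city": "London",  "sales": 120},
        {"city": "Bristol", "sales": 95},
        {"city": "London",  "sales": 87},
    ]

    # Column-store: each column is kept together, so repeated and similar
    # values sit next to each other and compress far better.
    column_store = {
        "city":  ["London", "Bristol", "London"],
        "sales": [120, 95, 87],
    }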

One of the techniques used to achieve this compression, and the one applied to text columns, is known as dictionary encoding.  Essentially, it maps each distinct value of a string field to a simple integer ID, e.g.

String Value           Integer ID
London                 1
Bristol                2
Liverpool              3
Newcastle-upon-Tyne    4
Cardiff                5
Manchester             6
Glasgow                7

If we have only seven distinct values for the city field but one million rows, you can see that we need to hold just one byte per record to represent the city.  Where a field has a small number of distinct string values (low cardinality), this compresses extremely well, allowing us to hold those million values in a small amount of memory.
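
The sketch below illustrates the idea of dictionary encoding in Python; it is a simplified analogy rather than the actual VertiPaq implementation:

    # Each distinct string is assigned a small integer ID, and the column
    # as stored holds only those IDs.

    cities = ["London", "Bristol", "London", "Liverpool", "London", "Bristol"]

    dictionary = {}       # distinct string value -> integer ID
    encoded_column = []   # the column as stored: integer IDs only

    for city in cities:
        if city not in dictionary:
            dictionary[city] = len(dictionary) + 1
        encoded_column.append(dictionary[city])

    print(dictionary)       # {'London': 1, 'Bristol': 2, 'Liverpool': 3}
    print(encoded_column)   # [1, 2, 1, 3, 1, 2]

With only seven distinct cities, the dictionary stays tiny no matter how many rows are encoded.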

A lookup table like the one illustrated above is known as a dictionary.  Now consider a million rows with a million different values for the city (high cardinality): the dictionary rapidly becomes very large.  Compound this by allowing very long strings, and the dictionary itself takes up a significant proportion of the available memory, increasing the size of the model and degrading the performance of the visualisations.  In the extreme case, the model cannot be fully loaded into memory and the reports fail to render at all.
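
A rough back-of-envelope comparison makes the scale of the problem clear; the figures below are illustrative assumptions, not measurements from a real model:

    rows = 1_000_000

    # Low cardinality: 7 distinct city names of roughly 10 characters each.
    low_cardinality_dictionary = 7 * 10          # ~70 bytes of distinct text

    # High cardinality: every row carries a unique 500-character note.
    high_cardinality_dictionary = rows * 500     # ~500,000,000 bytes (~0.5 GB)

    print(low_cardinality_dictionary)    # 70
    print(high_cardinality_dictionary)   # 500000000

Even before any other overhead, the high-cardinality dictionary alone approaches half a gigabyte of text that must be held in memory alongside the rest of the model.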

In summary, Power BI is optimised for low-cardinality fields and will perform poorly when a column holds a multitude of very long, distinct string values.

For further reading, here are some links to useful articles: