One of the key parts of data governance is data architecture, which covers how data is structured, tracked and used across an organization. Why does this matter? Well, if you don’t have a clear view of exactly what data you hold, how can you possibly govern it?
During this section we will talk about:
- Policies (again) – a reminder about data usage guidelines
- Data models – logical and physical data models
- Data catalogs – what have we got?
- Data lineage / provenance
- Third-party system connectivity
- Certified data sources
Let’s start with policies. We’ve already covered them at length, so I won’t say much here, except to reinforce that we need a set of rules and policies in place telling users what they can and can’t do with the data we store. As we know, humans are the weak link in data security, so leaving them in no doubt about what is and isn’t allowed is vital.
Next, we are going to talk about data models. There are two types: logical and physical.
A physical data model shows the table schemas – that is, the column names, data types, domain restrictions (constraints), primary keys, foreign keys and relationships between tables.
The logical data model is a high-level version of the physical model. It shows the fields involved and the relationships between tables, but it doesn’t go into detail about data types or constraints.
The physical model describes the physical implementation of the tables. For example, it specifies that name is a string restricted to 100 characters. The logical model just shows how everything hangs together, without those implementation details.
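To make the distinction concrete, here is a minimal sketch using SQLite. The table and column names (customer, orders and so on) are hypothetical examples, not taken from any real system. The DDL is the physical model: types, lengths, keys and constraints. The dictionary beside it captures the logical model: just entities, fields and relationships.

```python
import sqlite3

# Physical model: spells out types, lengths, keys and constraints.
ddl = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,  -- string restricted to 100 characters
    email       VARCHAR(255) UNIQUE
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    placed_at   TIMESTAMP,
    FOREIGN KEY (customer_id) REFERENCES customer(customer_id)
);
"""

# Logical model: only entities, their fields and how they relate --
# no data types, lengths or constraints.
logical_model = {
    "customer": {"fields": ["customer_id", "name", "email"]},
    "orders": {
        "fields": ["order_id", "customer_id", "placed_at"],
        "relationships": [("customer_id", "customer")],
    },
}

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
```

Notice that both describe the same two tables; the physical model simply commits to implementation details the logical model deliberately leaves out.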
Data catalogs are just like the old-fashioned catalogs you might have received through the post for clothing. A data catalog lists all the data available in your systems, along with data types and detailed field descriptions.
When you’re a company with tens of systems, each with strange (and not very well thought out) field naming conventions, this sort of catalog is essential: it tells you what data you have and where to find it.
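A catalog entry can be as simple as a record per column. The sketch below is a toy in-memory version; the system names and cryptic column names (cst_nm, acct_holder) are invented to illustrate the problem a catalog solves, since users can search human-readable descriptions instead of guessing at field names.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    system: str        # which source system holds the data
    table: str
    column: str
    dtype: str
    description: str   # human-readable meaning of the field

# Two systems storing the same concept under different cryptic names.
catalog = [
    CatalogEntry("crm", "cust_tbl_v2", "cst_nm", "VARCHAR(100)", "Customer full name"),
    CatalogEntry("billing", "accounts", "acct_holder", "VARCHAR(100)", "Customer full name"),
]

def find(term: str) -> list[CatalogEntry]:
    """Search descriptions, so users can locate data despite the column names."""
    return [e for e in catalog if term.lower() in e.description.lower()]
```

Searching for "customer" finds both columns, even though neither column name contains the word.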
Next, we have data lineage, which is all about tracing what has happened to your data. Where did it come from? What happened to it once it landed in the system? Which ETL processes ran on it?
This is key functionality because it enables us, as data managers, to handle the data better. We know where it came from and what has happened to it, so we can be confident it was collected in line with regulations and that nothing has happened to affect its accuracy.
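One simple way to capture lineage is an append-only log where every ETL step records its inputs and outputs; tracing then just walks that log backwards. The step and dataset names below (ingest_web_forms, raw.signups and so on) are hypothetical.

```python
from datetime import datetime, timezone

# Append-only lineage log: each ETL step records what it read and what it wrote.
lineage = []

def record_step(step: str, inputs: list[str], outputs: list[str]) -> None:
    lineage.append({
        "step": step,
        "inputs": inputs,
        "outputs": outputs,
        "ran_at": datetime.now(timezone.utc).isoformat(),
    })

# Two hypothetical ETL steps.
record_step("ingest_web_forms", inputs=["web_form_api"], outputs=["raw.signups"])
record_step("clean_signups", inputs=["raw.signups"], outputs=["staging.signups"])

def trace(dataset: str):
    """Walk backwards through the log to find a dataset's original sources."""
    for entry in reversed(lineage):
        if dataset in entry["outputs"]:
            sources = []
            for src in entry["inputs"]:
                sources.extend(trace(src) or [src])  # recurse until a raw source
            return sources
    return None  # nothing produced this dataset: it is an original source
```

Calling trace on the final table walks back through the cleaning step and the ingest step to the original source, which is exactly the "where did it come from?" question lineage answers.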
Next, we need to define the rules around third-party system connectivity. This is particularly important in the world of self-service analytics. If you connect Tableau to your data source, how do you make sure the right people are seeing the right data? How will the connection be made? Is it a secure connection? All of this needs to be planned and mapped out before any action is taken.
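Those planning questions can be turned into an explicit checklist that a proposed connection must pass before it is approved. The sketch below is purely illustrative; the settings keys and allowed values are hypothetical and not any particular tool’s API.

```python
# Hypothetical connection policy for self-service BI tools.
POLICY = {
    "require_tls": True,
    "allowed_auth": {"sso", "service_account"},  # no shared passwords
    "allowed_schemas": {"reporting"},            # never the raw source schema
}

def approve_connection(settings: dict) -> list[str]:
    """Return a list of policy violations; an empty list means approved."""
    problems = []
    if POLICY["require_tls"] and not settings.get("tls"):
        problems.append("connection must use TLS")
    if settings.get("auth") not in POLICY["allowed_auth"]:
        problems.append("auth method not approved")
    if settings.get("schema") not in POLICY["allowed_schemas"]:
        problems.append("schema not approved for self-service access")
    return problems
```

A compliant request (TLS on, SSO auth, reporting schema) passes with no violations, while a direct unencrypted connection to a raw schema is rejected with a reason for each failure.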
We can also look to create certified data sources. This is where we create a curated data source for users to consume. Again, this is key in the world of self-service analytics, where we need to make sure people are looking at accurate data while restricting access to the underlying source data.
This lets users create their own insights without having access to data we don’t want them to see. It adds an extra layer of security while also enabling additional flexibility.
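In a database, one common way to implement a certified source is a view that exposes only governed, non-sensitive columns, with self-service users granted access to the view rather than the base table. A minimal SQLite sketch, with an invented customer table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id    INTEGER PRIMARY KEY,
    name           TEXT,
    email          TEXT,   -- sensitive: not for self-service users
    region         TEXT,
    lifetime_value REAL
);
INSERT INTO customer VALUES
    (1, 'Ada',   'ada@example.com',   'EMEA', 1200.0),
    (2, 'Grace', 'grace@example.com', 'AMER',  800.0);

-- The certified source: an aggregated view that never exposes
-- names or email addresses from the underlying table.
CREATE VIEW certified_customer_summary AS
SELECT region,
       COUNT(*)            AS customers,
       SUM(lifetime_value) AS total_value
FROM customer
GROUP BY region;
""")

rows = conn.execute("SELECT * FROM certified_customer_summary").fetchall()
```

Analysts query the view and get accurate, pre-aggregated numbers, while the raw rows, including the sensitive columns, stay behind the curtain.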