The first step in writing a data strategy is exploration and investigation. It’s really important to fully understand your starting point before trying to define what needs to be done.
The first output from this assessment should be a data inventory. This helps you to identify the data you hold, the platforms in which it resides and many of the controls around the data that are already in place. An inventory may include some of the fields below:
Data Source Name | What is the name (database table, file system directory, etc.) that uniquely identifies this datasource? |
Data Source Platform | What platform is your data stored in (MySQL, HDFS, etc.)? |
Datasource Users | Is this customer-facing or internal data? |
Entities Defined | What entities are described in your datasource? For example: customers, orders, transactions. |
Aggregated dataset | Is the dataset aggregated, ready for reporting? |
Cleaned dataset | Is the dataset cleaned & prepared for end-user usage? |
Risk: PII | Is PII included in the dataset? If so, which fields? |
Risk: PII Purpose | Why is this PII data required? What is the business justification? |
Risk: PII mitigation | Is the data encrypted? Hashed? |
Access Requirements | Who should have access to the data? |
Data Type | Master data: describes objects such as employees, customers, locations, office addresses, organizations and products. Unstructured data: data without a fixed structure, such as bodies of text, social comments, images, videos and documents. Transactional data: describes an event and usually has a timestamp associated with it, for example a sale, an invoice, a return or an activity (e.g. the time you entered the gym when you swiped your membership card). Metadata: data about your data, such as report definitions, database column descriptions and config files. Hierarchical data: describes objects with a hierarchical structure, for example a family tree or an org structure. In a cellular network, you may have a hierarchical model for family plans, where the children are associated with their parent’s account. Reference data: typically static data, such as timezone data or location information. For example, you may have a static list of all your retail store locations, which you could use to support ongoing analysis. |
Data Volume | How large is the dataset? |
Data Value | We need to understand how valuable this data is for the business. Does it provide us with insight to support cost saving initiatives? Does it help us drive more revenue? |
Data Quality Score (1 (low) to 10 (high)) | Undertake an assessment to determine how reliable the data is. How confident are we in insights derived from it? |
Known Quality Issues | What issues led to a lower score? |
Data Quality Monitoring | How is data quality monitored? |
Data Quality auditing | How is data quality audited? |
Data Variability | How quickly does the data change over time? If the format of data changes frequently, we will likely need more rigorous data quality monitoring in place. |
Datasource Metadata | Is the datasource properly described (field descriptions, etc.)? Link to the documentation. |
Datasource Lineage | Link to the document that describes how this data has been transformed into its current form. |
Datasource retention period | How long is data retained for? What policy underpins this? |
Datasource owner | Who owns the datasource? |
Datasource Steward | Who has stewardship responsibilities on the data? |
Access management process | Describe how access is provisioned, how it is approved and how it is periodically reviewed. |
Data re-Use policy | How is this data intended to be used? Should it be a self-serve dataset? |
An example might be:
Data Source Name | product.customers_details |
Data Source Platform | MySQL |
Datasource Users | Internal |
Entities Defined | Customers |
Aggregated dataset | No |
Cleaned dataset | No |
Risk: PII | Yes – Phone, Email, Age, Gender |
Risk: PII Purpose | Required to enhance usage of website, advertising products suitable for age & gender |
Risk: PII mitigation | Yes |
Access Requirements | Marketing (where customer has opted in) |
Data Type | Master Data |
Data Volume | 10GB |
Data Value | Drives significant revenue; reduces marketing costs through targeted marketing. |
Data Quality Score (1 (low) to 10 (high)) | 10 |
Known Quality Issues | No known issues in data quality |
Data Quality Monitoring | Data ingestion process has email and phone validation |
Data Quality auditing | No process at present |
Data Variability | The data is consistently formatted |
Datasource Metadata | Link to metadata here. |
Datasource Lineage | Link to lineage description here. |
Datasource retention period | Until account closure request by customer |
Datasource owner | Head of Marketing |
Datasource Steward | Marketing Analytics Analyst |
Access management process | Link to access management process here. |
Data re-Use policy | This dataset should not be used for purposes outside of marketing without further approval and review. |
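If you want to keep the inventory in code rather than in a document or spreadsheet, one row could be modelled roughly as below. This is only a minimal sketch: the class name `InventoryEntry`, its attribute names and the choice of which fields to include are illustrative assumptions, not part of the template above. The values are taken from the example row.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class InventoryEntry:
    """One row of the data inventory (hypothetical subset of the template fields)."""

    source_name: str
    platform: str
    users: str
    entities: List[str]
    aggregated: bool
    cleaned: bool
    pii_fields: List[str]
    pii_purpose: str
    data_type: str
    volume: str
    quality_score: int  # 1 (low) to 10 (high), as in the template
    quality_auditing: str
    retention_period: str
    owner: str
    steward: str

    def __post_init__(self) -> None:
        # Reject scores outside the 1-10 scale used by the template.
        if not 1 <= self.quality_score <= 10:
            raise ValueError("quality_score must be between 1 and 10")

    @property
    def has_pii(self) -> bool:
        # The entry carries PII risk if any PII fields are listed.
        return bool(self.pii_fields)


# The example row from the table above.
entry = InventoryEntry(
    source_name="product.customers_details",
    platform="MySQL",
    users="Internal",
    entities=["Customers"],
    aggregated=False,
    cleaned=False,
    pii_fields=["Phone", "Email", "Age", "Gender"],
    pii_purpose="Enhance usage of website; advertise products suitable for age & gender",
    data_type="Master Data",
    volume="10GB",
    quality_score=10,
    quality_auditing="No process at present",
    retention_period="Until account closure request by customer",
    owner="Head of Marketing",
    steward="Marketing Analytics Analyst",
)

# Print the row as "field | value" pairs, mirroring the table layout.
for name, value in asdict(entry).items():
    print(f"{name} | {value}")
```

Holding entries in a structure like this makes it easy to filter the inventory, for example to list every datasource where `has_pii` is true, but a spreadsheet works just as well for a first pass.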
As always, this sort of template may or may not meet your needs. It’s really driven by the organization you work in and the type of data it holds. However, hopefully it’s a good starting point!