The first step in writing a data strategy is exploration and investigation. It’s important to fully understand your starting point before trying to define what needs to be done.
The first output of this assessment should be a data inventory. It helps you identify the data you hold, the platforms on which it resides, and many of the controls already in place around it. An inventory may include some of the fields below:
|Data Source Name||What is the name of the database table, file system directory, etc. that uniquely identifies this datasource?|
|Data Source Platform||What platform is your data stored in (MySQL, HDFS, etc.)?|
|Datasource Users||Is this customer facing or internal data?|
|Entities Defined||What entities are described in your datasource. For example, customers, orders, transactions|
|Aggregated dataset||Is the dataset aggregated, ready for reporting?|
|Cleaned dataset||Is the dataset cleaned & prepared for end-user usage?|
|Risk: PII||Is PII included in the dataset? If so, which fields?|
|Risk: PII Purpose||Why is this PII data required? What is the business justification?|
|Risk: PII mitigation||Is the data encrypted? Hashed?|
|Access Requirements||Who should have access to the data?|
|Data Type||Master Data: describes objects such as employees, customers, locations, office addresses, organizations and products.
Unstructured Data: data without a fixed structure, such as bodies of text, social comments, images, videos and documents.
Transactional Data: describes an event and usually has a timestamp associated with it, such as a sale, an invoice, a return or an activity (e.g. the time you entered the gym when you swiped your membership card).
Metadata: data about your data, such as report definitions, database column descriptions and config files.
Hierarchical Data: describes objects with a hierarchical structure, such as a family tree or an org structure. In a cellular network, you may have a hierarchical model for family plans, where the children are associated with their parents’ account.
Reference Data: typically static data, such as timezone data or location information. For example, you may have a static list of all your retail store locations, which you could use to support ongoing analysis.|
|Data Volume||Volume: how large is the dataset?|
|Data Value||We need to understand how valuable this data is for the business. Does it provide us with insight to support cost saving initiatives? Does it help us drive more revenue?|
|Data Quality Score (1 (low) to 10 (high))||Undertake an assessment to determine how reliable the data is. How confident are we in insights derived from it?|
|Known Quality Issues||What issues led to a lower score?|
|Data Quality Monitoring||How is data quality monitored?|
|Data Quality auditing||How is data quality audited?|
|Data Variability||How quickly does the data change over time? If the format of data changes frequently, we will likely need more rigorous data quality monitoring in place.|
|Datasource Metadata||Is the datasource properly described (field descriptions, etc.)? Link to the documentation.|
|Datasource Lineage||Link to the document which describes how this data has been manipulated into its current form.|
|Datasource retention period||How long is data retained for? What policy underpins this?|
|Datasource owner||Who owns the datasource?|
|Datasource Steward||Who has stewardship responsibilities on the data?|
|Access management process||Describe how access is provisioned, how it is approved and how it is periodically reviewed.|
|Data re-Use policy||How is this data intended to be used? Should it be a self-serve dataset?|
An example might be:
|Data Source Name||product.customers_details|
|Data Source Platform||MySQL|
|Risk: PII||Yes – Phone, Email, Age, Gender|
|Risk: PII Purpose||Required to enhance usage of website, advertising products suitable for age & gender|
|Risk: PII mitigation||Yes|
|Access Requirements||Marketing (where customer has opted in)|
|Data Type||Master Data|
|Data Value||Drives significant revenue; reduces marketing costs through targeted marketing.|
|Data Quality Score (1 (low) to 10 (high))||10|
|Known Quality Issues||No known issues in data quality|
|Data Quality Monitoring||Data ingestion process has email and phone validation|
|Data Quality auditing||No process at present|
|Data Variability||The data is consistently formatted|
|Datasource Metadata||Link to metadata here.|
|Datasource Lineage||Link to lineage description here.|
|Datasource retention period||Until account closure request by customer|
|Datasource owner||Head of Marketing|
|Datasource Steward||Marketing Analytics Analyst|
|Access management process||Link to access management process here.|
|Data re-Use policy||This dataset should not be used for purposes outside of marketing without further approval and review.|
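If you want to keep the inventory somewhere more structured than a document, one option is to model each entry as a record. The sketch below is only an illustration, not a prescribed tool: the `InventoryEntry` class, its field names and the `gaps` helper are hypothetical names chosen for this example, and it covers only a subset of the template fields. It loads the example entry above and lists the governance items that still need follow-up.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DataType(Enum):
    """The data types described in the template."""
    MASTER = "Master Data"
    UNSTRUCTURED = "Unstructured Data"
    TRANSACTIONAL = "Transactional Data"
    METADATA = "Metadata"
    HIERARCHICAL = "Hierarchical Data"
    REFERENCE = "Reference Data"


@dataclass
class InventoryEntry:
    """A hypothetical record holding a subset of the inventory fields."""
    source_name: str
    platform: str
    data_type: DataType
    owner: str
    steward: str
    quality_score: int  # 1 (low) to 10 (high)
    pii_fields: List[str] = field(default_factory=list)
    pii_purpose: Optional[str] = None
    pii_mitigated: bool = False
    quality_auditing: Optional[str] = None  # None means no process exists
    retention_period: Optional[str] = None

    def __post_init__(self) -> None:
        # Enforce the 1-10 scale used by the template.
        if not 1 <= self.quality_score <= 10:
            raise ValueError("quality_score must be between 1 and 10")

    def gaps(self) -> List[str]:
        """Return the governance items that still need follow-up."""
        issues = []
        if self.pii_fields and not self.pii_purpose:
            issues.append("PII present without a documented purpose")
        if self.pii_fields and not self.pii_mitigated:
            issues.append("PII present without mitigation")
        if not self.quality_auditing:
            issues.append("No data quality auditing process")
        if not self.retention_period:
            issues.append("No retention period defined")
        return issues


# The example entry from the table above.
customers = InventoryEntry(
    source_name="product.customers_details",
    platform="MySQL",
    data_type=DataType.MASTER,
    owner="Head of Marketing",
    steward="Marketing Analytics Analyst",
    quality_score=10,
    pii_fields=["Phone", "Email", "Age", "Gender"],
    pii_purpose="Required to enhance usage of website, "
                "advertising products suitable for age & gender",
    pii_mitigated=True,
    quality_auditing=None,  # "No process at present"
    retention_period="Until account closure request by customer",
)

print(customers.gaps())  # ['No data quality auditing process']
```

Keeping entries in a form like this makes it easier to scan the whole inventory for the same gaps, for example every datasource that holds PII without mitigation. Which checks matter will depend on your organization.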
As always, this sort of template may or may not meet your needs. The right fields depend on the organization you work in and the type of data it holds. Hopefully it’s a good starting point!