Data Strategy: Data Inventory

The first step we should undertake in order to write a data strategy is exploration / investigation. It’s really important to fully understand your starting point before trying to define what needs to be done.

The first output from this assessment needs to be a data inventory, this helps you to identify the data you hold; the platforms in which it resides and many of the controls around the data that are already in place. An inventory may include some of the below fields:

Data Source NameWhat is the name of the database table; file system directory; etc.. which uniquely identifies this datasource.
Data Source PlatformWhat platform is your data stored in (MySQL, HDFS, etc..)
Datasource UsersIs this customer facing or internal data?
Entities DefinedWhat entities are described in your datasource. For example, customers, orders, transactions
Aggregated datasetIs the dataset aggregated, ready for reporting?
Cleaned datasetIs the dataset cleaned & prepared for end-user usage?
Risk: PIIIs PII included in the dataset? What are they?
Risk: PII PurposeWhy is this PII data required? What is the business justification?
Risk: PII mitigationIs the data encrypted? Hashed?
Access RequirementsWho should have access to the data?
Data TypeMaster Data: this describes objects: employees, customers, locations, office addresses, organizations; products; etc..

Unstructured data: is data without a fixed data structure: bodies of text; social comments; images; videos; documents etc..

Transactional Data: is transactional in nature, it usually has a timestamp associated to it and describes an event: a sale; an invoice; a return or an activity (e.g. the time you entered the gym when you swipe your membership card).

Metadata: describes your data. It’s data about data: report definitions; database column descriptions; config files; etc..

Hierarchical Data: is data that describes objects with a hierarchical structure. For example, a family tree or org structure. In a cellular network; you may have a hierarchical model for family plans. The children are associated to their parents account.

Reference Data: is static data typically. It could include timezone data;  location information etc.. For example, you may have a static list of all your retail store locations, which you could use to support ongoing analysis.
Data VolumeVolume: how large is the dataset?
Data Value We need to understand how valuable this data is for the business. Does it provide us with insight to support cost saving initiatives? Does it help us drive more revenue?
Data Quality Score (1 (low) to 10 (high))Undertake an assessment to determine how reliable the data we have is? How confident are we in insights derived from it?
Known Quality IssuesWhat issues led to a lower score?
Data Quality MonitoringHow is data quality monitored?
Data Quality auditingHow is data quality audited?
Data VariabilityHow quickly does the data change over time? If the format of data changes frequently, we will likely need more rigorous data quality monitoring in place.
Datasource MetadataIs the datasource properly described (field descriptions etc..). Link to the documentation.
Datasource LineageLink to the document which describes how this data has been manipulated into its current form.
Datasource retention periodHow long is data retained for? What policy underpins this?
Datasource ownerWho owns the datasource?
Datasource StewardWho has stewardship responsibilities on the data?
Access management processInclude here how access is provisioned; how it is approved; how it is periodically reviewed
Data re-Use policyHow is this data intended to be used? Should it be a self-serve dataset? 

An example might be:

Data Source Nameproduct.customers_details
Data Source PlatformMySQL
Datasource UsersInternal
Entities DefinedCustomers
Aggregated datasetNo
Cleaned datasetNo
Risk: PIIYes – Phone, Email, Age, Gender
Risk: PII PurposeRequired to enhance usage of website, advertising products suitable for age & gender
Risk: PII mitigationYes
Access RequirementsMarketing (where customer has opted in)
Data TypeMaster Data
Data Volume10GB
Data Value Drives significant revenue; reduces marketing costs through targeted marketing.
Data Quality Score (1 (low) to 10 (high))10
Known Quality IssuesNo known issues in data quality
Data Quality MonitoringData ingestion process has email and phone validation
Data Quality auditingNo process at present
Data VariabilityThe data is consistently formatted
Datasource MetadataLink to metadata here.
Datasource LineageLink to lineage description here.
Datasource retention periodUntil account closure request by customer
Datasource ownerHead of Marketing 
Datasource StewardMarketing Analytics Analyst
Access management processLink to access management process here.
Data re-Use policyThis dataset should not be used for purposes outside of marketing without further approval and review.

As always, this sort of template may or may not meet your needs. It’s really driven by the organization you work in & the type of data they hold. However, hopefully it’s a good starting point!

Previous Article
Basic PostgreSQL commands you need to know

Basic PostgreSQL commands you need to know

Next Article

Implement User Defined Functions in SQL (MySQL)

Related Posts