The data scientist learning plan for 2021

When you look online at what it takes to become a data scientist, it’s enough to make your brain melt. You have people telling you that you need to be an expert statistician / mathematician; that you need to be a top-level coder in 15 different languages; that you must be proficient with every type of SQL/NoSQL database on the planet; and that you should be able to fully configure and manage multiple cloud environments.

I’ve been trawling job ads on the web to separate fact from fiction and get a real view of the skills that are needed in the industry, to help support the creation of a study pathway for aspiring data scientists. Below are my key takeaways (and opinions) from this analysis. Of course, this is not an exhaustive list, and every company has different requirements for its data scientists.

Data Scientist Skills

PYTHON

It’s no surprise that Python is the skill most frequently required of a data scientist. But Python is a big topic, so what do you really need to know?

In my opinion, a data scientist should also be a very strong data engineer; there is a huge overlap in their workloads and in most businesses, the lines between the two roles are blurred.

The list of concepts I would become comfortable with is:

  • Python basics: variables; data structures; loops; conditional flow; functions
  • Get data from a variety of sources: learn to extract data from relational databases; APIs (e.g. the Twitter/YouTube API); CSV files; XML files; JSON files; web scraping
  • Clean your data: now you’ve got the data, learn how to handle missing values; clean messy data; make new features/columns from files (e.g. converting date formats; carrying out fuzzy-lookups to clean user-responses)
  • Analyse your data: use the Pandas and NumPy libraries to aggregate your data and start understanding it better
  • Visualize your data: using the Matplotlib, Seaborn or Plotly libraries to visually understand your data
  • Carry out NLP: learn to take text-based data and understand the sentiment and the topics discussed (using topic modelling), with the spaCy and NLTK libraries.
  • Forecast time series data: ingest your data into a model to determine what the value will be 6 weeks from now
  • Do some data exploration and feature engineering: understand your data; look for correlations; look for interesting properties of your data and engineer some new features.
  • Prepare your data for modelling: most machine learning models can’t deal with datetime stamps or categorical data. Learn to encode and standardize your datasets.
  • Start building supervised machine learning models: starting with the most basic models (linear / logistic regression), build models for classification and regression problems using scikit-learn. Learn how to tune the model to make it more accurate.
  • Get a little bit more complex with your models: build out a random forest and a gradient boosted machine algorithm; use GridSearchCV to tune the model; use principal component analysis to reduce the dimensionality of large datasets.
  • Understand the accuracy of your ML algorithm: understand the confusion matrix, ROC curves and other model accuracy scores. I’ve discussed this here.
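
To make the last point a little more concrete, here is a minimal sketch of how the confusion matrix and the usual scores fall out of a set of predictions. The labels are made up for illustration, and I’ve deliberately used only the Python standard library so it runs anywhere; in practice you would more likely reach for the metrics that ship with scikit-learn.

```python
from collections import Counter

# Hypothetical labels for ten records: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]


def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],  # actual positive, predicted positive
        "fp": counts[(0, 1)],  # actual negative, predicted positive
        "fn": counts[(1, 0)],  # actual positive, predicted negative
        "tn": counts[(0, 0)],  # actual negative, predicted negative
    }


cm = confusion_counts(y_true, y_pred)

# Accuracy: share of all predictions that were correct.
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)
# Precision: of everything predicted positive, how much really was positive.
precision = cm["tp"] / (cm["tp"] + cm["fp"])
# Recall: of everything that really was positive, how much we found.
recall = cm["tp"] / (cm["tp"] + cm["fn"])
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The point of working through it by hand once is that accuracy alone can hide a lot: precision and recall tell you which kind of mistake the model is making.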

MATHEMATICS / STATISTICS

My view here is that you need to understand the statistical concepts and how to interpret the output, but you’re very unlikely to need to calculate things manually. I’ve written an article here listing the statistical concepts I think you should be comfortable with. They are:

  • Data Types: Numerical: Discrete; Numerical: Continuous and Categorical
  • Measures of central tendency: Mean, Median, Mode; Weighted Mean
  • Point estimates & confidence intervals
  • Percentiles / quartiles / box plots
  • Distributions: Skew; Normal Distribution; Standard Normal Distribution; Central Limit Theorem; Standard Error
  • Measures of variability: Range; Variance; Standard Deviation; Z-Score
  • Relationships: Correlation; Covariance
  • Probability: Addition Rule; Multiplication Rule; Conditional Probability; Testing For Independence; calculating possible permutations
  • Hypothesis: Null Hypothesis; Alternative Hypothesis; p-value
  • Regression Tables: R; R-Squared; Adjusted R-Squared; F-Statistic; Coefficient; Standard Error of Estimate
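
As a rough illustration of the "understand it, don’t hand-calculate it" point, the sketch below computes a handful of the measures from the list using Python’s built-in statistics module. The numbers are a hypothetical sample (hours studied and exam scores for eight students), and the pearson helper is my own small function, written out so you can see what a correlation coefficient is made of.

```python
import statistics
from math import sqrt

# Hypothetical sample: hours studied and exam scores for eight students.
hours = [2, 3, 5, 5, 6, 8, 9, 10]
scores = [52, 55, 61, 64, 70, 78, 85, 91]

# Measures of central tendency.
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
mode_hours = statistics.mode(hours)

# Measures of variability (sample variance and standard deviation).
variance_score = statistics.variance(scores)
stdev_score = statistics.stdev(scores)

# Z-score: how many standard deviations the top score sits above the mean.
z_top = (max(scores) - mean_score) / stdev_score

# Standard error of the mean: standard deviation scaled by sample size.
std_error = stdev_score / sqrt(len(scores))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)


r = pearson(hours, scores)

print(f"mean={mean_score} median={median_score} mode={mode_hours}")
print(f"stdev={stdev_score:.2f} z_top={z_top:.2f} std_error={std_error:.2f}")
print(f"correlation={r:.3f}")
```

What matters is being able to read the output: a correlation close to 1 here says that scores rise almost in step with hours studied in this sample, not that studying causes the score.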

SQL

SQL is to data what yeast is to bread: absolutely necessary. Its simplicity makes data accessible to anyone in the business, and it has become the default way to access data. Many solutions (not just SQL Server) have adopted an SQL-like language for data access, including Apache Hive, PostgreSQL, MySQL, Google BigQuery and more. I believe that understanding SQL will stand you in good stead for the rest of your career. You can learn the key concepts here. The ones you should be comfortable with are:

  • Selecting and filtering data
  • Aggregating / grouping data
  • Select distinct / count distinct
  • Order by / sort by
  • Casting into different datatypes
  • Sub Queries / nested queries
  • Joining datasets
  • Union datasets
  • Case statements
  • Windowing functions (rank, row number, lead, lag)
  • String functions
  • Create tables / alter tables / partition tables / insert into tables / drop tables
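
If you want somewhere to practise these without installing a database, one option is the sqlite3 module that ships with Python. The sketch below touches several of the concepts above (creating tables, inserting, joining, grouping, a case statement and a window function) against an in-memory database. The customers and orders tables and their contents are hypothetical, and the window function assumes the SQLite bundled with your Python is version 3.25 or newer.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Create and populate two small, hypothetical tables.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ana', 'North'), (2, 'Ben', 'South'), (3, 'Cat', 'North');
INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 40.0), (4, 3, 300.0);
""")

# Join + group by + case statement + order by:
# total spend per customer, bucketed into a tier.
totals = conn.execute("""
SELECT c.name,
       SUM(o.amount) AS total,
       CASE WHEN SUM(o.amount) >= 200 THEN 'high' ELSE 'low' END AS tier
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY total DESC
""").fetchall()

# Window function: rank each customer's orders from largest to smallest.
ranked = conn.execute("""
SELECT customer_id,
       amount,
       RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rnk
FROM orders
ORDER BY customer_id, rnk
""").fetchall()

for row in totals:
    print(row)
for row in ranked:
    print(row)

conn.close()
```

The SQL itself is close to what you would write against PostgreSQL or MySQL, although each engine has its own small differences in functions and datatypes.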

APACHE SPARK

Apache Spark is the final technology I am going to mention from this list. It features relatively frequently in job ads and enables us to work with massive datasets. You can learn the basics of PySpark here. The key concepts are:

  • Ingesting data into dataframes from databases or files
  • Inspecting your dataframe to understand the shape of the data
  • Handling null and duplicate values
  • Selecting and filtering data in your dataframe
  • Running SQL on your dataframes
  • Adding new, calculated columns to the dataframe
  • Group by and aggregations of data in your dataframe
  • Writing the dataframe to files / database tables

OTHER HONOURABLE MENTIONS

Other skills which were mentioned quite frequently in job ads were: R; Git; NoSQL databases (e.g. MongoDB); Scala and Power BI.

Some skills were mentioned which I did not incorporate into my analysis. These were generic terms around visualization, architecture / cloud infrastructure and things of that nature.

I would encourage you to spend some time analysing job descriptions for yourself too, as there will always be other things which you pick out, which I may have missed.

Kodey