An introduction to data structures for aspiring data engineers

This is an introductory chapter of my upcoming book ‘Data Badass’ (pictured below):

Data Types

Data types are ways to classify data. For example, integer, string and boolean. These assignments / classifications determine what kind of values we can store in our data structures (outlined below).

A string, is simply a text value. An example of a string is “The Tower of London”. An integer is a whole number, for example 14 and a boolean value is True or False.

Data types are either built-in (e.g. integers, strings, booleans) or derived data types (such as lists, arrays and queues – discussed below).

Arrays

Arrays are containers which can hold a fixed number of objects. Each object within the array must be of the same data type. Below, I have created a visual representation of an array. This array has five items in it and all of those items are integer values.

2731481629

Each item in the array is indexed. The index tells us the position of the item within the array. In most programming languages, the array index starts from zero, as is depicted in the below example. Here, index 2 is equal to 48.

2731481629
01234

In the below, I have shown how we may implement an array in Python, along with the output of selecting array index 2.

from array import * my_array = array('i', [27, 31, 48, 16, 29])
my_array[2]
Output:
48

Within Python, you will note we have data structures like the list and a tuple. These are very similar in many ways to an array. The below table will give you an overview of the differences:

CriteriaArrayListTuple
Mutable*YESYESNO
OrderedYESYESNO
All items must be the same data typeYESNOYES
Need to be declared**YESNONO
Storage Efficiency***HIGHLOWLOW
Numerical operationsYESNONO

*Mutable means you can change the content within the data structures. With an array and a list, you can update a given index value, while with a Tuple, you cannot – once you have defined your tuple, there is no way to change it.

**When we create an array, you will  note that we needed to declare it as such. While, using a list or Tuple, we do not – they are built in Python data structures.

***Arrays are good for lots of data as they store the values very efficiently.  They also allow us to carry out numerical operations directly on the array (e.g. my_array/10, will divide every item within the array by 10.

In the below, I have shown the process to define a list and tuple in Python. Unlike an array, we did not need to import any modules and we did not have to explicitly define these as lists or tuples, as they are built into the Python syntax.

my_list = [27, 31, 48, 16, 29]
my_tuple = (27, 31, 48, 16, 29)

Vectors

A vector is an ordered list of scalar values (e.g. 1, 5, 6, 9) and is denoted as a bold lower-case letter. For example b = [1, 3]. These can be visualised as arrows, as below. The magnitude (size) of the arrow & it’s direction, can give you a good deal of intuition visually.

Programatically, we don’t define a vector as a particular data structure (like a list or an array). It’s just what we call a 1D array when the dataset can be represented by magnitude (length) and direction. We use them to present physical quantities (like velocity or acceleration).

Matrices

An array can be considered to be a single row within a table; it’s one dimensional. Whereas, a matrix is a multi-dimensional array. Let’s look at some data.

In the below, row 1 of data, is the same as our array we defined above. That array, simply reflected the first row of the table below.

col1col2col3col4col5
2731481629
5597135422
8152643211

What happens if we want to store all of the data though? We have three rows of data, we don’t want three arrays. Well, this is where we use Matrices – these are essentially arrays within arrays, giving a columnar structure to our data.

We can simply define this as shown below, using some sample Python code. You can see, this has been defined as a list, containing three lists. Each of the three internal lists can be considered as a row from the table above.

my_matrix = [[27, 31, 48, 16, 29],             
             [55, 97, 13, 54, 22],              
             [81, 52, 64, 32, 11]]
my_matrix
Output:
[[27, 31, 48, 16, 29], [55, 97, 13, 54, 22], [81, 52, 64, 32, 11]]

If we want to access a particular value, we now have a multi-level index. As you can see below, accessing my_matrix index item one, gives you row number 2 (because indexing starts at zero) of your matrix.

my_matrix[1]
Output:
[55, 97, 13, 54, 22]

To access a particular item, we can use two indexes. In the below, we’re accessing row number one, item two.

my_matrix[1][2]
Output:
13

Mathematically, a matrix is denoted with a bold upper case letter. In the below, matrix H is defined by simply enclosing values between square brackets.

H = [10, 16, 12

         2,   4,  7  ]

I have included the positioning in the below. Here, H(1,1) would be equal to 10; H(1,2) would be equal to 16 and so on.

H = [H(1,1), H(1,2), H(1,3)

         H(2,1),   H(2,2), H(2,3) ]

Hash Tables / Hashmaps / Dictionaries

A hash table (also known as a hashmap and known in Python as a dictionary) is a key-value data structure. If we think about the below table; we may wish to use ID as the key. That is; when I search the table, I want to search by ID and return all the values associated to the ID.

IDAgeHeight_CM
Bob34145
Sarah18165

Using a hash table makes the process of searching our data and inserting new data much faster than alternative data structures. Below, I have provided a small code snippet in Python to show how the above table may be implemented. I have then searched the dictionary for the key ‘Bob’ – you can see that it has returned an output, showing Bob’s age and height.

my_datastructure = { 'Bob': {'Age': 34, 'Height_CM' : 175},                     
                     'Sarah' : {'Age' : 18, 'Height_CM' : 165} }
my_datastructure['Bob']
Output:
{'Age': 34, 'Height_CM': 175}

Queues 

Much like queues in real life, a queue as a data structure allows items to be added to the back and serviced from the front. Let’s look at how this might work.

In the below, I define my_queue as a new list. I then insert a value at position 0 of the list. When I print the output, you can see, the value has been added to the list successfully. Next, I add one more value, also at position zero of the list. Now, upon printing, you can see, it’s joined the end of the queue – as it appears before the original value I inserted.

my_queue = list()
my_queue.insert(0, 'Val1')
print(my_queue)
my_queue.insert(0, 'Val2')
print(my_queue)
Output:
['Val1']
['Val2', 'Val1']

As items are processed, we can use the pop() function to remove them from the front of the queue, as demonstrated below, where you can see – the first item that joined the queue, has now been removed from the list.

my_queue.pop()
print(my_queue)
Output:
['Val2']


In the above, we have used lists to manage our queue. There is also a Deque library, which performs in a similar manner.

Kodey