Generate mock data to test your pipeline

Quite often, we need to test that our pipelines work at scale without having access to production systems. To help with this, we can generate mock data using the Python library ‘Faker’.

Faker is a comprehensive fake-data library. It can generate customers, addresses, bank details, company names, credit card details, currencies, cryptocurrencies, files, domain names, IP addresses, emails, user agents and plenty more.

Let’s walk through the process to generate some fake weblogs. First, let’s import the key libraries we’re going to use.

from faker import Faker
import pandas as pd
from random import randint
# Faker loads its standard providers automatically, so no provider imports are needed.

In weblogs, we will see a number of things, including:

  • Date/Time of request
  • Some sort of customer identifier
  • Domain name accessed
  • Data volume used (upload)
  • Data volume used (download)
  • Client IP address
  • Server IP address
  • Country of access
  • Filename accessed
  • File Type
  • HTTP method used
  • Client user agent

With Faker, generating these is very easy. In the code below, I have defined an empty list called output. I then loop 1,000 times, calling our function to generate 1,000 rows of data. On each iteration, the new row gets appended to the list (creating a list of lists). This is then converted into a dataframe to give us some nice data.

We have generated most of the data using Faker; however, for the usage details, I have instead selected random integers within a fixed range.

output = []
fake = Faker('en-GB')  # UK locale for phone numbers, domains, etc.


def generate():
    """Append one fake weblog row to the global output list."""
    date_time = fake.date_time()
    msisdn = fake.msisdn()  # mobile number, used as the customer identifier
    domain = fake.domain_name()
    dlbytes = randint(50, 1000000)  # download volume
    ulbytes = randint(50, 1000000)  # upload volume
    clientip = fake.ipv4()
    serverip = fake.ipv4()
    filename = fake.file_path()
    file_type = filename.rsplit('.', 1)[-1]  # text after the last dot
    http_method = fake.http_method()
    user_agent = fake.user_agent()

    output.append([date_time, msisdn, domain, dlbytes, ulbytes, clientip, serverip, filename, file_type, http_method, user_agent])
    return output


# Generate 1,000 rows
for _ in range(1000):
    generate()

columns = ['date_time', 'msisdn', 'domain', 'dlbytes', 'ulbytes', 'clientip', 'serverip', 'filename', 'file_type', 'http_method', 'user_agent']
df = pd.DataFrame(output, columns=columns)
df

The output of this script is as below. You can see that we have a pretty good dataset here, but we can definitely do better.

  1. It’s unrealistic to think that all domains will be bare registered domains (e.g. example.com). It’s likely we will have subdomain information too. We can generate that in Faker by simply adding a number of levels into the parentheses. For example, fake.domain_name(4) produces a name with four labels in front of the top-level domain, which means three levels of subdomain (e.g. a.b.c.example.com).
  2. We may want more control over the filename path included in the data, so we can control the depth of the filepath (e.g. folder/folder/folder/filename.extension).
  3. We may have a specific type of file we want to test, for example, we may only want to test three types of video file: MP4, MOV and AVI.
  4. When you request a webpage, you also request tonnes of other resources (e.g. images, JavaScript files, etc.). We may want to have a column with a comma-separated list of these additional requests.
  5. You’ll notice above, we have a tonne of irrelevant dates – from a long time ago. We should ensure that all dates are within the last month.

In the below, each of these options is addressed. The domain_name call chooses the number of levels by passing a random number between 1 and 3 in the parentheses. Similarly, the file_path call controls the depth in this way, and the file extension is chosen at random from the list called ‘video_types’.

Finally, we’ve added a requests column. Faker’s csv generator populates it with the file path and the HTTP method used to retrieve each additional resource, as comma-separated rows.

from faker import Faker
import pandas as pd
import random
from random import randint
# Faker loads its standard providers automatically, so no provider imports are needed.

output = []
fake = Faker('en-GB')

video_types = ['mp4', 'mov', 'avi']
    
def generate():
    """Append one fake weblog row to the global output list."""
    # Last 30 days only. In Faker's offset syntax a lowercase 'm' means
    # minutes, not months, so '-1m' would not give a month of dates.
    date_time = fake.date_time_between(start_date='-30d', end_date='now')
    msisdn = fake.msisdn()
    domain = fake.domain_name(randint(1, 3))  # 1 to 3 levels
    dlbytes = randint(50, 1000000)
    ulbytes = randint(50, 1000000)
    clientip = fake.ipv4()
    serverip = fake.ipv4()
    # An explicit extension is passed, so no file category is needed
    filename = fake.file_path(depth=randint(1, 3), extension=random.choice(video_types))
    file_type = filename.rsplit('.', 1)[-1]  # text after the last dot
    http_method = fake.http_method()
    user_agent = fake.user_agent()
    # Ten additional requests, as CSV rows of file path and HTTP method
    requests = fake.csv(data_columns=('{{file_path}}', '{{http_method}}'), num_rows=10, include_row_ids=False)

    output.append([date_time, msisdn, domain, dlbytes, ulbytes, clientip, serverip, filename, file_type, http_method, user_agent, requests])
    return output


# Generate 1,000 rows; change this number to produce more or fewer
for _ in range(1000):
    generate()

columns = ['date_time', 'msisdn', 'domain', 'dlbytes', 'ulbytes', 'clientip', 'serverip', 'filename', 'file_type', 'http_method', 'user_agent', 'requests']
df = pd.DataFrame(output, columns=columns)
pd.set_option('display.max_colwidth', None)  # show the full requests column
df

Looking at the output below, we now have a very strong test dataset. We can continue to tweak it to meet our needs, and we can generate as many rows as we need for our testing by simply changing the number of iterations in the loop.

Kodey