Project 1 (Print file):

This is a simple use case: we are just going to print the contents of a file. To do that, we use:

awk '{print}' filename.csv

Here, the single quotes enclose the program, and the curly braces enclose an action within that program. In this case, the action is print.
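As a minimal sketch (the file name and its contents are made up here), we can create a small sample file and print it back:

```shell
# Create a hypothetical sample file with two lines
printf 'line one\nline two\n' > sample.txt

# Print every line unchanged
awk '{print}' sample.txt
# line one
# line two
```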

Project 2 (Print field(s) from file):

Awk is pretty handy for dealing with columnar data. By default, it treats each word as a field (space delimited), but we can change that: awk -F, for comma delimited; awk -F';' for semicolon delimited; and awk -F'|' for pipe delimited (the semicolon and pipe must be quoted so the shell doesn't interpret them).

awk -F, '{print $1}' filename.csv

In the above, we use a number to access a field: $0 refers to the whole line; $1 is field one, $2 is field two, and so on. We can print multiple columns from a file by separating the field numbers with commas:

awk -F, '{print $1, $2}' filename.csv
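To make that concrete (the sample data here is made up), suppose filename.csv holds a size and a name per line:

```shell
# Hypothetical two-field CSV: size,name
printf '12000,Jori\n800,Alex\n' > filename.csv

# Just the names
awk -F, '{print $2}' filename.csv
# Jori
# Alex

# Both fields; awk joins them with a space on output
awk -F, '{print $1, $2}' filename.csv
# 12000 Jori
# 800 Alex
```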

Project 3 (calculations on a field & casting):

In the below, we take field 1, divide it by 1,000 and print it alongside field 2.

awk -F, '{print $1/1000, $2}' filename.csv

The output of that is horrible; we can make it more human readable by casting to int, which truncates the value and gets rid of the decimal places:

awk -F, '{print int($1/1000), $2}' filename.csv
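With some made-up sample data, you can see both the ugly raw output and the truncation at work:

```shell
# Hypothetical CSV of counts and names
printf '10300,Jori\n10900,Alex\n' > filename.csv

awk -F, '{print $1/1000, $2}' filename.csv
# 10.3 Jori
# 10.9 Alex

awk -F, '{print int($1/1000), $2}' filename.csv
# 10 Jori
# 10 Alex    <- 10.9 is truncated to 10, not rounded up
```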

That’s great if the number is 10.3 and it should be rounded down. But 10.9 should really be rounded up to 11.

We’ll look at that next.

Project 4 (functions in Awk):

For this, we’re going to create a file; let’s call it round_numbers.awk. Inside that file, we’re going to write the below piece of code. This code:

  1. Defines a function called round, into which we pass the value n (we’ll talk about what n is in a second).
    1. In this function, we take the value of n and add 0.5. Why? Well, imagine you have the number 10.3: if we add 0.5, it becomes 10.8, so when we truncate, it correctly rounds down to 10. If we had the number 10.9 and added 0.5, we’d have 11.4, which means that when we truncate, we correctly round up to 11.
    2. We then cast that value to int to actually do the truncation.
    3. We then return the result.
  2. Next, we print the value of field 1 alongside field 1 divided by 1024 after it has been passed into the function we just defined. So, for each row, $1/1024 gets passed into the round function as n. So n is simply the number from the file that we are running the function on.

function round(n) {
    n = n + 0.5
    n = int(n)
    return n
}

{print $1, round($1/1024)}  

To run the script:

awk -F, -f round_numbers.awk filename.csv
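As a quick end-to-end check (the file sizes.txt and its values are made up here; with a single-column input file there is no need for -F,):

```shell
# Write the rounding script from above to a file
cat > round_numbers.awk <<'EOF'
function round(n) {
    n = n + 0.5
    n = int(n)
    return n
}
{print $1, round($1/1024)}
EOF

# Hypothetical one-column input of byte counts
printf '2048\n3000\n' > sizes.txt

awk -f round_numbers.awk sizes.txt
# 2048 2
# 3000 3    <- 3000/1024 = 2.93, correctly rounded up to 3
```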

Project 5 (filtering / where statements):

Now it gets interesting. Usually, you don’t want to work on every row of your file; you’re often looking for something a bit more specific.

Hence, we have the below. Here, we’re taking the file and filtering to print only the values where field one is greater than 10.

awk -F, '$1 > 10 {print $1, $2}' filename.csv

We can have multiple conditions too. Here, we are finding records where field 2 contains Jori and field 1 is greater than 10.

awk -F, '$2 ~ /Jori/ && $1 > 10 {print $1, $2}' filename.csv

We could alter it slightly to show all records where field 2 starts with a J and field 1 is greater than 10. This allows you to pass in regex statements and make it as complex as you like.

awk -F, '$2 ~ /^J/ && $1 > 10 {print $1, $2}' filename.csv
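Running that against some made-up sample data shows the filter keeping only the J-names above the threshold:

```shell
# Hypothetical sample: amount,name
printf '12,Jori\n5,Jane\n20,Alex\n15,Jess\n' > filename.csv

awk -F, '$2 ~ /^J/ && $1 > 10 {print $1, $2}' filename.csv
# 12 Jori
# 15 Jess
```

Jane is dropped because 5 is not greater than 10, and Alex is dropped because the name doesn’t start with J.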

Project 6 (Loops):

The final thing we’re going to look at is loops. In the below I have created a file called loop.awk. This for loop:

  1. Sets i equal to one
  2. While i is less than or equal to n, prints the value of i
  3. Increments i by one on each pass
  4. We then pass field $1 into the function we defined, as n

function printx(n) {
    for (i = 1; i <= int(n); ++i)
        print i
}

{printx($1)}

To run the script:

awk -f loop.awk filename_to_run_on.csv
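A quick sketch of the whole thing (the input file counts.txt and its contents are made up here):

```shell
# Write the loop script from above to a file
cat > loop.awk <<'EOF'
function printx(n) {
    for (i = 1; i <= int(n); ++i)
        print i
}
{printx($1)}
EOF

# Hypothetical one-column input
printf '3\n' > counts.txt

awk -f loop.awk counts.txt
# 1
# 2
# 3
```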

Project 7 (Find the last row of a file & use it in script):

Here, we can simply use the END block, which runs after the last line of the file has been read; in the below I have extracted field 2 of that final line.

awk -F, 'END{print $2}' filename.csv

We could even save the output to a shell variable, called perm.

perm=$(awk -F, 'END{print $2}' filename.csv)

echo "$perm"
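With made-up data, you can see that END still sees the fields of the last record read:

```shell
# Hypothetical CSV; the last line is 3,baz
printf '1,foo\n2,bar\n3,baz\n' > filename.csv

perm=$(awk -F, 'END{print $2}' filename.csv)
echo "$perm"
# baz
```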

Then, we can pass that into a spark-submit call if we like, so we can use our generated value within our scripts.

spark-submit python_file.py $perm > stdout.log 2>stderr.log

We can then reference that variable in the python file using the below.

sys.argv[1]

Project 8 (Find last modified file & extract final line):

This is a script we can use to find the most recently modified file in the current directory.

#!/bin/bash

last_modified=$(ls -t1 | head -1)

echo "$last_modified"

We can go a step further & take the final line of the last modified file.

#!/bin/bash

last_modified=$(ls -t1 | head -1)

perm=$(awk -F, 'END{print $2}' "$last_modified")

echo "$perm"
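A hypothetical end-to-end run (file names, contents, and the scratch directory demo_dir are all made up; a.csv is backdated with touch so that b.csv is the most recently modified file):

```shell
mkdir -p demo_dir
printf '1,old\n' > demo_dir/a.csv
printf '2,new\n' > demo_dir/b.csv
touch -t 202001010000 demo_dir/a.csv   # backdate a.csv

cd demo_dir
last_modified=$(ls -t1 | head -1)      # newest first, so b.csv
perm=$(awk -F, 'END{print $2}' "$last_modified")
echo "$perm"
# new
```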