Project 1 (Print file):
This is a simple use case: we are going to print the contents of a text file. To do that, we use:
awk '{print}' filename.csv
Here, the single quotes enclose the program and the curly braces enclose an action within that program. In this case, the action is print.
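For example, assuming a hypothetical filename.csv containing:
10500,Jori
9800,Anna
the command echoes each line back unchanged:
10500,Jori
9800,Anna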
Project 2 (Print field(s) from file):
Awk is pretty handy for dealing with columnar data. By default, it treats each whitespace-separated word as a field, but we can change that with the -F flag: awk -F, for comma delimited; awk -F';' for semicolon delimited; and awk -F'|' for pipe delimited (the last two are quoted so the shell doesn't interpret the semicolon or pipe itself).
awk -F, '{print $1}' filename.csv
In the above, we use a number to access a field: $0 refers to the whole line; $1 is field one, $2 is field two, and so on. We can print multiple fields from a file by separating the field numbers with commas:
awk -F, '{print $1, $2}' filename.csv
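By default, print joins those fields with a single space in the output. If you want a different output separator, awk's built-in OFS variable controls it; a quick sketch:
awk -F, 'BEGIN{OFS=" | "} {print $1, $2}' filename.csv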
Project 3 (Calculations on a field & casting):
In the below, we take field 1, divide it by 1,000 and print it alongside field 2:
awk -F, '{print $1/1000, $2}' filename.csv
The output of that is hard to read; we can make it more human readable by casting to int, which truncates the value and gets rid of the decimal places:
awk -F, '{print int($1/1000), $2}' filename.csv
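Alternatively, if you would rather keep some of the precision, awk's built-in printf lets you control the format explicitly; here, two decimal places:
awk -F, '{printf "%.2f %s\n", $1/1000, $2}' filename.csv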
Truncating with int is great if the number is 10.3 and it should be rounded down. But 10.9 should really be rounded up to 11.
We’ll look at that next.
Project 4 (Functions in Awk):
For this, we’re going to create a file. Let’s call it, round_numbers.awk. Inside that file, we’re going to write the below piece of code. This code:
- Defines a function called round, into which we pass the value n (we'll talk about what n is in a second)
- In this function, we take the value of n and add 0.5. Why? Well, imagine you have the number 10.3, if we added 0.5, it would be 10.8, so if we truncated the number, it would correctly round it down to 10. If we had the number 10.9 and added 0.5, we’d now have 11.4. Which means, when we truncate, we would correctly have rounded up to 11.
- We then cast that value to int to actually do the truncation
- We then return the output
- Next, we print field 1, followed by field 1 divided by 1024 and passed through the function we just defined. So, for each line, the value in $1 gets passed into the round function as n. So n is simply the number from the file that we are running the function on.
# round n to the nearest whole number
function round(n) {
    n = n + 0.5   # crosses the next integer whenever the fraction is .5 or more
    n = int(n)    # truncate the decimal part
    return n
}
# print field 1, then field 1 divided by 1024 and rounded
{print $1, round($1/1024)}
To run the script:
awk -F, -f round_numbers.awk filename.csv
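As a rough sketch of what to expect, assuming a hypothetical filename.csv whose first fields are 5400 and 10000: 5400/1024 is about 5.27, which rounds down to 5, while 10000/1024 is about 9.77, which rounds up to 10. The output would be:
5400 5
10000 10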
Project 5 (Filtering / where statements):
Now it gets interesting. You rarely want to work on every row of your file; usually you're looking for something a bit more specific.
Hence, we have the below. Here, we're taking the file and filtering it to print only the rows where field 1 is greater than 10.
awk -F, '$1 > 10 {print $1, $2}' filename.csv
We can have multiple conditions too. Here, we are finding records where field 2 contains Jori and field 1 is greater than 10.
awk -F, '$2 ~ /Jori/ && $1 > 10 {print $1, $2}' filename.csv
We could alter it slightly to show all records where field 2 starts with a J and field 1 is greater than 10. This allows you to pass in regex statements and make it as complex as you like.
awk -F, '$2 ~ /^J/ && $1 > 10 {print $1, $2}' filename.csv
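One thing to note: ~ performs a regex match, so /Jori/ matches any field 2 that merely contains Jori. For an exact match, == works on strings too; a small sketch:
awk -F, '$2 == "Jori" && $1 > 10 {print $1, $2}' filename.csv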
Project 6 (Loops):
The final thing we're going to look at is loops. In the below, I have created a file called loop.awk. This for loop:
- Sets i to be equal to one
- Says, 'while i is less than or equal to n (truncated to an integer), print the value of i'
- Increments i by one on each pass
- We then pass field $1 into the function we defined, so each line's first field becomes n
# print the integers from 1 up to n
function printx(n) {
    for (i = 1; i <= int(n); ++i)
        print i
}
{printx($1)}
To run the script:
awk -F, -f loop.awk filename.csv
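As a quick check, you can pipe a single line in rather than pointing it at a file; with a first field of 3, the script prints 1, 2 and 3 on separate lines:
echo "3,foo" | awk -F, -f loop.awk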
Project 7 (Find the last row of a file & use it in script):
Here, we can simply use the END block, which runs after the last line of the file has been read. At that point the fields still hold the values from the final line, so in the below I have extracted field 2 of the last row.
awk -F, 'END{print $2}' filename.csv
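END is also the natural place for line counts: the built-in NR variable still holds the number of the last record read, so this prints how many lines the file has:
awk 'END{print NR}' filename.csv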
We could even save the extracted field 2 to a variable, called perm.
perm=$(awk -F, 'END{print $2}' filename.csv)
echo "$perm"
Then, we can pass that into a spark-submit call if we like, so we can use our generated value within our scripts.
spark-submit python_file.py "$perm" > stdout.log 2> stderr.log
We can then reference that variable in the Python file using sys.argv, the list of command-line arguments:
import sys
perm = sys.argv[1]
Project 8 (Find last modified file & extract final line):
This is a script we can use to find the most recently modified file in the current directory.
#!/bin/bash
# ls -t sorts by modification time, newest first; -1 prints one entry per line
last_modified=$(ls -t1 | head -1)
echo "$last_modified"
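If the directory holds a mix of file types, you could narrow the listing with a glob; a sketch assuming you only care about CSVs:
last_modified=$(ls -t1 *.csv | head -1)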
We can go a step further & take the final line of the last modified file.
#!/bin/bash
# Most recently modified entry in the current directory
last_modified=$(ls -t1 | head -1)
# Field 2 of that file's final line
perm=$(awk -F, 'END{print $2}' "$last_modified")
echo "$perm"
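And to tie it back to Project 7, that perm value can be handed straight to the Spark job:
spark-submit python_file.py "$perm" > stdout.log 2> stderr.log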