Ever had an encounter with the most…awkward programming language?
Sure, awk may have a slightly awkward name, and at times slightly awkward syntax, but overall it's far from an awkward language. In fact, it is one of the most powerful tools any programmer can have for performing simple or more complex text-processing operations on a Unix-based platform.
Today, I'm going to quickly show you how to extract some simple statistical metrics from a numerical list stored in a file (steps.txt): min, max and mean.
Let’s say you have a file with the number of steps you’ve walked during a week starting on Monday (order of days doesn’t really matter in this case and let’s consider for simplicity that the file doesn’t contain the ‘#’ comment part with the day names):
```
5101   # Mon
10418  # Tue
4127   # Wed
8912   # Thu
11100  # Fri
1309   # Sat -- looks like too much Netflix over the weekend 😀
1124   # Sun
```
Normally, you would write a Python or R script for these tasks, and in Python it would look something like this:
```python
import numpy as np
import pandas as pd

df = pd.read_csv('steps.txt', header=None)
steps = df.iloc[:, 0]

# min
print(np.min(steps))
# max
print(np.max(steps))
# mean
print(np.mean(steps))
```
That's pretty neat of course, but there may be cases where you want to do this instantly, right from your favourite shell terminal. Besides, when your input file is very large (e.g. 20 GB) and you don't have enough physical memory to hold all of its contents, it may actually be much more straightforward to calculate the min, max and mean using awk instead of Python. Of course, you can still do it with Python, but then you would have to read your file in chunks that fit in your machine's physical memory, do the calculations within each chunk and then merge the results.
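For the curious, that chunked approach can be sketched in plain Python with just the standard library. This is only an illustration, not a reference implementation: stream_stats and chunk_size are made-up names, and the file is assumed to hold one number at the start of each line (anything after it, like the # comments, is ignored):

```python
# Streaming min/max/mean: read the file in fixed-size chunks of lines,
# so memory use stays bounded no matter how large the file is.
from itertools import islice

def stream_stats(path, chunk_size=100_000):
    mn = mx = None
    total = 0.0
    count = 0
    with open(path) as f:
        while True:
            # Take the first numeric field of each line, one chunk at a time
            chunk = [float(line.split()[0]) for line in islice(f, chunk_size)]
            if not chunk:
                break
            mn = min(chunk) if mn is None else min(mn, min(chunk))
            mx = max(chunk) if mx is None else max(mx, max(chunk))
            total += sum(chunk)
            count += len(chunk)
    return mn, mx, total / count
```

The reason this works so cleanly is that each chunk's partial results (its min, its max, its sum and count) merge trivially, which is exactly what makes these three statistics easy to stream.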
When it comes to awk though, you can just parse your file line by line – so no need to hold any big data in memory – and do the calculations as below:
```
# min
awk 'NR == 1 || $1 < min {min = $1} END {print min}' steps.txt
```
```
# max
awk 'NR == 1 || $1 > max {max = $1} END {print max}' steps.txt
```
```
# mean
awk '{sum += $1} END {print sum / NR}' steps.txt
```
or have it all together on a single line:
```
$ awk 'NR == 1 || $1 < min {min = $1}; NR == 1 || $1 > max {max = $1}; {sum += $1} END {print "Min: " min; print "Max: " max; print "Mean: " sum / NR}' steps.txt
```
For further usability you can add it as a function in your ~/.bashrc:
```
function awks {
    awk 'NR == 1 || $1 < min {min = $1}; NR == 1 || $1 > max {max = $1}; {sum += $1} END {print "Min: " min; print "Max: " max; print "Mean: " sum / NR}' "$1"
}
```
source your bashrc (. ~/.bashrc) and then you'll be able to call it like this:
```
$ awks steps.txt
```
and…voilà!
```
Min: 1124
Max: 11100
Mean: 6013
```
Getting some other stats like the median or other percentiles with awk does get a bit more complicated and probably not very efficient. But just for demonstration purposes: to calculate the median you first sort the values in your file, then parse it, storing each value in an array, and finally pick the median based on whether the total number of values is odd or even:
```
$ sort -n --parallel=[num_of_cores] steps.txt | awk '{vals[NR] = $1} END {if (NR % 2) {print vals[(NR + 1) / 2]} else {print (vals[NR / 2] + vals[NR / 2 + 1]) / 2.0}}'
```
Hope you have an aw…esome experience with awk! 🙂
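P.S. If you do find yourself needing the median or other percentiles regularly, dropping back to Python is usually simpler. A minimal sketch using only the standard library's statistics module (note that, like the awk version, it keeps all values in memory):

```python
# Median and quartiles with Python's statistics module.
from statistics import median, quantiles

values = [5101, 10418, 4127, 8912, 11100, 1309, 1124]
print(median(values))          # -> 5101, the middle of the sorted list
print(quantiles(values, n=4))  # the three quartile cut points
```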