Linux 'awk'

Overview

awk is a tiny language for scanning and transforming text, perfect for logs, CSV/TSV, and ad-hoc analytics. Think of it as a streaming spreadsheet: you filter rows, pick/reorder columns, compute aggregates, and print results—without leaving the shell.
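The "streaming spreadsheet" idea in one pass, with invented sample data: filter rows, pick columns, and accumulate a total, all in a single command.

```shell
# Hypothetical CSV: name,dept,salary
printf 'alice,eng,100\nbob,sales,80\ncarol,eng,120\n' > /tmp/staff.csv

# Keep only eng rows, print name + salary, then a running total at the end
awk -F, '$2=="eng"{print $1, $3; total+=$3} END{print "total", total}' /tmp/staff.csv
# → alice 100
#   carol 120
#   total 220
```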


TL;DR (Cheat Sheet)

awk reads input one record (line) at a time, splits it into fields $1..$NF, and runs every pattern { action } pair whose pattern matches; a missing action means "print the line", a missing pattern means "every line". Key built-ins: NR (record number), FNR (per-file record number), NF (field count), FS/OFS (input/output field separators), and BEGIN/END blocks that run before and after the data.

Core Patterns You’ll Use Daily

1) Select & reformat columns

# CSV: print 1st & 3rd columns as TSV
awk -F, '{print $1, $3}' OFS='\t' data.csv

2) Skip header, compute totals/avg

awk -F, 'NR>1{n++; sum+=$5} END{print "count="n,"sum="sum,"avg="sum/n}' data.csv

3) Conditional filter (e.g., status ≥ 500)

# Common Log Format: $9 is the HTTP status code
awk '$9 >= 500' access.log

4) Group-by aggregation (sum by user)

awk -F, 'NR>1{sum[$1]+=$3} END{for (u in sum) print u,sum[u]}' OFS=, tx.csv
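The group-by idiom can be sanity-checked on a tiny invented file; the associative array's keys play the role of the GROUP BY column (array iteration order is unspecified, hence the sort).

```shell
# Hypothetical transactions: user,item,amount (with header)
printf 'user,item,amount\nann,book,10\nbob,pen,2\nann,mug,5\n' > /tmp/tx_demo.csv

# Sum amounts per user, skipping the header row
awk -F, 'NR>1{sum[$1]+=$3} END{for (u in sum) print u","sum[u]}' /tmp/tx_demo.csv | sort
# → ann,15
#   bob,2
```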

5) Header-aware merge of many CSVs

# Keep only the first file's header; print all data rows
awk 'FNR==1 && NR!=1{next} {print}' *.csv > merged.csv
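Why this works: FNR resets to 1 at the start of each file while NR keeps counting, so `FNR==1 && NR!=1` is true exactly for the header of every file except the first. A quick check with two invented files:

```shell
# Two hypothetical CSVs sharing a header
printf 'id,val\n1,a\n' > /tmp/p1.csv
printf 'id,val\n2,b\n' > /tmp/p2.csv

# Drop every header line that is not the very first line of the stream
awk 'FNR==1 && NR!=1{next} {print}' /tmp/p1.csv /tmp/p2.csv
# → id,val
#   1,a
#   2,b
```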

6) Global replacements (log levels, etc.)

awk '{gsub(/WARN/, "WARNING"); print}' app.log

7) Strip bad rows (wrong column count / bad numbers)

awk -F, 'NF>=5 && $3 ~ /^[0-9.]+$/' data.csv

8) Fixed-width parsing

# name = cols 1–10, age = 12–14 (trim trailing spaces)
awk '{name=substr($0,1,10); gsub(/ +$/,"",name); age=substr($0,12,3); print name,age}' OFS=, fixed.txt
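A quick check of the fixed-width recipe on one invented row. Note that the age field keeps its left padding; add `age+0` if you want a bare number.

```shell
# Hypothetical fixed-width row: name in cols 1-10, age in cols 12-14
printf '%-10s %3d\n' alice 42 > /tmp/fixed_demo.txt

awk '{name=substr($0,1,10); gsub(/ +$/,"",name); age=substr($0,12,3); print name,age}' OFS=, /tmp/fixed_demo.txt
# → alice, 42
```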

9) Deduplicate lines

awk '!seen[$0]++' input.txt
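How the idiom works: `seen[$0]++` evaluates to the count *before* incrementing, so it is 0 (falsy) the first time a line appears and nonzero after that; negating it keeps only first occurrences, preserving input order (unlike `sort -u`).

```shell
# Duplicates removed, original order kept
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# → a
#   b
#   c
```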

10) Nicely formatted output

awk -F, 'NR>1{printf "%-20s %8.2f\n", $1, $3}' data.csv

CSV Gotchas (and Solutions)

CSV is tricky (quotes, commas inside quotes). GNU awk (gawk) lets you define what a field looks like, rather than what a separator looks like, via FPAT:

gawk -v FPAT='([^,]*)|("[^"]*")' '
  NR==1{print "user,total"; next}
  # Strip quotes from the amount too, or a quoted "42" coerces to 0 when summed
  { amt = $3; gsub(/"/,"",$1); gsub(/"/,"",amt); sum[$1]+=amt }
  END{for (u in sum) print u "," sum[u]}
' data.csv
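FPAT can be sanity-checked on a single quoted row (input invented here): the comma inside the quotes no longer splits the field.

```shell
# One hypothetical row with an embedded comma
printf '"Doe, Jane",acct,42\n' | gawk -v FPAT='([^,]*)|("[^"]*")' '{print $1, $3}'
# → "Doe, Jane" 42
```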

For fully robust CSV/JSON, consider specialized tools (mlr, xsv, jq). Use awk when the format is predictable.


Date/Time Tricks (gawk)

# Input: 2025-08-20T09:15:32 → hour bucket
gawk -F'[T:]' '{hour=$2; cnt[hour]++} END{for(h in cnt) printf "%02d,%d\n",h,cnt[h]}' events.txt | sort -t, -k1,1n
# Parse epoch & print human time
gawk '{print strftime("%Y-%m-%d %H:%M:%S", $1)}' epochs.txt

Performance Tips

Set LC_ALL=C to get byte-wise matching, which is often much faster than UTF-8 locales on ASCII data. For very large streams, mawk is frequently several times faster than gawk. Prefer sub() over gsub() when one replacement is enough, test cheap conditions (string compares, NF) before expensive regexes, and exit (or, in gawk, nextfile) as soon as you have the answer.

In-Place Editing (gawk)

# Replace and write back (gawk extension)
gawk -i inplace '{gsub(/DEBUG/, "INFO"); print}' app.log

(Portable alternative: write to temp file, then mv.)
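The portable temp-file route might look like this (filenames illustrative); the `&&` ensures the original is only replaced if awk succeeded.

```shell
# Works with any POSIX awk: transform to a temp file, then replace the original
printf 'DEBUG start\nINFO ok\n' > /tmp/app_demo.log
awk '{gsub(/DEBUG/, "INFO"); print}' /tmp/app_demo.log > /tmp/app_demo.log.tmp &&
  mv /tmp/app_demo.log.tmp /tmp/app_demo.log
cat /tmp/app_demo.log
# → INFO start
#   INFO ok
```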


Mini Cookbook

Top N users by occurrences

awk '{c[$1]++} END{for(u in c) print c[u],u}' file | sort -nr | head

Join two files by line number (2-column report)

paste ids.txt amounts.txt | awk -F'\t' '{print $1 "," $2}'

Unique rows by key (first field)

awk -F, '!seen[$1]++' data.csv

Histogram of HTTP codes (9th field)

awk '{h[$9]++} END{for(k in h) print k,h[k]}' access.log | sort -k1,1n

When Not to Use awk

Skip awk when the input fights line-oriented processing: CSV with embedded newlines (use mlr or csvkit), JSON (jq), XML/HTML (a real parser), or anything that needs complex state, libraries, or tests (Python or Perl).

Bottom line: If you can say it in a sentence (“sum column 3 by user, skip header”), you can usually write it as a one-line awk program.