Red Green Repeat Adventures of a Spec Driven Junkie

Text Processing Tools in UNIX

This article is the first in a series about my exploration of bringing more security to a server. This part covers basic text processing tools, which will be used in the next part.

Part one of four on more server security:

  1. Text Processing Tools in UNIX
  2. Parsing SSH logs
  3. Configuring fail2ban
  4. Configuring mail (and Slack) for SSH Notifications

If you know cat, |, grep, awk, sed, wc, sort, uniq, and geoiplookup, you can probably skip this article (or tell me how I can improve it!)


Previously, I wrote about how to set up Two-Factor Authentication with SSH on a server.

One thing I never did check: how secure was that setup? Now, I want to be able to verify the security of a server setup.

The best place to start is to check the logs. Logs tell what is going on: Where are login attempts coming from? Which logins succeeded? What user names or ports are being used? And more!

When I started to process logs, I realized there is more data than can be processed by hand. That’s when UNIX text processing utilities such as cat, |, grep, awk, sed, wc, sort, uniq, and geoiplookup came in very handy!

In this article, I will cover these utilities and examples of how I use them.

UNIX Text Processing Tools

I have used grep frequently to look up a variable name in a code base, but never any of the others, let alone put them together. I never quite understood how things ‘connected’ until I looked up the UNIX philosophy:

Write programs that do one thing and do it well. Write programs to
work together. Write programs to handle text streams, because that
is a universal interface.

Now I see each UNIX text processing program as a primitive that does its job very well. Bigger programs can be created by combining these primitives, so understanding them helps in building many bigger programs.

Some essential primitive UNIX programs:

  • cat
  • | (UNIX pipe)
  • grep
  • awk
  • sed
  • wc
  • sort
  • uniq
  • geoiplookup

The only command that is not part of a standard UNIX system is geoiplookup, a tool to look up geographic information about an IP address. This tool will come in handy in part two.

All commands have documentation in UNIX under man:

$ man command


cat, short for concatenate, reads files and writes their contents to standard output. I use it to start feeding an input file into a chain of programs.

Most programs (e.g. grep) can read a file on their own, and that works very well. By using cat to read the file, individual programs can be chained together through a |, and each program can focus on its input stream instead of reading a file, because cat feeds the data in.

cat writes its output to standard output (stdout), which is usually the terminal, unless it is redirected with |, >, or >>.


  • cat file - print ‘file’ to the terminal
  • cat file > file2 - write the contents of ‘file’ into ‘file2’, replacing whatever was there
  • cat file >> file3 - append the contents of ‘file’ to the end of ‘file3’
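
To see the difference between the three forms, here is a minimal sketch that runs them in a throwaway directory. The directory, file names, and file contents are made up for illustration.

```shell
# Work in a temporary directory so nothing real is overwritten.
workdir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$workdir/file"
printf 'gamma\n' > "$workdir/file3"

cat "$workdir/file"                       # print file to the terminal
cat "$workdir/file" > "$workdir/file2"    # create or overwrite file2
cat "$workdir/file" >> "$workdir/file3"   # append to the end of file3

# file3 now holds its old line followed by the two lines from file.
cat "$workdir/file3"
```

Note that > replaces the target file, while >> keeps what was already there.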

| (UNIX pipe)

The |, also known as the UNIX pipe, directs the output of the command on its left into the input of the command on its right. I use | to provide input from one program to another.

By using a |, no intermediary file needs to be written or read. This also makes debugging tricky, as one needs to really understand the intermediary inputs and outputs. (Functional programming!)


  • cat file | program one - send ‘file’ into program one
  • cat file | program one | program two - send ‘file’ into program one, then onto program two
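
One way I would approach the debugging problem is to build a pipeline one stage at a time and look at what each stage outputs before adding the next. A small sketch, assuming a made-up sample file, with grep and wc standing in for ‘program one’ and ‘program two’:

```shell
# Made-up sample input; the contents are only for illustration.
sample=$(mktemp)
printf 'ssh login ok\nhttp request\nssh login failed\n' > "$sample"

# Stage 1 alone: inspect what grep hands to the next program.
cat "$sample" | grep ssh

# Stage 1 + 2: feed grep's output into wc to count the matching lines.
ssh_lines=$(cat "$sample" | grep ssh | wc -l)
echo "$ssh_lines"
```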


grep returns lines matching an expression. An expression can be a direct match (e.g. grep ssh <file> will return any line of the file containing ‘ssh’).

grep will also match a Regular Expression (regex). For example, grep -oE '.*([0-9]{1,3}\.){3}[0-9]{1,3}.*' file will only return lines from file that have IP addresses. The -E option enables extended regular expressions, and -o prints only the matched text, which here is the whole line because of the leading and trailing .*

I usually grep directly against a file: grep <item to match> <file(s)> or whole project: grep -R <item to match> <directory>

When grep is used with cat, the form is simpler: no file argument is required because cat is supplying the input.


  • cat file | grep <item> - return lines which match ‘item’
  • cat file | grep -oE '<regex>' - return only the text that matches ‘regex’ on each matching line
  • cat file | grep -oE '.*([0-9]{1,3}\.){3}[0-9]{1,3}.*' - return any line with an IP Address.
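
The effect of -o is easiest to see side by side. This sketch uses made-up log lines (the addresses come from ranges reserved for documentation) and drops the surrounding .* so that -o returns just the address:

```shell
# Made-up sample lines; only two of them contain an IPv4-shaped string.
log=$(mktemp)
printf 'connect from 192.0.2.10 port 22\nno address here\nconnect from 198.51.100.7 port 2222\n' > "$log"

# Without -o: whole lines that contain something shaped like an IPv4 address.
matching_lines=$(cat "$log" | grep -E '([0-9]{1,3}\.){3}[0-9]{1,3}')

# With -o: only the matched text, so the addresses themselves.
addresses=$(cat "$log" | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}')

echo "$matching_lines"
echo "$addresses"
```

The pattern only checks the shape (four groups of one to three digits), so it would also accept something like 999.999.999.999.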


awk is a tool new to me, and I have found it awesome for pulling values out of very ‘tabular’ data. If I want to get the 3rd word of every line, awk is the tool!

awk is way more extensive than I realized: awk is its own programming language!


  • cat file | awk '{ print $n }' - prints the n-th column of the input (replace n with a number, e.g. $3)
  • cat file | awk '/expression/{ print $0 }' - prints any line matching ‘expression’ (equivalent to grep -E 'expression')
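
Both forms can be combined: a pattern selects the lines and the action picks the column. A minimal sketch on a made-up three-column table:

```shell
# Made-up tabular data: name, age, role.
table=$(mktemp)
printf 'alice 30 admin\nbob 25 user\ncarol 41 admin\n' > "$table"

# Third whitespace-separated column of every line.
third_column=$(cat "$table" | awk '{ print $3 }')

# Pattern plus action: first column of the lines that match /admin/.
admins=$(cat "$table" | awk '/admin/ { print $1 }')

echo "$third_column"
echo "$admins"
```

The single quotes matter: without them the shell would try to expand $3 and split the program into separate arguments before awk sees it.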


sed, short for stream editor, is another tool that I discovered recently. I use sed to clean up input for further processing through its regex matching:


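  • cat file | sed -n -e 's/stuff//p' - removes the first “stuff” from each line and prints only the lines that changed (use s/stuff//g without -n to remove every “stuff” and keep all lines).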
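
The -n option and the p and g flags change the result quite a bit, so here is a small sketch comparing two variants on made-up input:

```shell
# Made-up input: two lines contain "stuff", one does not.
input=$(mktemp)
printf 'keep stuff1 stuff2\nplain line\nstuff3 end\n' > "$input"

# -n plus the p flag: print only lines where a substitution happened;
# without g, only the first "stuff" on each line is removed.
first_only=$(cat "$input" | sed -n -e 's/stuff//p')

# g flag and no -n: remove every "stuff" and keep all lines.
all_removed=$(cat "$input" | sed -e 's/stuff//g')

echo "$first_only"
echo "$all_removed"
```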


wc, short for word count, counts stuff: words, lines, bytes, whatever you want!


  • cat file | wc -l - counts the number of lines in file.
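
Besides -l, the -w and -c options count words and bytes. A quick sketch on a made-up two-line file:

```shell
# Made-up input: two lines, five words, 24 bytes including newlines.
text=$(mktemp)
printf 'one two three\nfour five\n' > "$text"

line_count=$(cat "$text" | wc -l)   # number of lines
word_count=$(cat "$text" | wc -w)   # number of words
byte_count=$(cat "$text" | wc -c)   # number of bytes

echo "$line_count $word_count $byte_count"
```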


sort takes all input and sorts it.


  • cat file | sort - return file in sorted order
  • cat file | sort -n - return file sorted in numeric order
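
The difference between the two matters as soon as the input contains numbers of different lengths. A minimal sketch with made-up values:

```shell
# Made-up numeric input, one value per line.
numbers=$(mktemp)
printf '10\n9\n100\n2\n' > "$numbers"

# Default sort compares text character by character, so "10" sorts before "2".
as_text=$(cat "$numbers" | sort)

# -n compares by numeric value instead.
as_numbers=$(cat "$numbers" | sort -n)

echo "$as_text"
echo "$as_numbers"
```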


uniq filters out repeated lines and/or counts how many times each line appears in the input.

Note: uniq only compares adjacent lines, so it only works as expected on sorted input. Sorting the input before uniq is necessary if a count of items is required.


  • cat file | uniq - collapses adjacent repeated lines in ‘file’ (sort first to get every unique item exactly once).
  • cat file | sort | uniq -c | sort -n - returns the number of times each line is repeated in ‘file’, lowest count first.
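
Here is a small sketch of why the sort matters, using a made-up list of user names in which the repeats are not all next to each other:

```shell
# Made-up input: "root" and "admin" repeat, but not always on adjacent lines.
names=$(mktemp)
printf 'root\nadmin\nroot\nroot\nadmin\ntest\n' > "$names"

# Without sorting, only adjacent repeats collapse, so "root" and "admin"
# still show up twice each.
unsorted=$(cat "$names" | uniq)

# Sort first so identical lines sit together, count them, then order by count.
counts=$(cat "$names" | sort | uniq -c | sort -n)

echo "$unsorted"
echo "$counts"
```

uniq -c pads the count with leading spaces, which sort -n ignores when ordering.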


geoiplookup is used to perform a geolocation lookup of an IP address. It is most useful for finding out where an IP address originates from, which will be useful in part two.

geoiplookup needs a GeoIP database to be installed and kept up to date. I used the free version of MaxMind’s GeoLite.


  • geoiplookup <IP> - look up country-level geography information of the IP
  • geoiplookup -f geo_file <IP> - look up the IP at the level of detail provided by the database file ‘geo_file’
  • cat file | awk '/Invalid user/{ print $0 }' | awk '{ print $10 }' | awk '{ system("geoiplookup " $1) }' - get all lines which have an invalid user on them, then grab the value at the 10th column, and perform a geoiplookup on that value (after the second awk each line holds a single column, so the value is $1).
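
To make the last pipeline concrete, here is a sketch on made-up log lines. The layout (address in the 10th column) and the addresses, taken from documentation ranges, are assumptions for illustration; a real log may put the address in a different column. Once a stage has printed only the address, that address is the first and only field, so the lookup stage reads $1. The lookup only runs if geoiplookup is installed.

```shell
# Made-up log lines; in this layout the source address is the 10th field.
auth_log=$(mktemp)
printf '%s\n' \
  'Jan  1 00:00:01 host sshd[101]: Invalid user admin from 192.0.2.10 port 4000' \
  'Jan  1 00:00:02 host sshd[102]: Accepted publickey for me from 198.51.100.7 port 4001' \
  'Jan  1 00:00:03 host sshd[103]: Invalid user test from 203.0.113.5 port 4002' \
  > "$auth_log"

# Keep only the "Invalid user" lines and print their 10th field.
invalid_ips=$(cat "$auth_log" | awk '/Invalid user/ { print $10 }')
echo "$invalid_ips"

# Each remaining line is a single field, so the address is $1.
if command -v geoiplookup >/dev/null 2>&1; then
  echo "$invalid_ips" | awk '{ system("geoiplookup " $1) }'
fi
```

Because system() hands the text to a shell, I would only do this with input I trust to contain plain addresses.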


I have covered the basics of UNIX text processing tools: cat, |, grep, awk, sed, wc, sort, uniq, and geoiplookup. I have included examples of how I use them.

In the next article, these tools will be applied to an SSH log to find out what is going on in a system.