Shell while Loop Considered Harmful
$ printf 'one\ntwo' | while read l; do echo "$l"; done
one
$
Whoops, silent data loss. POSIX requires that text files end in an ultimate newline to be considered text files, but in practice that ultimate newline may be absent, and then read exits nonzero at EOF even though it has filled the variable, so the loop body never runs for that last line. How many shell scripts are out there, and how many ultimate lines have been lost this way?
A less terrible version is, well, verbose.
$ printf 'one\ntwo' | while IFS= read -r l || [ -n "$l" ]; do printf '%s\n' "$l"; done
one
two
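The `|| [ -n "$l" ]` clause works because read fails at EOF even when it has already copied a final, unterminated line into the variable; the test rescues that fragment. A minimal demonstration:

```shell
# read returns failure (status 1) at EOF on an unterminated line,
# yet the variable still holds the data
printf 'last' | { IFS= read -r l; echo "status=$? line=$l"; }
```

This prints status=1 line=last, which is exactly why the plain while-read loop silently skips the line: the loop condition sees the failure and stops before the body runs.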
And, slow. Do you want under a second (C, Perl), or seven seconds (ksh)?
$ perl -E 'say for 1..1_000_000' | time ./shellwhile >/dev/null
0m07.34s real 0m03.48s user 0m03.80s system
$ perl -E 'say for 1..1_000_000' | time ./perlwhile >/dev/null
0m00.71s real 0m00.45s user 0m00.25s system
$ perl -E 'say for 1..1_000_000' | time ./cwhile >/dev/null
0m00.58s real 0m00.03s user 0m00.11s system
Typically the shell will be an order, or orders, of magnitude slower, especially when it forks external tools; the above uses the shell-internal echo to make the shell's numbers less bad. echo is neither portable nor safe for arbitrary input, but the portable printf(1), where it is not a builtin, involves a fork which would make the shell's performance even worse.
Okay for fast prototyping, terrible for most anything else.
Therefore, if shell code will have something tricky and non-performant like a while loop in it, I generally write that code in some other language.
The Benchmark Code
$ cat shellwhile
#!/bin/ksh
while IFS= read -r l || [ -n "$l" ]; do echo "$l"; done
$ cat perlwhile
#!/usr/bin/perl
print while readline;
$ cat cwhile.c
#include <stdio.h>

int
main(void)
{
	char *line = NULL;
	size_t linesize = 0;
	ssize_t linelen;

	while ((linelen = getline(&line, &linesize, stdin)) != -1)
		fwrite(line, linelen, 1, stdout);
	return 0;
}
$ cat lispwhile.lisp
(defun main ()
  (loop for line = (read-line *standard-input* nil nil)
        while line
        do (write-line line)))
(sb-ext:save-lisp-and-die "lispwhile" :executable t :toplevel 'main)
Pattern Recognition
Commands or groups of commands that are run often should probably be rewritten to be more efficient; consider counting the most frequent input lines, which might realistically be some portion of a logfile:
$ printf 'a\nb\na\na\nb\nc\n' | sort | uniq -c | sort -nr
3 a
2 b
1 c
This gets the job done, but is slow. Somewhat faster is to place all the lines into a hash of line => count pairs, and then to sort by the count, all within a single process. More efficient, but you have to actually notice the pattern, worry about the CPU waste, and then write a specific tool for it.
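One way to sketch that single-process hash approach, assuming nothing about the actual tally tool, is awk's associative arrays: build the line => count pairs in one pass, then hand only the unique lines to sort:

```shell
# hash of line => count built in a single awk process;
# only the distinct lines reach the final sort
printf 'a\nb\na\na\nb\nc\n' \
    | awk '{ count[$0]++ } END { for (k in count) print count[k], k }' \
    | sort -nr
```

Same output as the sort | uniq -c | sort -nr pipeline, but the input is sorted at most once, and only after being reduced to its distinct lines.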
        Rate shell  perl tally
shell 88.0/s    --  -41%  -61%
perl   149/s   70%    --  -34%
tally  227/s  158%   52%    --
tags #ksh #c #sh #perl #lisp