Random Line From File
The problem here, assuming a unix text file, is that one does not know how many lines there are, nor where those lines begin. So one might first count the lines, roll a random number in that range, and then read back through the file to print the line with that number.
awk -v line_idx=$(( RANDOM % $(wc -l "$1" | awk '{print $1}') + 1 )) 'NR == line_idx {print;exit}' "$1"
The original code lacked the ";exit" and so would continue looping over the remaining lines, which is not very efficient. Note also that the shell's RANDOM typically tops out at 32767, so this one-liner cannot reach lines beyond that point in a large file, and the modulo introduces a slight bias.
Another option is to reservoir-sample the lines. See "The Art of Computer Programming, Volume 2, Section 3.4.2" (Donald E. Knuth) for details.
perl -e 'rand($.) < 1 && ($p = $_) while <>; print $p'
An advantage here is that one need not know in advance how many lines there are, so this code works on a stream. The algorithm can be ported to C or such if you want more speed; a sketch follows. (The shell code could also handle a stream, but would first need to save the stream to a temporary file, which would be even less efficient.)
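For the curious, a minimal C sketch of the same size-one reservoir sample (an illustration of the porting idea, not necessarily the exact C program benchmarked below; the 8192-byte buffer and the use of POSIX random() are my assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    char line[8192], pick[8192] = "";
    unsigned long seen = 0;

    srandom((unsigned) time(NULL));

    /* size-one reservoir sample: keep line n with probability 1/n.
     * random() % seen has a tiny modulo bias, and lines longer than
     * the buffer get split; both are fine for a sketch. */
    while (fgets(line, sizeof line, stdin) != NULL) {
        seen++;
        if (random() % seen == 0)
            strcpy(pick, line);
    }
    if (seen)
        fputs(pick, stdout);
    return 0;
}

Compile and feed it the file on standard input, something like "cc -o randline randline.c && ./randline < tenk" (the randline name is made up).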
Benchmark
Not very scientific; this would probably need longer durations, more runs, and several different systems for better results. Higher rates are better.
$ jot 10 > ten
$ perl bench.pl ten
     Rate   sh   pl    C
sh 70.7/s   -- -48% -80%
pl  136/s  93%   -- -61%
C   350/s 395% 157%   --
$ jot 100 > hun
$ perl bench.pl hun
     Rate   sh   pl    C
sh 82.7/s   -- -42% -76%
pl  141/s  71%   -- -59%
C   348/s 321% 146%   --
$ jot 1000 > thou
$ perl bench.pl thou
     Rate   sh   pl    C
sh 83.0/s   -- -33% -75%
pl  124/s  50%   -- -63%
C   332/s 300% 168%   --
$ jot 10000 > tenk
$ perl bench.pl tenk
     Rate   sh   pl    C
sh 75.6/s   --  -1% -73%
pl 76.7/s   1%   -- -72%
C   276/s 264% 259%   --
It is somewhat interesting that the Perl slows down to match the shell code as the number of input lines increases; presumably the cost of calling rand on every line eventually catches up with the fixed process-startup overhead that dominates the shell version on small inputs.
gshuf of the coreutils package is another option (e.g. "gshuf -n 1 tenk" picks one line), but I have no idea what algorithm gshuf uses.
Source
tags #perl #unix