Solving a 1993 Programming Challenge in 2022 (Updated)
I recently came across a collection of old (1990s) “programming challenges”. I thought it might be amusing to tackle one of these challenges using technologies from the period in which they were posed, and compare the solution to one using contemporary techniques. In other words, do the same problem in C and Python.
The Problem
Stripped to its essence, the question is to identify, among a list of words,
those words that are not anagrams of another word from the list. In other
words, to identify those entries in the list that are not permutations of
another entry. The comparison is not case sensitive, so that tIeD and EdiT are considered permutations of each other.
The formulation of the challenge contains very detailed specifications of the expected input and output formats, which I’ll quote here in full:
The dictionary will contain no more than 1000 words. Input will consist of a series of lines. No line will be more than 80 characters long, but may contain any number of words. Words consist of up to 20 upper and/or lower case letters, and will not be broken across lines. Spaces may appear freely around words, and at least one space separates multiple words on the same line. (…) The file will be terminated by a line consisting of a single #. Output will consist of a series of lines. Each line will consist of a single word that is a relative anagram in the input dictionary. Words must be output in lexicographic (case-sensitive) order.
For my solutions, I added the requirement that the input will always be read from standard input (rather than a named file).
The original challenge provided a sample input:
ladder came tape soon leader acme RIDE lone Dreis peat
ScAlE orb eye Rides dealer NotE derail LaCeS drIed
noel dire Disk mace Rob dries
#
which is expected to produce the following output:
Disk
NotE
derail
drIed
eye
ladder
soon
Solution 1 (Contemporary):
Here is the Python version:
import sys

# Read words into terms
terms = []
for line in sys.stdin:
    if line.startswith( "#" ):
        break
    terms += line.strip().split()

# Create dictionary of sorted words as keys; index of raw term as values
hashes = {}
for i, t in enumerate(terms):
    key = ''.join( sorted( t.lower() ) )
    hashes.setdefault( key, [] ).append( i )

# Eliminate hashes that occurred more than once
for key in list( hashes.keys() ):
    if len(hashes[key]) > 1:
        del hashes[key]

# Output
print( "\n".join( sorted( [ terms[ hashes[k][0] ] for k in hashes ] ) ) )
The main “algorithm”, such as it is, consists of sorting the letters in each word alphabetically, thus creating a “hash” of each word. Because permutations hash to the same key, the solution consists of those hashes that occur only once.
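As a quick sanity check (in the interactive interpreter, not part of the solution itself): two permutations reduce to the same key, while an unrelated word does not.

>>> ''.join( sorted( "tIeD".lower() ) )
'deit'
>>> ''.join( sorted( "EdiT".lower() ) )
'deit'
>>> ''.join( sorted( "soon".lower() ) )
'noos'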
Solution 2 (Period):
Experienced C programmers will have read the detailed spec of the input format with a sense of recognition: oh, yes, you need to worry about the length of strings and arrays, explicitly, all the time. And it helps a lot knowing what the upper bounds are.
With this in mind, here is my C solution, which takes full advantage of knowing the maximum size of the problem ahead of time:
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>

#define MAXTERM 1000
#define MAXLINE 80
#define MAXWORD 20

int cmp( const void *a, const void *b ) {
    char *p = (char *)a;
    char *q = (char *)b;
    return *p - *q;
}

int cmp2( const void *a, const void *b ) {
    char **p = (char **)a;
    char **q = (char **)b;
    return strcmp( *p, *q );
}

int read_and_allocate_terms( char **terms ) {
    char *line = (char *)malloc( MAXLINE+1 );
    char *res;

    int words = 0;
    int i, j; /* start and end of word in input line */

    while( 1 ) {
        res = fgets( line, MAXLINE+1, stdin );

        if( !res ) { /* should not happen: input is terminated by '#' */
            fprintf( stderr, "Null pointer while reading\n" );
            free( line );
            return words;
        }

        if( res[0] == '#' ) { /* stop reading if first char is #; ignore newline */
            free( line );
            return words;
        }

        for( i=0; ; i++ ) {
            /* Break if end-of-line */
            if( line[i] == '\n' ) {
                break;
            }

            /* Iterate to find start of word */
            if( line[i] != ' ' ) {

                /* Iterate to find end of word */
                for( j=i+1; ; j++ ) {
                    if( line[j] == ' ' || line[j] == '\n' ) {
                        terms[words] = strndup( line+i, j-i );
                        words += 1;
                        i = j-1; /* resume at current position of j: for(i) will increment */
                        break;
                    }
                }
            }
        }
    }
}

int main() {
    char **terms  = (char **)malloc( MAXTERM * sizeof(char *) );
    char **sorted = (char **)malloc( MAXTERM * sizeof(char *) );
    char **output = (char **)malloc( MAXTERM * sizeof(char *) );
    int *res = (int *)malloc( MAXTERM * sizeof(int) );

    int i, j, words, matches, flag;

    /* populate terms with the input words */
    words = read_and_allocate_terms( terms );

    /* duplicate input, lowercase, then sort each word alphabetically */
    for( i=0; i<words; i++ ) {
        sorted[i] = strdup( terms[i] );
        for( j=0; j<strlen(sorted[i]); j++ ) {
            sorted[i][j] = tolower( sorted[i][j] );
        }
        qsort( sorted[i], strlen(sorted[i]), sizeof(char), cmp );
    }

    /* compare all sorted words to all others; remember those that don't match */
    matches = 0;
    for( i=0; i<words; i++ ) {
        flag = 0;
        for( j=0; j<words; j++ ) {
            if( i!=j && strcmp( sorted[i], sorted[j] ) == 0 ) {
                flag = 1;
            }
        }

        /* if no match, remember the index of the trial word */
        if( flag == 0 ) {
            res[ matches ] = i;
            matches += 1;
        }
    }

    /* copy, then sort for output */
    for( i=0; i<matches; i++ ) {
        output[i] = strdup( terms[res[i]] );
    }
    qsort( output, matches, sizeof(char *), cmp2 );
    for( i=0; i<matches; i++ ) {
        printf( "%s\n", output[i] );
    }

    /* clean up */
    for( i=0; i<words; i++ ) {
        free( terms[i] );
        free( sorted[i] );
    }
    for( i=0; i<matches; i++ ) {
        free( output[i] );
    }
    free( terms );
    free( sorted );
    free( output );
    free( res );

    return EXIT_SUCCESS;
}
Notice how much of the program deals only with reading and parsing the input! Essentially the entire routine read_and_allocate_terms() does nothing but read the input, identify individual words, and store them for later use. Another significant source of boilerplate is the explicit memory management: not only all the calls to malloc() (directly, and indirectly through strdup()), but also the crucial calls to free() at the end.
The main “algorithm” for detecting duplicate permutations (or their absence) is virtually unchanged from the Python version. Without a readily available hashmap, the program instead compares every sorted word against every other one, a quadratic scan over the array of hashes. Given the small size of the input (at most 1000 words), this is not a problem in this case.
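If the quadratic scan ever became a concern, one could still do without a hashmap by sorting the hashes themselves and counting adjacent duplicates. The following stand-alone sketch only illustrates the idea: the key/word pairs are hard-coded stand-ins for the real input, and the struct and helper names are my own invention, not part of the solution above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct entry {
    const char *key;   /* letters of the word, lowercased and sorted */
    const char *word;  /* the original word */
};

static int by_key( const void *a, const void *b ) {
    return strcmp( ((const struct entry *)a)->key,
                   ((const struct entry *)b)->key );
}

int main( void ) {
    /* hard-coded stand-ins for the real input */
    struct entry e[] = {
        { "deit", "tIeD" }, { "deir", "dire" }, { "noos", "soon" },
        { "deit", "EdiT" }, { "deir", "RIDE" },
    };
    size_t n = sizeof(e) / sizeof(e[0]), i, j;

    /* group identical keys next to each other */
    qsort( e, n, sizeof(e[0]), by_key );

    /* a key that occurs exactly once marks a "relative anagram" */
    for( i = 0; i < n; i = j ) {
        for( j = i + 1; j < n && strcmp( e[i].key, e[j].key ) == 0; j++ )
            ;
        if( j - i == 1 )
            printf( "%s\n", e[i].word );
    }
    return EXIT_SUCCESS;
}

In the real program the entries would be built from terms and sorted, and the surviving words would still need a final sort for output; the point is only that sorting can stand in for hashing when grouping duplicates.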
Finally: although I tested the code, I make no claims that it will correctly handle edge cases and not leak memory!
Conclusion
At least to me, one of the most welcome changes in programming overall has been the reduction of effort required for “non-essential tasks”: bookkeeping, resource management, I/O. In a similar spirit, I am deeply grateful for the (now) universal availability of basic data structures: hashmaps, resizable arrays, proper strings.
Postscript
It later occurred to me that, given the specifics of the input format,
it might be more convenient to read the input character-by-character,
rather than breaking it into lines. Here is an alternative version of
the read_and_allocate_terms() routine that uses getchar() instead:
int read_and_allocate_terms_getchar( char **terms ) {
    char *buf = (char *)malloc( MAXWORD );
    char *p = buf;

    int words = 0;
    int c, flag; /* flag is 1 while actively scanning a word; 0 for whitespace */

    flag = 0;
    while( 1 ) {
        c = getchar();

        if( c=='#' || c==EOF ) {
            free( buf );
            return words;
        }

        if( flag == 0 ) {
            if( c!=' ' && c!='\n' ) {
                flag = 1;
                *p++ = c;
            }
        } else {
            if( c!=' ' && c!='\n' ) {
                *p++ = c;
            } else {
                flag = 0;
                terms[words] = strndup( buf, p-buf );
                words += 1;
                p = buf;
            }
        }
    }
}
It is amusing to see that employing a “higher-level abstraction” (lines and words, instead of individual characters) in this case leads to more code, not less.
Post-Postscript
An astute reader reminded me of the strtok() function that breaks a string into delimiter-separated tokens. So, here is yet another version of the read_and_allocate_terms() routine; this one uses strtok():
int read_and_allocate_terms_strtok( char **terms ) {
    char *line = (char *)malloc( MAXLINE+1 );
    char *res;

    int words = 0;
    char *tok;
    char *delim = " \n";

    while( 1 ) {
        res = fgets( line, MAXLINE+1, stdin );

        if( !res ) { /* should not happen: input is terminated by '#' */
            fprintf( stderr, "Null pointer while reading\n" );
            free( line );
            return words;
        }

        if( res[0] == '#' ) { /* stop reading if first char is #; ignore newline */
            free( line );
            return words;
        }

        tok = strtok( res, delim );
        if( tok ) {
            terms[words] = strdup( tok );
            words += 1;
        }
        while( (tok = strtok( NULL, delim )) != NULL ) {
            terms[words] = strdup( tok );
            words += 1;
        }
    } /* end while( 1 ) */
}
Curiously, this version isn’t actually shorter than the one using getchar(), but it is arguably easier to understand, because there is less pointer arithmetic and the logic is simpler.
That being said, it is no accident that I forgot about strtok() on my previous pass: it is not a routine that I am comfortable with. Not only does it maintain state between invocations (rare for a C function) and behave differently on subsequent calls (it is not idempotent), it also mangles its input (and will therefore fail mysteriously when invoked on a constant string). Its Linux man page suggests a certain queasiness, and contains some actual ambiguity regarding the semantics of the delim argument: it is not entirely clear whether any one of the characters in the supplied string serves as a delimiter (this is in fact the case), or whether only the fixed combination of the entire argument separates tokens.
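To settle that last point empirically, here is a tiny stand-alone check (not part of the solution) showing that each individual character of the delimiter string acts as a separator:

#include <stdio.h>
#include <string.h>

int main( void ) {
    char buf[] = "ladder came\ntape";  /* note: strtok modifies this buffer */
    char *tok;

    /* " \n" means: a space OR a newline ends a token */
    for( tok = strtok( buf, " \n" ); tok != NULL; tok = strtok( NULL, " \n" ) ) {
        printf( "[%s]\n", tok );
    }
    return 0;
}

This prints [ladder], [came], and [tape] on separate lines; if the entire two-character sequence had to appear between tokens, the lone space and the lone newline would not split the words.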