Solving a 1993 Programming Challenge in 2022 (Updated)
I recently came across a collection of old (1990s) “programming challenges”. I thought it might be amusing to tackle one of these challenges using technologies from the period in which they were posed, and compare the solution to one using contemporary techniques. In other words, do the same problem in C and Python.
The Problem
Stripped to its essence, the question is to identify, among a list of words,
those words that are not anagrams of another word from the list. In other
words, to identify those entries in the list that are not permutations of
another entry. The comparison is not case sensitive, so that tIeD and EdiT are considered permutations of each other.
The formulation of the challenge contains very detailed specifications of the expected input and output formats, which I’ll quote here in full:
The dictionary will contain no more than 1000 words. Input will consist of a series of lines. No line will be more than 80 characters long, but may contain any number of words. Words consist of up to 20 upper and/or lower case letters, and will not be broken across lines. Spaces may appear freely around words, and at least one space separates multiple words on the same line. (…) The file will be terminated by a line consisting of a single #. Output will consist of a series of lines. Each line will consist of a single word that is a relative anagram in the input dictionary. Words must be output in lexicographic (case-sensitive) order.
For my solutions, I added the requirement that the input will always be read from standard input (rather than a named file).
The original challenge provided a sample input:
ladder came tape soon leader acme RIDE lone Dreis peat
ScAlE orb eye Rides dealer NotE derail LaCeS drIed
noel dire Disk mace Rob dries
#
which is expected to produce the following output:
Disk
NotE
derail
drIed
eye
ladder
soon
Solution 1 (Contemporary):
Here is the Python version:
import sys

# Read words into terms
terms = []
for line in sys.stdin:
    if line.startswith( "#" ):
        break
    terms += line.strip().split()

# Create dictionary of sorted words as keys; index of raw term as values
hashes = {}
for i, t in enumerate(terms):
    key = ''.join( sorted( t.lower() ) )
    hashes.setdefault( key, [] ).append( i )

# Eliminate hashes that occurred more than once
for key in list( hashes.keys() ):
    if len(hashes[key]) > 1:
        del hashes[key]

# Output
print( "\n".join( sorted( [ terms[ hashes[k][0] ] for k in hashes ] ) ) )
The main “algorithm”, such as it is, consists of sorting the letters in each word alphabetically, thus creating a “hash” of each word. Because permutations hash to the same key, the solution consists of those hashes that occur only once.
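As a quick sanity check (in the interactive interpreter, not part of the solution itself): two permutations reduce to the same key, while an unrelated word does not.

>>> ''.join( sorted( "tIeD".lower() ) )
'deit'
>>> ''.join( sorted( "EdiT".lower() ) )
'deit'
>>> ''.join( sorted( "soon".lower() ) )
'noos'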
Solution 2 (Period):
Experienced C programmers will have read the detailed spec of the input format with a sense of recognition: oh, yes, you need to worry about the length of strings and arrays, explicitly, all the time. And it helps a lot knowing what the upper bounds are.
With this in mind, here is my C solution, which takes full advantage of knowing the maximum size of the problem ahead of time:
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>

#define MAXTERM 1000
#define MAXLINE 80
#define MAXWORD 20

int cmp( const void *a, const void *b ) {
    char *p = (char *)a;
    char *q = (char *)b;
    return *p - *q;
}

int cmp2( const void *a, const void *b ) {
    char **p = (char **)a;
    char **q = (char **)b;
    return strcmp( *p, *q );
}

int read_and_allocate_terms( char **terms ) {
    char *line = (char *)malloc( MAXLINE+1 );
    char *res;

    int words = 0;
    int i, j; /* start and end of word in input line */

    while( 1 ) {
        res = fgets( line, MAXLINE+1, stdin );

        if( !res ) { /* should not happen: input is terminated by '#' */
            fprintf( stderr, "Null pointer while reading\n" );
            free( line );
            return words;
        }

        if( res[0] == '#' ) { /* stop reading if first char is #; ignore newline */
            free( line );
            return words;
        }

        for( i=0; ; i++ ) {
            /* Break if end-of-line */
            if( line[i] == '\n' ) {
                break;
            }

            /* Iterate to find start of word */
            if( line[i] != ' ' ) {

                /* Iterate to find end of word */
                for( j=i+1; ; j++ ) {
                    if( line[j] == ' ' || line[j] == '\n' ) {
                        terms[words] = strndup( line+i, j-i );
                        words += 1;
                        i = j-1; /* resume at current position of j: for(i) will increment */
                        break;
                    }
                }
            }
        }
    }
}

int main() {
    char **terms  = (char **)malloc( MAXTERM * sizeof(char *) );
    char **sorted = (char **)malloc( MAXTERM * sizeof(char *) );
    char **output = (char **)malloc( MAXTERM * sizeof(char *) );
    int *res = (int *)malloc( MAXTERM * sizeof(int) );

    int i, j, words, matches, flag;

    /* populate terms with the input words */
    words = read_and_allocate_terms( terms );

    /* duplicate input, lowercase, then sort each word alphabetically */
    for( i=0; i<words; i++ ) {
        sorted[i] = strdup( terms[i] );
        for( j=0; j<strlen(sorted[i]); j++ ) {
            sorted[i][j] = tolower( sorted[i][j] );
        }
        qsort( sorted[i], strlen(sorted[i]), sizeof(char), cmp );
    }

    /* compare all sorted words to all others; remember those that don't match */
    matches = 0;
    for( i=0; i<words; i++ ) {
        flag = 0;
        for( j=0; j<words; j++ ) {
            if( i!=j && strcmp( sorted[i], sorted[j] ) == 0 ) {
                flag = 1;
            }
        }

        /* if no match, remember the index of the trial word */
        if( flag == 0 ) {
            res[ matches ] = i;
            matches += 1;
        }
    }

    /* copy, then sort for output */
    for( i=0; i<matches; i++ ) {
        output[i] = strdup( terms[res[i]] );
    }
    qsort( output, matches, sizeof(char *), cmp2 );
    for( i=0; i<matches; i++ ) {
        printf( "%s\n", output[i] );
    }

    /* clean up */
    for( i=0; i<words; i++ ) {
        free( terms[i] );
        free( sorted[i] );
    }
    for( i=0; i<matches; i++ ) {
        free( output[i] );
    }
    free( terms );
    free( sorted );
    free( output );
    free( res );

    return EXIT_SUCCESS;
}
Notice how much of the program deals only with reading and parsing the input! Essentially the entire routine read_and_allocate_terms() does nothing but read the input, identify individual words, and store them for later use. Another significant source of boilerplate is the explicit memory management: not only all the calls to malloc() (directly, and indirectly through strdup()), but also the crucial calls to free() at the end.
The main “algorithm” for detecting duplicate permutations (or their absence) is virtually unchanged from the Python version. Without a readily available hashmap, the program instead compares every sorted word against every other one, a quadratic scan over the array of hashes. Given the small size of the input (at most 1000 words), this is not a problem in this case.
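If the quadratic scan ever became a concern, one could still do without a hashmap by sorting the hashes themselves and counting adjacent duplicates. The following stand-alone sketch only illustrates the idea: the key/word pairs are hard-coded stand-ins for the real input, and the struct and helper names are my own invention, not part of the solution above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct entry {
    const char *key;   /* letters of the word, lowercased and sorted */
    const char *word;  /* the original word */
};

static int by_key( const void *a, const void *b ) {
    return strcmp( ((const struct entry *)a)->key,
                   ((const struct entry *)b)->key );
}

int main( void ) {
    /* hard-coded stand-ins for the real input */
    struct entry e[] = {
        { "deit", "tIeD" }, { "deir", "dire" }, { "noos", "soon" },
        { "deit", "EdiT" }, { "deir", "RIDE" },
    };
    size_t n = sizeof(e) / sizeof(e[0]), i, j;

    /* group identical keys next to each other */
    qsort( e, n, sizeof(e[0]), by_key );

    /* a key that occurs exactly once marks a "relative anagram" */
    for( i = 0; i < n; i = j ) {
        for( j = i + 1; j < n && strcmp( e[i].key, e[j].key ) == 0; j++ )
            ;
        if( j - i == 1 )
            printf( "%s\n", e[i].word );
    }
    return EXIT_SUCCESS;
}

In the real program the entries would be built from terms and sorted, and the surviving words would still need a final sort for output; the point is only that sorting can stand in for hashing when grouping duplicates.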
Finally: although I tested the code, I make no claims that it will correctly handle edge cases and not leak memory!
Conclusion
At least to me, one of the most welcome changes in programming overall has been the reduction of effort required for “non-essential tasks”: bookkeeping, resource management, I/O. In a similar spirit, I am deeply grateful for the (now) universal availability of basic data structures: hashmaps, resizable arrays, proper strings.
Postscript
It later occurred to me that, given the specifics of the input format,
it might be more convenient to read the input character-by-character,
rather than breaking it into lines. Here is an alternative version of
the read_and_allocate_terms() routine that uses getchar() instead:
int read_and_allocate_terms_getchar( char **terms ) {
    char *buf = (char *)malloc( MAXWORD );
    char *p = buf;

    int words = 0;
    int c, flag; /* flag is 1 while actively scanning a word; 0 for whitespace */

    flag = 0;
    while( 1 ) {
        c = getchar();

        if( c=='#' || c==EOF ) {
            free( buf );
            return words;
        }

        if( flag == 0 ) {
            if( c!=' ' && c!='\n' ) {
                flag = 1;
                *p++ = c;
            }
        } else {
            if( c!=' ' && c!='\n' ) {
                *p++ = c;
            } else {
                flag = 0;
                terms[words] = strndup( buf, p-buf );
                words += 1;
                p = buf;
            }
        }
    }
}
It is amusing to see that employing a “higher-level abstraction” (lines and words, instead of individual characters) in this case leads to more code, not less.
Post-Postscript
An astute reader reminded me of the strtok() function that breaks a string into delimiter-separated tokens. So, here is yet another version of the read_and_allocate_terms() routine; this one uses strtok():
int read_and_allocate_terms_strtok( char **terms ) {
    char *line = (char *)malloc( MAXLINE+1 );
    char *res;

    int words = 0;
    char *tok;
    char *delim = " \n";

    while( 1 ) {
        res = fgets( line, MAXLINE+1, stdin );

        if( !res ) { /* should not happen: input is terminated by '#' */
            fprintf( stderr, "Null pointer while reading\n" );
            free( line );
            return words;
        }

        if( res[0] == '#' ) { /* stop reading if first char is #; ignore newline */
            free( line );
            return words;
        }

        tok = strtok( res, delim );
        if( tok ) {
            terms[words] = strdup( tok );
            words += 1;
        }
        while( (tok = strtok( NULL, delim )) != NULL ) {
            terms[words] = strdup( tok );
            words += 1;
        }
    } /* end while( 1 ) */
}
Curiously, this version isn’t actually shorter than the one using getchar(), but it is arguably easier to understand, because there is less pointer arithmetic and the logic is simpler.
That being said, it is no accident that I forgot about strtok() on my previous pass: it is not a routine that I am comfortable with. Not only does it maintain state between invocations (rare for a C function) and behave differently on subsequent calls (it is not idempotent), it also mangles its input (and will therefore fail mysteriously when invoked on a constant string). Its Linux man page suggests a certain queasiness, and contains some actual ambiguity regarding the semantics of the delim argument: it is not entirely clear whether any one of the characters in the supplied string serves as a delimiter (this is in fact the case), or whether only the fixed combination of the entire argument separates tokens.
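To settle that last point empirically, here is a tiny stand-alone check (not part of the solution) showing that each individual character of the delimiter string acts as a separator:

#include <stdio.h>
#include <string.h>

int main( void ) {
    char buf[] = "ladder came\ntape";  /* note: strtok modifies this buffer */
    char *tok;

    /* " \n" means: a space OR a newline ends a token */
    for( tok = strtok( buf, " \n" ); tok != NULL; tok = strtok( NULL, " \n" ) ) {
        printf( "[%s]\n", tok );
    }
    return 0;
}

This prints [ladder], [came], and [tape] on separate lines; if the entire two-character sequence had to appear between tokens, the lone space and the lone newline would not split the words.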