
Evolution of a Language

In my days as a student, perceiving the limitations of FORTRAN IV (now known as FORTRAN 66), but hesitant to embrace the alternatives that existed then, such as PL/I and Algol, I started thinking of what a computer language might look like if I designed it. I liked some of the features of APL, and thought they were worth copying.

This is only the description of a language as I thought about it: I did not have the time to embark on implementing it, even in a limited form.

One of the first things I thought of was a way to add data structures to FORTRAN, in a way that didn't change the basic appearance of FORTRAN programs, as consisting of independent statements usually short enough to fit on a single line.

So I thought up something I called the INTERLEAVE statement.

A structure might be defined in C like this:

typedef struct
 { int   serial ;
   int   quantity ;
 }
item ;

typedef struct
 { int   number ;
   char  name[20] ;
   int   visits[3] ;
   item  orders[5] ;
 }
record ;

record employees[10] ;

I thought to use existing FORTRAN constructs, such as the EQUIVALENCE statement, in combination with INTERLEAVE, to allow an analogous structure to be declared like this:

      INTEGER*2    NUMBER(10)
      CHARACTER*20 NAME(10)
      INTEGER*2    VISITS(10,3)
      INTEGER*2    SERIAL(10,5)
      INTEGER*2    QUANT(10,5)
      CHARACTER*48 EMPLS(10)
      INTERLEAVE   ( NUMBER(1), NAME(1), VISITS(1,), ( SERIAL(1,2), QUANT(1,2) ) )
      EQUIVALENCE  ( EMPLS, NUMBER )

The INTERLEAVE statement prescribes that a subscript marked with a number varies only after one has gone through all the variables in a parenthesized list: specifically, the outermost list in which that number appears in an element not enclosed in further parentheses. Thus subscript 1 advances only after a whole record, while subscript 2 advances after each SERIAL and QUANT pair. Of course, to determine the proper length for the variable representing a single record, one would have to be intimately familiar with how alignment is handled, but on the System/360, that was true of typical programmers.
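To make the layout concrete, here is a minimal Python sketch (not part of the language described here) that computes byte offsets under one reading of the INTERLEAVE example. It assumes that INTEGER*2 occupies 2 bytes, CHARACTER*20 occupies 20 bytes, and that no alignment padding is inserted; the function names are hypothetical.

```python
# Hypothetical model of the record layout implied by the INTERLEAVE example.
# Assumptions: INTEGER*2 is 2 bytes, CHARACTER*20 is 20 bytes, no padding.

INT2 = 2        # bytes in an INTEGER*2
NAME_LEN = 20   # bytes in a CHARACTER*20
N_VISITS = 3    # second dimension of VISITS
N_ORDERS = 5    # second dimension of SERIAL and QUANT


def record_layout():
    """Return ({field: byte offset within one record}, record length)."""
    offsets = {}
    pos = 0
    offsets["NUMBER"] = pos
    pos += INT2
    offsets["NAME"] = pos
    pos += NAME_LEN
    offsets["VISITS"] = pos      # VISITS(i,1..3) stored contiguously
    pos += N_VISITS * INT2
    offsets["ORDERS"] = pos      # five (SERIAL, QUANT) pairs
    pos += N_ORDERS * 2 * INT2
    return offsets, pos


def serial_offset(i, j):
    """Byte offset of SERIAL(i,j) (1-based) from the start of EMPLS."""
    offsets, record_length = record_layout()
    return (i - 1) * record_length + offsets["ORDERS"] + (j - 1) * 2 * INT2


def quant_offset(i, j):
    """Byte offset of QUANT(i,j): it directly follows SERIAL(i,j)."""
    return serial_offset(i, j) + INT2
```

Under these assumptions the record length comes out to 48 bytes, which agrees with the CHARACTER*48 declaration of EMPLS.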

That was, of course, a rather shaky start.

Another thing that concerned me was the issue of control structures. I wanted to permit the use of multiple statements as clauses in an IF statement, but again I did not want to stray too profoundly from FORTRAN. However, I did think that FORTRAN could be improved in one area, to make the decoding of statements more uniform.

My idea was to have three basic control statements, that looked like this:

 IF X :LT: Y, 10, 20
 TEST A, 15, 25, 35
 ON I, 100, 200, 300, 400, 500

Originally, I thought to use colons around the relational operators, so that the complicated rules required to deal with periods (also used as decimal points) could be removed from a compiler. The IF statement jumps to line 10 if X really is less than Y, and to line 20 otherwise. The TEST statement tests the numerical value which is its first argument, and jumps to 15 if it is negative, 25 if it is zero, and 35 if it is positive - the classic three-way branch of the Arithmetic IF statement, but no longer called an IF statement, since it does not have as an argument a proposition which is either true or false. And the ON statement, with its name taken from BASIC, is of course a computed GO TO.
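The branching rules of the three statements can be modelled as plain functions that return the line number to jump to. This is only an illustrative Python sketch; the function names are hypothetical, and the behaviour of ON for an out-of-range index (returning None, meaning fall through) is an assumption, since the text does not specify it.

```python
def branch_if(condition, true_line, false_line):
    """IF cond, a, b: go to a if the condition holds, otherwise to b."""
    return true_line if condition else false_line


def branch_test(value, negative_line, zero_line, positive_line):
    """TEST v, a, b, c: three-way branch on the sign of v."""
    if value < 0:
        return negative_line
    if value == 0:
        return zero_line
    return positive_line


def branch_on(index, *lines):
    """ON i, a, b, ...: computed GO TO with a 1-based index.

    Assumption: an out-of-range index returns None (fall through).
    """
    if 1 <= index <= len(lines):
        return lines[index - 1]
    return None
```

For example, branch_test(-3.5, 15, 25, 35) selects line 15, and branch_on(3, 100, 200, 300, 400, 500) selects line 300.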

As for an assigned GO TO, my concept was:

 I = %20
 JUMP I

where %20 stands for the address of the compiled code for line number 20. Thus, JUMP I is compiled as something like

         L     12,I
         BC    15,0(12)

Of course, since some machines might use integers too small to hold an address, and in the interests of somewhat stronger typing, I (much) later thought to add a type ADDRESS, so that the variable I could be declared as ADDRESS rather than INTEGER.

Naturally, in my language, the format of statements would have been free compared to that of FORTRAN. Statement numbers could start in column 1, but the statement keyword itself would have to wait at least until column 2.

In FORTRAN, a comment looked like:

C THIS IS A COMMENT

so if, say, a type declaration like

      COMPLEX*32 DATA(32,32)

were to start in column 1, it could not be told from a comment.

All right, but if statements can't start in column 1, why just have the letter C indicate a comment? Why not allow the whole alphabet? Then the comment could start two characters earlier, saving space. I also allowed an asterisk (*) in column 1 to indicate a comment, so that boxes could be put around comments to emphasize them, or paragraphs could be indented, and so on.

From PL/I and RATFOR, though, I was familiar with other styles of comment. I used two-character combinations starting with an at-sign (@) for these and other related items, such as pragmas and preprocessor directives. In choosing multi-character tokens, I was careful to observe the prefix property, to keep processing of the language trivial.

From APL, and PL/I, I thought of making

      REAL A(6), B(6), C(6)
      A=B*C

equivalent to

      REAL A(6), B(6), C(6)
    7 FOR I=1 TO 6
      A(I)=B(I)*C(I)
      REPEAT 7

One issue this raised, though, is, what if A, B, and C were square two-dimensional arrays? Someone asked me if I would use matrix multiplication in that case. I definitely thought that that was not appropriate, since it attributed a meaning to the array structure that wasn't there. I did have an operator like the one in APL to do matrix multiplication:

     A = B :OVER(+,*): C

but much later, I realized the 'right' way to deal with this issue. Just as variables could be declared COMPLEX, or, in my language, QUATERNION as well, what is needed is a MATRIX type. Variables of MATRIX type would indeed undergo matrix multiplication when linked by the regular multiplication operator. (Then, I suppose, one might also bring in a VECTOR type, with dot and cross product operators.)

Just as one can't have a complex number whose real and imaginary parts are arrays, but one can have an array of complex numbers, one can have an array of matrices, but not a matrix of arrays. But one can have a matrix whose elements are complex or quaternion.
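The difference between the element-by-element product and the :OVER(+,*): operator can be sketched in Python. This is an illustration only, using nested lists for arrays; the names elementwise and over are hypothetical, and over is modelled on APL's generalized inner product.

```python
from functools import reduce
import operator


def elementwise(op, b, c):
    """Apply op to corresponding elements of two equally shaped nested lists."""
    if isinstance(b, list):
        return [elementwise(op, x, y) for x, y in zip(b, c)]
    return op(b, c)


def over(reduce_op, combine_op, b, c):
    """Generalized inner product of two 2-D lists.

    Each row of b is paired with each column of c; the pairs are combined
    with combine_op and the results folded together with reduce_op.
    """
    columns = list(zip(*c))
    return [
        [reduce(reduce_op, [combine_op(x, y) for x, y in zip(row, col)])
         for col in columns]
        for row in b
    ]


B = [[1, 2], [3, 4]]
C = [[5, 6], [7, 8]]

product = elementwise(operator.mul, B, C)           # like A = B*C
matrix = over(operator.add, operator.mul, B, C)     # like A = B :OVER(+,*): C
```

With these inputs, product is [[5, 12], [21, 32]] and matrix is [[19, 22], [43, 50]], the ordinary matrix product.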

And note the form of the loop statement. Unlike in FORTRAN, instead of a loop starting with a statement that points to the end of the loop, the statement that transfers control to the beginning points to the beginning. And it does so by specifically naming the FOR statement to which it refers, instead of merely naming the loop variable, as in BASIC.

We've already seen that my basic IF statement looked like

      IF X :LT: Y, 10, 20

But I agreed that having lots of line numbers and branches was awkward if it could be avoided. So, I allowed the line numbers in an IF, TEST, or ON statement to be replaced by the following items:

an asterisk (*), indicating fall-through to the next statement, or
one or more statements, enclosed in angle brackets.

So, one might get statements looking like this:

      TEST X, <I=-1; GO_TO 20>, *, <I=1; GO_TO 20>

If X is zero, continue onwards; if it is nonzero, branch to line 20, but set a flag first depending on its sign.

Eventually, though, I took away the semicolon for use in placing multiple statements on the same line, and I no longer had the colon available to handle what were called "period words" in FORTRAN.

Also, I decided to switch from using only the characters

0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ
.,:;!?'"()<>+-*/=_$#%&@

which were common to the uppercase-only subsets of both ASCII and EBCDIC, to the full ASCII character set.

Thus, I decided that I could use curly brackets to enclose statements in IF clauses. But unlike in C (and Pascal and Algol), statement brackets allowed one or more statements to replace a line number, instead of multiple statements to replace a single statement. I regarded it as unFORTRANlike to permit a statement to properly be the direct object of another statement.

So I now allowed the use of <, >, and = as relational operators. But because of the prefix property, I refused to allow >= to stand for .GE., so I had to settle for ?GE? or maybe ?GE. as that operator. For exponentiation, ^ would serve if available, and ?PW? if it was not; even FORTRAN's ** was rejected, along with BASIC's >=. C would have horrified me had I known of it then.
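The prefix property means that no token is the beginning of another token, so a scanner never has to look ahead to decide where a token ends. A small Python check, offered as an illustration only (the function name is hypothetical, and the token list is just a sample drawn from the operators mentioned in the text):

```python
def prefix_conflicts(tokens):
    """Return (shorter, longer) pairs where one token is a prefix of another."""
    unique = sorted(set(tokens))
    return [(a, b) for a in unique for b in unique
            if a != b and b.startswith(a)]


# A sample of operator tokens mentioned in the text.
accepted = ["+", "-", "*", "/", "<", ">", "=", "^", "?GE?", "?PW?"]

no_conflicts = prefix_conflicts(accepted)
with_ge = prefix_conflicts(accepted + [">="])    # BASIC-style >=
with_power = prefix_conflicts(accepted + ["**"])  # FORTRAN-style **
```

Here no_conflicts is empty, while with_ge reports the pair (">", ">=") and with_power reports ("*", "**"), which is why both spellings were rejected.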

Multiple statements on a single line now had to be separated by @; instead of ;. After thinking about how to deal with this, I decided to allow line breaks inside the curly brackets of an IF clause.

But then, would that mean an IF statement would look like this:

    IF X ?GE? Y, {
     P=P+1
     Q=5
 },{ P=P-1
     Q=4 }

Clearly, this was ugly and awkward. I allowed @- to indicate that a statement continues to the next line, but having to indicate that explicitly to clean up a default form of the IF statement like this was ridiculous.

So I decided to do what AWK does. If the last non-blank character on a line is a comma, the compiler assumes that there is more to come, so there is an implicit line continuation. (The idea that a tab character would count as whitespace, of course, was far beyond me: if a control character is found in source code, obviously the compiler halts, because the programmer must be trying to compile an object file by mistake.)
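The trailing-comma rule is easy to sketch as a pass that merges physical lines into logical ones. The Python below is an illustration only: the function name is hypothetical, only the space character is treated as blank, and it handles the comma rule alone, not line breaks inside curly brackets or the explicit @- continuation.

```python
def join_continuations(lines):
    """Merge physical lines into logical lines.

    A line whose last non-blank character is a comma is continued by the
    next line. Only spaces count as blanks in this sketch.
    """
    logical = []
    pending = ""
    for line in lines:
        text = line.rstrip(" ")
        # Append the continuation text to the pending line, if there is one.
        pending = pending + " " + text.strip(" ") if pending else text
        if not text.endswith(","):
            logical.append(pending)
            pending = ""
    if pending:
        # The source ended while a continuation was still expected.
        logical.append(pending)
    return logical


source = [
    "      ON I, 100, 200,",
    "         300, 400, 500",
    "      Q=5",
]
statements = join_continuations(source)
```

The three physical lines above become two logical statements: the complete ON statement and the assignment that follows it.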

Now, the IF statement can be much cleaner.

    IF X ?GE? Y,
     { P=P+1
       Q=5
     },
     { P=P-1
       Q=4
     }

and even in the earlier case, I didn't worry about line breaks creating blank lines as statements. Of course the compiler would have to be able to ignore null statements.

I didn't like the idea of ending every statement with a semicolon. Also, I didn't like using := for assignment, either. But if I used = as a relational operator too, how would I avoid ambiguity? And what about multiple assignment statements?

I opted for the following solution:

  LVALID/LMATCH=I=J/K

means, unambiguously (well, to the compiler!): set LVALID and LMATCH to _T (true) if I is equal to J divided by K, and to _F (false) otherwise.

In this way, an assignment is syntactically equivalent to an expression, and thus assignments can be used as arguments to functions and subroutines without complicating lexical analysis of the program.

The critical part of an assignment statement is the first = sign in it. To the left of it, / separates items in a list of assignment targets. To the right, / means division, and = is now a relational operator that tests for equality. I thought that was a simple enough rule.
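The rule can be sketched as a split at the first = sign. This Python fragment is an illustration only; the function name is hypothetical, and it deliberately ignores complications such as a / inside a subscript on the left-hand side.

```python
def split_assignment(statement):
    """Split an assignment at its first '=' sign.

    Left of the first '=', '/' separates assignment targets. Everything to
    the right is an ordinary expression, in which '/' is division and '='
    tests equality. Subscripted targets containing '/' are not handled.
    """
    left, equals, right = statement.partition("=")
    if not equals:
        raise ValueError("no '=' sign: not an assignment")
    targets = [target.strip() for target in left.split("/")]
    return targets, right.strip()


targets, expression = split_assignment("LVALID/LMATCH=I=J/K")
```

For the statement above, targets is ["LVALID", "LMATCH"] and expression is "I=J/K", to be evaluated as an equality test.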

In FORTRAN, a WRITE statement could look like this:

      WRITE (6,11,ERR=99) I, J, K

My concern with making an assignment look like an expression grew out of a desire, partly due to exposure to FORTH, to make the language extensible. I was already trying to make the syntax of the language simple and uniform. Almost every statement had the form

    keyword argument-list

where the argument-list at least resembled what could go between the parentheses of a subroutine call.

But the use of parentheses to enclose the device number, FORMAT statement number, and other clauses in a WRITE statement meant that one couldn't put expressions in the argument list of a WRITE statement. This was an unnecessary inconvenience compared to BASIC's PRINT statement.

Finally, I can say what I did with the colon (:) and semicolon (;).

I decided that, to allow argument lists to be expressive enough to support complex statement types, these characters would function as alternative argument separators in addition to the comma (,). The colon would have higher 'priority', and the semicolon lower 'priority', than the comma if one thinks of them analogously to operators. (If one thinks of them as operators, of course, they all have a priority vastly lower than that of any operator.)

Thus, normally, the elements in an argument list are separated by commas.

If one argument has an optional argument that follows it, and is closely associated with it, that argument would be separated from it by a colon (:). If the argument list is divided into major sections, and the number of arguments in each section separated by commas is variable, then the sections would be separated by semicolons (;).
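The three separator levels nest naturally: semicolons split the list into sections, commas split a section into arguments, and colons attach optional parts to an argument. A minimal Python sketch of that nesting follows; the function name and the sample argument list are hypothetical, and separators inside parentheses or brackets are not handled.

```python
def parse_arguments(argument_list):
    """Split an argument list by ';', then ',', then ':'.

    Returns a list of sections; each section is a list of arguments; each
    argument is a list holding the main item followed by any optional
    items attached to it with ':'. Nested parentheses are not handled.
    """
    return [
        [[part.strip() for part in argument.split(":")]
         for argument in section.split(",")]
        for section in argument_list.split(";")
    ]


# A made-up argument list, used only to show the nesting.
parsed = parse_arguments("A:1, B; C, D:2:3")
```

The sample yields two sections: the first holds A with optional item 1, then B; the second holds C, then D with optional items 2 and 3.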

As for the output statement, I finally decided that optional arguments which preceded other arguments would be indicated with square brackets, so I decided that an output statement in my language would look like:

 PRINT [6,11,END=99] A*B+3.0, R(I+J(II),K+2)

and yes, of course arbitrary expressions are also permitted in subscripts.

One reason why I opted for the simple keyword/argument-list construction for statements is that I wanted to avoid reserved words completely in the language.

Thus, the words then and else did not appear in an IF statement.

Also, instead of using SIN and COS as the names of built-in functions, I went with _SIN and _COS. For user variable and subroutine names (the technical term is identifiers), the rule was that the first character could be a letter or a $, and each subsequent character could be a letter, a $, a digit, or the underscore (_). Predefined identifiers, by contrast, began with an underscore, which was not permitted as the first character of a user identifier. Thus the use of _T and _F for true and false.
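The identifier rule translates directly into a pair of regular expressions. The Python below is a sketch under assumptions: it accepts both upper- and lower-case letters, and it assumes that a predefined identifier is an underscore followed by at least one further identifier character, which the text does not spell out. The names are hypothetical.

```python
import re

# User identifiers: a letter or '$' first, then letters, '$', digits, or '_'.
USER_IDENTIFIER = re.compile(r"[A-Za-z$][A-Za-z$0-9_]*")

# Assumption: predefined identifiers are '_' plus one or more such characters.
PREDEFINED_IDENTIFIER = re.compile(r"_[A-Za-z$0-9_]+")


def is_user_identifier(name):
    """True if name is a legal user-chosen identifier."""
    return USER_IDENTIFIER.fullmatch(name) is not None


def is_predefined_identifier(name):
    """True if name has the form reserved for predefined identifiers."""
    return PREDEFINED_IDENTIFIER.fullmatch(name) is not None
```

Because the two patterns differ in their first character, no user identifier can ever collide with a predefined name such as _SIN or _T.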

With no reserved words, a compiler could accept ?MQ? (standing for moins que) as the equivalent of ?LT? (standing for LESS THAN), or ALLEZ_AU as the equivalent of GO_TO (not GO TO, as in FORTRAN) without causing any problems to a programmer who happened to be unaware of the existence of that feature of the compiler.

I visualized standard support for about a dozen languages, more if a character set supporting Greek and Cyrillic characters, or even Arabic, Hebrew, Armenian, and Georgian, was available. With Unicode, the technology has finally caught up with such a concept. While the reference version of the language might have been the one with English-language keywords, I thought of one other alternative. No, it wasn't Esperanto.

Latin.

(The derisive laughter may start now.)

But having absolutely no reserved words meant that I finally did have to accept one extraneous character, after refusing to accept the semicolon (;) at the end of every statement (Algol, Pascal, PL/I, C, and many others) or the use of := for assignment or == for equality.

The assignment statement now became either

  LET A=5*B+C

as in BASIC, or, using the period (.) as an abbreviation for LET,

  . A=5*B+C

And, since the question mark (?) is a visually obtrusive character, as well as requiring the use of the SHIFT key, I decided to adopt some rules giving spaces limited significance in the language: no spaces within identifiers, or within the core of a numeric constant (the digits and the decimal point), and one space required after named operators.

Thus, ?NE? became just &NE and so on.

Later on, I explored ways of bringing in ideas from SNOBOL and LISP, as well as trying to make clear the concept of multiple threads as provided in Ada and Modula-2. The set of built-in functions included pretty well all the special functions that were well-known and were reasonably practical to calculate. The user could create types with operator and function overloading, but this did not include the full breadth of object-oriented programming.

But I hadn't gotten around to defining how an 'algebraic expression' data type would work, or how to deal with the world of GUI programming.

The full description of the language is now available here.
