PhillyScript: Writing my own programming language

The opinions stated here are my own, not those of my company.

A little while ago, I came across a blog post on FreeCodeCamp about how the author wrote their own programming language. In my time I’ve used a wide variety of programming languages, and the choices made for different syntaxes have been interesting. It led me to wonder what choice I would make.

Defining the syntax can be the easiest part of creating a language, because you then need to write the tooling to make it work. After reading a second blog post, I took the route of transpiling my language to JavaScript. That way, I wouldn’t need to translate it all the way down to machine code, and I could spend the bulk of my time rapidly prototyping.

Philadelphia skyline

I’ve named it PhillyScript, since I grew up in the greater Philadelphia region.

As I started playing around with syntax choices over the weekend, I found myself gravitating towards two main purposes: reducing boilerplate and supporting more mathematical syntax.

In this post I’ll introduce the language and some of the choices I made, as well as how I started out.

How to create your own language

To create a programming language, you’ll need three main components: a lexer, a parser, and a transpiler.

The lexer is the component that reads your file from top to bottom, trying to make sense of the syntax and throwing an error if it can’t understand something. With the Lex NPM library I could define my language’s syntax with regular expressions.

const grammar = new Lexer((char: string) => {
  throw new Error(`Unexpected character "${char}" at row/col ${gRow}:${gCol}`)
}) as Lex
grammar
  .addRule(R.classExtension, (lexeme: string, ...ops: Op[]) => {
    return ['CLASS_EXTENSION', ...ops]
  })
  .addRule(R.classDeclaration, (lexeme: string, ...ops: Op[]) => {
    return ['CLASS', ...ops]
  })
  .addRule(R.classInstantiation, (lexeme: string, ...ops: Op[]) => {
    return ['CLASS_INSTANTIATION', ...ops]
  })
  // ...

What this gives me is an array of every token and operand. At the very end of the array I should have the token ‘EOF’, end-of-file. If not, that means the entire file was not parsed, likely due to a syntax error (or my own regex mistake).

['CLASS', 'Bird', 'CLOSE_CURLY', 'NEWLINE', 'NEWLINE', 'CLASS_EXTENSION', 'Duck', 'Bird', 'CLOSE_CURLY', ..., 'EOF']

You can see in my example that I’ve got this mix of token types like ‘CLASS’ and class names like ‘Bird’. This array is sent to the parser.
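To make the lexing step concrete without leaning on the Lex library's API, here is a minimal hand-rolled sketch. The rule patterns, the `lexSketch` name, and the sample input are assumptions for illustration; they are not the regular expressions PhillyScript actually uses.

```typescript
type Op = string;

// Each hypothetical rule pairs an anchored regex with a function that turns
// the match into a token followed by its operands.
interface LexRule {
  pattern: RegExp;
  emit: (match: RegExpExecArray) => Op[];
}

const rules: LexRule[] = [
  // The extension rule is tried first so "boul Duck <- Bird {" is not
  // mistaken for a plain class declaration.
  { pattern: /^boul (\w+) <- (\w+) \{/, emit: (m) => ['CLASS_EXTENSION', m[1], m[2]] },
  { pattern: /^boul (\w+) \{/, emit: (m) => ['CLASS', m[1]] },
  { pattern: /^\}/, emit: () => ['CLOSE_CURLY'] },
  { pattern: /^\n/, emit: () => ['NEWLINE'] },
];

const lexSketch = (input: string): Op[] => {
  const tokens: Op[] = [];
  let rest = input;
  while (rest.length > 0) {
    let matched = false;
    for (const rule of rules) {
      const m = rule.pattern.exec(rest);
      if (m) {
        tokens.push(...rule.emit(m));
        rest = rest.slice(m[0].length); // drop the text we just recognised
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new Error(`Unexpected character "${rest[0]}"`);
    }
  }
  tokens.push('EOF'); // only reached when the whole input was understood
  return tokens;
};

const sampleTokens = lexSketch('boul Bird {}\n\nboul Duck <- Bird {}');
```

Because `EOF` is only appended once every character has matched a rule, a missing `EOF` (or a thrown error) signals that part of the file was not understood.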

The parser converts this array into an Abstract Syntax Tree (AST). This provides a more thorough representation of each command. Based on each token, I define an operation to iterate through this array and convert it into an object representation.

export const parse = (tokens: Op[]) => {
  // c is the index of the next token to read
  let c = 0;
  const peek = () => tokens[c];
  const consume = () => tokens[c++];
  const ast: AstLeaf[] = []
  const parsers: Parsers = {
    EOF: () => {},
    CLASS: () => {
      ast.push({
        type: 'CLASS',
        val: consume(),
      })
    },
    CLASS_EXTENSION: () => {
      ast.push({
        type: 'CLASS_EXTENSION',
        var: consume(),
        val: consume(),
      })
    },
    CLASS_INSTANTIATION: () => {
      ast.push({
        type: 'CLASS_INSTANTIATION',
        var: consume(),
        val: consume(),
      })
    },
    // ...
  }
  while (peek() !== 'EOF') {
    try {
      parsers[consume() as Token]()
    } catch (e) {
      throw new Error(`Cannot parse next token "${peek()}" at index ${c} for ${tokens.join(', ')}`)
    }
  }
  return ast
};

As you can see, the parser will continue to execute until it hits this EOF, so it is imperative that the lexer works completely before we get to this step. For tokens that have multiple operands, it will consume them and advance the current array index so that by the end we should arrive at the end of the file. The number of consumptions needs to match the number of operands exactly, otherwise you’ll hit an off-by-one problem and get a parsing error.
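One way to make that off-by-one risk harder to hit is to declare each token's operand count in a single table and consume generically. This is only a sketch of that idea, not how PhillyScript's parser is written; the `arity` table, `parseWithArity`, and the `operands` field are hypothetical names.

```typescript
type Op = string;

interface SketchLeaf {
  type: string;
  operands: Op[];
}

// Hypothetical table: how many operands follow each token in the array.
const arity: Record<string, number> = {
  CLASS: 1,
  CLASS_EXTENSION: 2,
  CLASS_INSTANTIATION: 2,
  CLOSE_CURLY: 0,
  NEWLINE: 0,
};

const parseWithArity = (tokens: Op[]): SketchLeaf[] => {
  const ast: SketchLeaf[] = [];
  let c = 0;
  while (c < tokens.length && tokens[c] !== 'EOF') {
    const index = c;
    const type = tokens[c++];
    if (!(type in arity)) {
      // An unknown token here usually means an earlier rule consumed
      // too few or too many operands.
      throw new Error(`Unknown token "${type}" at index ${index}`);
    }
    const count = arity[type];
    if (c + count > tokens.length) {
      throw new Error(`Token "${type}" at index ${index} is missing operands`);
    }
    ast.push({ type, operands: tokens.slice(c, c + count) });
    c += count; // advance past exactly the operands we declared
  }
  if (tokens[c] !== 'EOF') {
    throw new Error('Token array does not end with EOF');
  }
  return ast;
};

const sketchAst = parseWithArity([
  'CLASS', 'Bird', 'CLOSE_CURLY', 'NEWLINE',
  'CLASS_EXTENSION', 'Duck', 'Bird', 'CLOSE_CURLY', 'EOF',
]);
```

The trade-off is that every leaf carries a generic `operands` array instead of named fields such as `var` and `val`, so the transpiler would have to know which position means what.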

[
  { type: 'CLASS', val: 'Bird' },
  { type: 'CLOSE_CURLY' },
  { type: 'NEWLINE', val: '\n' },
  { type: 'NEWLINE', val: '\n' },
  { type: 'CLASS_EXTENSION', var: 'Duck', val: 'Bird' },
  { type: 'CLOSE_CURLY' },
  { type: 'NEWLINE', val: '\n' },
  // ...
]

An example AST, shown above, demonstrates what this looks like, providing a more precise definition of each step of my program. From here, I now have enough information that I can compile it, or transpile it, to my desired language.

The transpiler is the last step, and it converts this AST into code. I’m targeting JavaScript here, which allows me to actually run the result. Otherwise I would have to either compile it to machine code or build my own interpreter to execute it. Both are onerous, so transpiling is the most accessible option.

For each part of the AST, I expand that into executable code. This ends up being a large string, which can then be sent to a file or printed in the console. As we’ve already gone through the lex and parse steps, there is a high degree of certainty that the input should be valid, and the output should also be valid.

export const transpile = (ast: AstLeaf[]) => {
  let transpilation = ''
  const transpilers: Transpilers = {
    NEWLINE: () => {
      return '\n'
    },
    CLASS: (leaf: AstLeaf) => {
      return `class ${leaf.val} {`
    },
    CLASS_EXTENSION: (leaf: AstLeaf) => {
      return `class ${leaf.var} extends ${leaf.val} {`
    },
    CLASS_INSTANTIATION: (leaf: AstLeaf) => {
      return `const ${leaf.var} = new ${leaf.val}()`
    },
    // ...
  }
  ast.forEach(leaf => {
    transpilation += transpilers[leaf.type](leaf)
  })
  return transpilation
};

So running transpile(parse(lex(input))) should result in our final code:

class Bird {}
class Duck extends Bird {}

It seems overly complicated for something simple. However, now that we’ve defined each step, we can start adding in some new features.

PhillyScript: Reducing boilerplate

Boilerplate can be annoying to write. If I can leverage specific syntax to reduce the amount of code I’m writing, I’ll be better off.

I do recognize this can result in a higher learning curve. I imagine language developers do have to work on these trade-offs to make something broadly useful for the public, but I’m not going to worry about that here.

boul Bird {}
boul Duck <- Bird {}
jawn x := 1
jawn* y := 2
jawn z @ Bird

Variable declarations are now called jawn. Adding an asterisk after it makes the variable mutable, as immutability is the default. I use the := syntax for assignment to distinguish it from the equality comparison ==, since the two are easy to confuse.

Classes are called boul, and the arrow syntax <- states that the class extends from a given class.

Rather than writing out that we’re creating a new class object with no constructor arguments, we can state that a given variable is an instance of a given class with the @ syntax. That replaces const z = new Bird(), most of which is superfluous.
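As a rough illustration of how these declarations could map onto JavaScript, here is a line-by-line rewrite sketch. It assumes immutable declarations become `const` and mutable ones become `let`; the post only confirms the `const ... = new ...()` form, and `transpileJawn` and its regular expressions are hypothetical.

```typescript
// Rewrites a single PhillyScript declaration into JavaScript (sketch only).
const transpileJawn = (line: string): string => {
  // jawn z @ Bird  ->  const z = new Bird()
  const instance = /^jawn (\w+) @ (\w+)$/.exec(line);
  if (instance) {
    return `const ${instance[1]} = new ${instance[2]}()`;
  }
  // jawn x := 1   ->  const x = 1   (immutable by default)
  // jawn* y := 2  ->  let y = 2     (asterisk marks it mutable; assumed mapping)
  const assignment = /^jawn(\*?) (\w+) := (.+)$/.exec(line);
  if (assignment) {
    const keyword = assignment[1] === '*' ? 'let' : 'const';
    return `${keyword} ${assignment[2]} = ${assignment[3]}`;
  }
  // Anything else is passed through untouched.
  return line;
};

const declarations = ['jawn x := 1', 'jawn* y := 2', 'jawn z @ Bird'].map(transpileJawn);
// declarations is ['const x = 1', 'let y = 2', 'const z = new Bird()']
```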

Async/Await

The async and await keywords are great additions to JavaScript, but writing them out can be simplified using the hash syntax.

fun# asyncFunction {
  jawn x := #asyncOperation()
  return x
}

Function declarations are now reduced to fun. By appending a hash, we can define a function as asynchronous. As this function has no parameters, we don’t need to add parentheses. By prepending a hash to a function call, we mark it as awaited.
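The hash rules can be pictured as plain text substitutions. The sketch below is an assumption about the mapping rather than the actual transpiler: `transpileAsync` is a hypothetical name, it only handles parameterless functions, and it leaves the jawn declaration untouched.

```typescript
const transpileAsync = (source: string): string =>
  source
    // fun# name {  ->  async function name() {
    .replace(/^fun# (\w+) \{/gm, 'async function $1() {')
    // fun name {   ->  function name() {
    .replace(/^fun (\w+) \{/gm, 'function $1() {')
    // #call(       ->  await call(
    .replace(/#(\w+)\(/g, 'await $1(');

const asyncSource = [
  'fun# asyncFunction {',
  '  jawn x := #asyncOperation()',
  '  return x',
  '}',
].join('\n');

const asyncOutput = transpileAsync(asyncSource);
// First line becomes:  async function asyncFunction() {
// Second line becomes:   jawn x := await asyncOperation()
```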

Conditionals

Another area of boilerplate is in conditionals. If I’m checking a variable’s value twice I need to write out two entirely separate expressions. If I want to know if a number is between 0 and 10, I write: x > 0 && x < 10 . I can simplify this:

jawn x := 2
jawn y := 6
if (x == 2, 4) {
  // This is true
}
if (y > 0, < 10) {
  // This is also true
}

With the comma syntax, I can create two expressions for a single variable with one or multiple comparison operators.
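Written out longhand, the two conditions above would presumably expand as follows. The first example is marked true for x = 2, so the comma has to act as "or" between equality checks, while the between check described earlier is an "and". Treat the exact expansion as my reading of the examples, not a specification.

```typescript
// Annotated as number so the compiler does not narrow them to literal types.
const x: number = 2;
const y: number = 6;

// if (x == 2, 4)  ->  x equals 2 or x equals 4
const equalsEither = x === 2 || x === 4;

// if (y > 0, < 10)  ->  y is greater than 0 and less than 10
const isBetween = y > 0 && y < 10;
```

Both `equalsEither` and `isBetween` evaluate to true for these values, matching the comments in the PhillyScript example.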

PhillyScript: Mathematics

In college I took a number of math classes and used a few esoteric programming languages that had different audiences. The syntax used in writing proofs is very different from, and often conflicts with, the syntax of general-purpose programming. So I wondered if I could design syntax that would work for more mathematical needs.

Estimations

Is 1.8 the same as 2.2? No, but they’re pretty close. In proofs you would say they’re approximately equal, but most programming languages don’t really have this operator. I’ve added the ≈ syntax, or a tilde (~), to express estimation.

jawn x := 5
if (x ≈ 6) {
  // x is not close to 6
}

However, you can also specify an operand to compare it to:

jawn y := 5
if (y ~10~ 6) {
  // y is close to 6, relative to 10
}

Magnitude comparisons

Likewise, magnitude comparisons are used in mathematics for when one value is much larger than another. By default this would be true if one is an order of magnitude greater, but we can specify this magnitude in the syntax.

printl(101 >> 10) // Prints 'true'
printl(60 <4< 20) // Prints 'false'

101 is much greater than 10, as it is more than 10 times larger. However, 60 is not smaller than 20 by a factor of 4, so the second comparison is false.
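In plain TypeScript, these comparisons could be expressed with two small helpers. The names `muchGreater` and `muchLess` are hypothetical, and the strict inequality and the lack of any special handling for negative numbers are assumptions on my part.

```typescript
// a >> b, or a >n> b with an explicit factor n (default: one order of magnitude)
const muchGreater = (a: number, b: number, factor = 10): boolean => a > b * factor;

// a << b, or a <n< b with an explicit factor n
const muchLess = (a: number, b: number, factor = 10): boolean => a * factor < b;

const first = muchGreater(101, 10); // 101 >> 10: true, since 101 > 100
const second = muchLess(60, 20, 4); // 60 <4< 20: false, since 240 is not below 20
```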

Division-Remainder

Early on, when learning long division, we are taught that there is a quotient and a remainder as two separate values. Yet most programming languages have no syntax that produces both at once. The division-remainder operator produces an array with both values.

jawn x := 14
printl(x /% 4) // Prints '[3, 2]'
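For comparison, getting the same pair in plain TypeScript takes a small helper. `divRem` is a hypothetical name, and truncating toward zero is an assumption about how negative values would behave.

```typescript
// Returns [quotient, remainder] for integer-style division.
const divRem = (dividend: number, divisor: number): [number, number] => [
  Math.trunc(dividend / divisor),
  dividend % divisor,
];

const pair = divRem(14, 4); // [3, 2]
```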

Factorial

Another operator that doesn’t appear in most programming languages is factorial, the product of all integers from 1 to the given value. PhillyScript introduces it as a postfix operator.

printl(5!) // Prints '120'
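The postfix operator could plausibly expand to a call to a helper like the one below. This is a sketch under that assumption; `factorial` and its input validation are mine, not taken from the PhillyScript source.

```typescript
const factorial = (n: number): number => {
  if (!Number.isInteger(n) || n < 0) {
    throw new RangeError('factorial expects a non-negative integer');
  }
  let product = 1;
  for (let i = 2; i <= n; i++) {
    product *= i; // multiply by every integer from 2 up to n
  }
  return product;
};

const fiveFactorial = factorial(5); // 120
```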

Range selection

Range selection is a great Python feature, making it much easier to select values from an array or string. We can use this in PhillyScript as well.

jawn x := 'Hello World'
printl(x[:5]) // Prints 'Hello'
printl(x[6:7]) // Prints 'W'
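JavaScript's built-in `slice` already takes a start and an exclusive end, so the bracket syntax can be read as shorthand for it. The `selectRange` wrapper is a hypothetical name used only to show the omitted-start case.

```typescript
// x[start:end] with either bound optional, expressed through String.prototype.slice.
const selectRange = (value: string, start = 0, end = value.length): string =>
  value.slice(start, end);

const greeting = 'Hello World';
const firstWord = selectRange(greeting, undefined, 5); // x[:5]  gives 'Hello'
const oneLetter = selectRange(greeting, 6, 7);         // x[6:7] gives 'W'
```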

Range Loops

Similar syntax can be used for defining for-loops, with an optional step modifier.

for (jawn i = 0:3) {
  // Prints '0', '1', '2'
  console.log(i)
}
for (jawn i = 0:5:20) {
  // Prints '0', '5', '10', '15'
  console.log(i)
}

This syntax is much simpler than having to write out the longer for (let i = 0; i < 3; i++) as much of this is superfluous. You don’t need to specify the variable name three times.
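To spell out the values each range visits, the helper below collects them into an array. Note that in the three-part form the step sits in the middle (start:step:end), judging from the example output; `rangeValues` is a hypothetical name and the positive-step check is an assumption.

```typescript
// Values visited by `start:end` (step 1) or `start:step:end`; the end is exclusive.
const rangeValues = (start: number, end: number, step = 1): number[] => {
  if (step <= 0) {
    throw new RangeError('step must be positive');
  }
  const values: number[] = [];
  for (let i = start; i < end; i += step) {
    values.push(i);
  }
  return values;
};

const simpleRange = rangeValues(0, 3);     // 0:3    gives [0, 1, 2]
const steppedRange = rangeValues(0, 20, 5); // 0:5:20 gives [0, 5, 10, 15]
```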

Array arithmetic

Another great Matlab feature is array arithmetic, using a dot to denote that an operator should be applied to every element of an array.

jawn c := [0, 1, 2, 3]
// Prints '[2, 3, 4, 5]'
console.log(c .+ 2)
// Prints '[0, 5, 10, 15]'
console.log(c .* 5)

This is much cleaner than having to specify a custom map function every time.
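For reference, this is roughly the map-based code the dot operators stand in for. Whether the transpiler emits exactly this form is an assumption.

```typescript
const c = [0, 1, 2, 3];

// c .+ 2
const plusTwo = c.map((value) => value + 2); // [2, 3, 4, 5]

// c .* 5
const timesFive = c.map((value) => value * 5); // [0, 5, 10, 15]
```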

Try out the language

The transpiler and sample code are available on GitHub, and I’ve put together a language guide as well with more detail and additional features.

Overall this was a good experience playing with programming and wondering if there are ways to make the developer experience better.

As for the goal of reducing boilerplate, my sample file is 807 bytes compared to 2070 bytes for the transpiled JavaScript, roughly 60% smaller. There are fewer characters because the syntax is simpler, with fewer keywords and more symbols.

While some of these features can be done through custom functions and libraries, they can never produce syntax that is this straightforward. Array arithmetic would need functions or use of the map method and would feel a bit forced.

The opportunity to really play with syntax lets you test whether a syntax that seems better actually is better. That can lead to feature requests for language developers and improve the ecosystem without having to get too deep into compilers.

As one last piece of advice, don’t use this in production. I’ve had a fun time, but I will not be spending much time maintaining it. Languages require a large ecosystem to get right, and a mature one gives you a lot of guarantees and tooling support. PhillyScript has none of this, and I don’t plan to add it.

There are no libraries, no linters, no code highlighting, and nothing else that you would expect. The regular expressions I’ve written are rather brittle, and stuff may not work.

However, a simple project like this can let me contribute to a real language one day, and that’s where the real benefit comes in.

Social Media Expert -- Rowan University 2017 -- IoT & Assistant @ Google