A JSON Parser

Jun 13, 2015

I recently built a JSON parser in JavaScript from scratch. I learned a lot on data formats/parse trees/grammar/recursion that helped me break down and solve the problem. There are many different (and perhaps better) approaches to parsing JSON, but here’s a little walkthrough of my implementation. First, a little background…

JSON

JSON (JavaScript Object Notation) is a data-interchange format commonly used over the web to transmit data.

What JSON is is basically a set of rules on how to format text to represent JS objects and arrays so that complex data can be transmitted using strings of characters. An array in JS such as [9, 'cat'] would be represented in JSON by this text [9,“cat”].

var a = [9, 'cat'];
var json = JSON.stringify(a); // json = "[9,"cat"]"

Even though JSON has the word JavaScript in it, it is pretty much universal (a lot of APIs will provide data in JSON format). Almost all programming languages have built in functions that can convert between JSON and useable data types (For example, Objective-C has a method called NSJSONSerialization that can parse JSON into NSArray or NSDictionary).

Grammar

JSON’s grammar defines the correct format for JSON text. These are the rules we need to follow to properly parse the JSON text. If the text does not follow the correct grammar, then the parser should throw an error message to indicate that the JSON text is unparseable.

JSON grammar can be (partially) defined like this:

<JSON>     ::= <value>
<value>    ::= <object> | <array> | <boolean> | <string> | <number> | <null>
<array>    ::= "[" [<value>] {"," <value>}* "]"
<object>   ::= "{" [<property>] {"," <property>}* "}"
<property> ::= <string> ":" <value>

This is BNF (Backus-Naur Form) notation to describe context-free grammars. I define the grammar for <JSON> to be a <value>. And on the next line, I define <value> to be any of the 6 data types - object, array, boolean, string, number, or null.

On the third line, I define an <array> as text that begins with an opening bracket, followed by <value>'s separated by commas, ending with a closing bracket.

On the fourth line, I define an <object> in a similar way to arrays, except it contains a series of <property>’s separated by commas. A <property> is defined on the last line as a <string> (the property name) and <value> (property value), separated by a colon.

Quick note on the notation:

The square brackets within quotation marks indicate that the JSON text should actually have the square bracket characters. The square brackets without quotation marks denote optional text.
Similarly, the curly braces within quotation marks denote curly brace characters. The curly braces without quotation marks, followed by asterisks, denote text that repeats 0 or more times.

Note the use of recursion in the grammar definition. Each element of an <array> (which is one possible branch of <value>) is another <value>, which can itself be an <array> or <object> or anything defined by the <value> grammar rules.

The grammar for booleans, strings, numbers, and null can also be further defined using BNF notation, but these are primitive values, so I’d rather just describe them in words below. For this purpose, we can think of these primitive data types as the terminals (that’s where the recursions end).

Recursive Descent Parser

A parse tree, or expression tree, is the result of parsing the text into components that fit the grammar. In our example, given the JSON text [9,“cat”], the parse tree would look something like this:

Parse Tree

A Recursive Descent Parser, one of the simplest parsers, uses a top-down parsing approach, meaning we look at the highest level (the array in this case), and then work downwards (or rightwards in my graph) to parse its elements.

Creating this parse tree involves checking the grammar of all possible <values> each time a <value> to determine which one of them fits. Note that the <number> and <string> are the terminals of the parse tree. Once we have created the parse tree, we can create a JS array by pushing the number (9) and string (“cat”) into a newly created empty array, thereby parsing the JSON text into a JavaScript array.

The parser will work by looking at each character of the JSON text left to right. Here are some useful variables/functions to start our parser:

var at, // current index of JSON text
    ch; // character at current index

var next = function() {
  // increments at
  // updates ch
  at += 1;
  ch = json.charAt(at); // json is the JSON text passed into our parser
  return ch;
};

var error = function(message) { // throw error for bad syntax
  console.log(message);
  throw undefined;
};

We’ll use next to traverse our json text and error to throw errors for bad syntax.

Parsing a Value

Based on our grammar definition, <JSON> is just a <value>, which can be one of six possible data types. Each data type can be identified by the first character of the <value> text, so no backtracking is required (this is not always true depending on the grammar, but it is for JSON). Here is a function that takes ch, the character at the current index, and uses a switch statement to determine which function to call to parse the following text.

var value = function () {
  switch(ch) {
    case '{':
      return object();
    case '[':
      return array();
    case '\"':
      return string();
    case 't':
    case 'f':
      return bool();
    case 'n':
      return nully();
    default:
      if(ch === '-' || (ch && ch >= 0 && ch <= 9)) { // number
        return number();
      } else {
        error('bad JSON');
      }
      break;
  }
};

Now we just need to implement the functions corresponding to each data type in order to fully parse the JSON.

Recall how our grammar stated that objects and arrays also have <value>’s within their JSON syntax. This is the recursive aspect of our parser. Our parser will call value from within object and array based on that syntax. The recursion ends once a primitive value is reached (recall that they are the terminals of our parse tree).

To make each function consistent, each function will be called with ch starting at the first character of that value (“[” for an array, for example). It should traverse along the json until it reaches the closing character based on JSON grammar (“]” for arrays). It should then parse the JSON text between the starting and closing character, returning the corresponding JS data (an object, array, etc.), and then set at to the next index (i.e. the first character of the next <value>).

Here are the implementations of each possible data type:

Null

The function nully starts at the current character ch and checks if it and the next three characters correspond to “n”, “u”, “l”, and “l”. If so, it returns null, otherwise, the <value> does not follow proper JSON syntax since there is no JSON value that starts with n other than JSON null (strings start with quotation marks, not letters). The _.times function is from Underscore.js and simply runs the function n times.

var nully = function() {
  // ch is at 'n', verify and return null
  var nully = '';
  if(ch === 'n') {
    _.times(4, function() {
      nully += ch;
      next();
    });
    if(nully === 'null') {
      return null;
    } else {
      error('bad null');
    }
  }

  error('bad null');
};

Booleans

bool is set up similar to nully to check for characters that match the words true or false.

var bool = function() {
  // ch is at 't' of 'f', verify & return the boolean
  var bool = '';
  if(ch === 't') {
    _.times(4, function() {
      bool += ch;
      next();
    });
    if(bool === 'true') {
      return true;
    } else {
      error('bad bool');
    }
  } else if(ch === 'f') {
    _.times(5, function() {
      bool += ch;
      next();
    });
    if(bool === 'false') {
      return false;
    } else {
      error('bad bool');
    }
  }

  error('bad bool');
};

Numbers

The number function is a little more complicated because of the possible formats of a number. A JSON number consists of the following:

Optional negative sign
One or more consecutive digits 0-9 (0 can only be the first digit if followed by decimal point)
Optional decimal point
One or more consecutive digits 0-9 after decimal point
“e” or “E” to signify exponents
Positive or negative sign for the exponent
Consecutive digits 0-9 (only integers allowed here)

The number function constructs a string based on these components and then calls Number on it to convert it into a proper JS number. If the conversion fails, then the JSON number format was bad.

var number = function() {
  // ch is at negative sign '-' or digit 0-9, create & return the number
  var number = ''; // create string and then use Number() to convert
  function getDigits() { // collect consecutive digits until non-digit is reached
    while(ch && ch >= 0 && ch <= 9) { // need to avoid empty strings
      number += ch;
      next();
    }
  }

  // optional - get neg sign
  if(ch === '-') {
    number += ch;
    next();
  }

  getDigits();

  // optional - get decimal point
  if(ch === '.') {
    number += ch;
    next();
    getDigits();
  }

  // optional - get exponential
  if(ch === 'e' || ch === 'E') {
    number += ch;
    next();
    // required - get sign of exponent
    if(ch === '-' || ch === '+') {
      number += ch;
      next();
    }
    getDigits(); // exponent
  }

  if(!isNaN(Number(number))) { // check if string can be converted to number
    return Number(number);
  } else { // string could not be converted to number
    error('bad number');
  }
};

Strings

If you declare a string in JS var a = "backslash\\", a will hold the string backslash\ since \ is used as an escape. But the JSON string of the variable a is actually “backslash\\”. Same goes for the escaped characters \b, \n, \t, \f, and \". So for strings, we need to pay attention to the escape character “\”, so that if we encounter backslash\\ in a JSON string, we know it’s a string with only one backslash character instead of two.

var escapes = { // helper variable
  'b': '\b',
  'n': '\n',
  't': '\t',
  'r': '\r',
  'f': '\f',
  '\"': '\"',
  '\\': '\\'
};

var string = function() {
  // ch is at opening quote, create & return the string
  var string = '';
  if(ch !== '\"') error('string should start with \"');
  next();
  while(ch) {
    // watch for end of string
    if(ch === '\"') {
      next();
      return string;
    }

    // watch for escapes
    if(ch === '\\') {
      next();
      if(escapes.hasOwnProperty(ch)) {
        string += escapes[ch];
      } else {
        // if not a proper escape code, ignore escape and just add char
        // NOTE: this should never be called if proper stringified JSON provided
        string += ch;
      }
    } else {
      // anything other than \ and " => just add character to string
      string += ch;
    }
    next();
  }
  // reached end without closing quote => error
  error('bad string');
};

Arrays

With the primitives out of the way. We only have the collections (arrays and objects) remaining. The array function actually leverages the value function to populate all its elements, which are identified as the entire span of text between brackets and commas. If the end of json is reached without finding a closing square bracket, then it’s a bad JSON.

var array = function() {
  // ch is at opening bracket, create & return the array
  var array = [];
  if(ch !== '[') error('array should start with [');
  if(next() === ']') return array; // empty array

  do {
    array.push(value());
    if(ch === ']') { // array end reached
      next();
      return array;
    }
  } while(ch && ch === ',' && next()); // found ',' => more elements to go

  error('bad array');
};

Objects

object is implemented very similar to array, except it must also parse the key of each property. The key is just a JSON string though, so simply calling string will do.

var object = function() {
  // ch is at opening curley brace, create & return the object
  var object = {};
  if(ch !== '{') error('object should start with {');
  if(next() === '}') return object; // empty object

  do {
    var key = string(); // get key
    if(ch !== ':') error('object property expecting ":"');
    next();
    object[key] = value(); // create property with whatever value is, perhaps another object/array
    if(ch === '}') {  // object end reached
      next();
      return object;
    }
  } while(ch && ch === ',' && next()); // found ',' => more properties to go

  error('bad object');
};

Execute

Now that all the pieces of value have been implemented, we just need to wrap all the code starting from here into a function - let’s call it parseJSON - and add a couple of lines of code to execute the parser.

function parseJSON(json) {

  /*
  insert code for:
  at, ch, next, error, value, nully, bool, number, escapes, string, array, object
  */

  at = 0;
  ch = json.charAt(at);
  return value();
}

Other Considerations

More on Strings

There is actually one more escape for JSON strings that I did not take into account in my parser - the \u followed by four hexadecimal digits. This escape is used for special characters and symbols. The implementation would be similar to the other escapes. Instead of adding the characters “\”, “u”, and the 4 hexadecimal digits into the string, you would instead add whatever character the escape code represents.

Reviver

The documentation for JSON.parse on MDN shows that there’s an optional second argument, reviver. The reviver function transforms the parsing result before returning it. Another feature that would be good practice to implement.

Resources

I found the Wikipedia entries for parse trees and BNF notation pretty informative.
The railroad diagrams on JSON.org are great for visualizing the JSON data syntax.
For more on Recursive Descent Parsers, check out these links:
- Recursive Descent Parser Tutorial
- JSON Parser Example