The starting point of a BitTorrent client is the torrent file. This file tells the client about the file it is about to download. It has a description of where the client can find peers, how large the file is expected to be, as well as chunk signatures so that the download can be verified. For this post, I am going to be using sintel.torrent as our example.

A torrent file is bencoded (pronounced b-encoded), which means we will need to write a small parser in order to understand the torrent file. That is the goal for this first post, but first we need to write some CLI program boilerplate.

CLI program shell

In order to create a command line program, you need to read the program arguments. Without arguments you don't know what the program is supposed to do. In our case, we want to locate the file that the user wants to download. To refresh your memory, our command line interface is going to be used as bt-get <torrent-file>.

In node.js, the program arguments are stored in a default global called process.argv. In our case, if we run bt-get example.torrent, the process.argv will look like this:

[ '/usr/bin/node',
  '/path/to/project/bt-get.js',
  'example.torrent' ]

So to get the torrent file path we need to check for the 3rd entry in argv, or process.argv[2]. If the user didn't enter a torrent file path, we should show a good error.

let torrent_file_path = process.argv[2]
if (!torrent_file_path) {
	console.error('Usage: bt-get <torrent-file>')
	process.exit(1)
}

Once we have the path, we need to check if it exists so we can show a good error if the path doesn't exist.

if (!fs.existsSync(torrent_file_path)) {
    console.error(torrent_file_path+' does not exist')
    process.exit(1)
}

Then we need to load the torrent file and parse the data out of it.

let bencoded_torrent_data = fs.readFileSync(torrent_file_path)
let torrent_data = decode_torrent(bencoded_torrent_data)
console.log('The torrent you are trying to download:')
console.log(torrent_data)

Bencoded basics

In order to make our decode_torrent function work, we first need to discuss the basics of bencoded data. The format is covered in the spec and in more thorough detail in the meta spec, but here is a brief summary:

  • Byte strings are encoded as <length>:<text>, so 6:foobar is "foobar". An empty string is encoded 0:.
  • Number literals are wrapped between i and e (e.g. i42e is 42).
  • A list begins with l and ends with e. Everything between are the list items.
    For example: li1ei2ei3ee is [1, 2, 3]
  • Dictionaries begin with d and end with e. Everything between is alternating key/value sets. d3:foo3:bare is {foo: 'bar'}.
  • Note that both dictionaries and lists can be empty. de and le respectively.

If we peek at the beginning of our sintel.torrent, we will see this text: d8:announce40:udp. The d at the beginning marks the start of a dictionary, so we will interpret the following values as alternating key-values until we encounter the closing e. Next, the text 8:announce means that the first key in the dictionary is "announce". The next value we see is 40:udp://tracker.leechers-paradise.org:6969 (the first value in the dictionary). So our dictionary so far is {announce: 'udp://tracker.leechers-paradise.org:6969'}. And so on.

Bencode parser

To write a parser that handles a bencoded document, we first need to write functions that can decode each of the value types. Instead of directly returning the parsed value, each parsing function will return a "token" object of which looks like this: {value: ..., length: ...}. "value" will be the parsed value, and "length" will be the number of characters that were consumed to produce the final parsed value. Keeping the length like this will make it easier to chop up the input string in the more complex data type parsers.

One of the simplest values to parse is a number, let's handle that first.

Number parser

To parse a bencoded number we need to pop the i off the front and the e from the end. The remaining result can be converted directly to a number. The location of i is easy, since we expect it always to be the beginning of the value. The location of e is also easy since we can just search for the first e in the string. A number cannot have nested values, so the first e we find will always be the end of the current number.

To find the length of the integer token, we need to take the index of the e and add 1 (because strings are 0 indexed).

So the final function looks like this:

function parse_number(data) {
    let start = 1
    let end = data.indexOf('e')
	if (end === -1) {
		throw new Error('Number did not end correctly '+data.slice(0, 30))
	}
    let length = end + 1
    let value = parseInt(data.slice(start, end))
    return {value: value, length: length}
}

We can test that our function is working correctly by running some test parses.

console.log(parse_number('i234e'))
//> { value: 234, length: 5 }
console.log(parse_number('i-5500e'))
//> { value: -5500, length: 7 }
console.log(parse_number('i-5500'))
//> Error: Number did not end correctly: i-5500

Byte string parser

Byte strings are also quite easy. We find the location of the :, then parse everything up to it as a number. Then we take that many bytes from the data buffer and return it as the value.

function parse_number(data) {
    let start = 1
    let end = data.indexOf('e')
    if (end === -1) {
    	throw new Error('Number did not end correctly: '+data.slice(0, 30))
    }
    let length = end + 1
    let value = parseInt(data.slice(start, end))
    return {value: value, length: length}
}

Once again, we can run some tests to show that the code is correct:

console.log(parse_byte_string('6:foobar'))
//> { value: 'foobar', length: 8 }
console.log(parse_byte_string('12:foobarfoobari50e'))
//> { value: 'foobarfoobar', length: 15 }

Generic value parser

Lists and dictionaries can both contain sub-values (including more dictionaries and lists). In order to support these datatypes, we need a function that can parse an unknown value out of a data string.

Luckily, it is really easy to see what the datatype is going to be. For dictionary/list/number, you can just check the first character of the string. To check for a byte string, just walk through the data string until you encounter a non-number or a :.

function parse_unknown(data) {
  if (data[0] === 'd') {
    return parse_dictionary(data)
  }
  if (data[0] === 'l') {
    return parse_list(data)
  }
  if (data[0] === 'i') {
    return parse_number(data)
  }
  if (is_byte_string(data)) {
    return parse_byte_string(data)
  }
  throw new Error('Encountered unknown token: '+data.slice(0, 30).toString())
}

let digits = '0123456789'
function is_byte_string(data) {
  for (let i = 0; i < data.length; i++) {
    // ':' cannot be the first character in data
    // If this ever becomes true, then we know that every character
    // that we found up to this point was a number.
    // If we encountered a non-number, the function would have returned
    if (i > 0 && data[i] === ':') {
      return true
    }
    if (!digits.includes(data[i])) {
      return false
    }
  }
  throw new Error('Found only digits to end of data: '+data)
}

Edit: Our first bug! This version of the code doesn't work because we're treating data as a string, when it's really a Buffer. So when we take an index data, we get a number (58) instead of a string  (':'). This affects is_byte_string and parse_unknown, as well as the initial draft versions of  parse_list, and parse_dictionary (found below). The reason we didn't catch it before was because we were testing with console.log and string inputs. That just goes to show you that passing tests don't always mean everything is OK.

If we look at the Buffer docs, we find that Buffers supports calling .toString() with encoding, start, and end. So we can correct our code by adding a utility method called char_at(buffer, index). The new code looks like this:


function parse_unknown(data) {
  let first_char = char_at(data, 0)
  if (first_char === 'd') {
    return parse_dictionary(data)
  }
  if (first_char === 'l') {
    return parse_list(data)
  }
  if (first_char === 'i') {
    return parse_number(data)
  }
  if (is_byte_string(data)) {
    return parse_byte_string(data)
  }
  throw new Error('Encountered unknown token: '+data.slice(0, 30).toString())
}

let digits = '0123456789'
function is_byte_string(data) {
  for (let i = 0; i < data.length; i++) {
    let char = char_at(data, i)
    // ':' cannot be the first character in data
    // If this ever becomes true, then we know that every character
    // that we found up to this point was a number.
    // If we encountered a non-number, the function would have returned
    if (i > 0 && char === ':') {
      return true
    }
    if (!digits.includes(char)) {
      return false
    }
  }
  throw new Error('Found only digits: '+data.slice(0, 50))
}

function char_at(buffer, index) {
  return buffer.toString('utf8', index, index+1)
}

List parser

With this unknown value parser in place, we can now write our list parser. Lists can contain sub-values, so the lengths that we've been storing so far start to come in handy.

To parse the list, we need to first skip the l that marks the beginning of the list. Then we parse unknown values from the beginning of the data string until the first unprocessed character is an e to mark the end of the list. For every newly parsed value, we discard it from the data string and add it's length to the length of the list.

function parse_list(data) {
  let length = 2 // list is wrapped in `l/e`, so the length is at least 2
  let list_value = []
  data = data.slice(1) // discard the first 'l'
  while (data[0] !== 'e') {
    if (!data.length) {
      throw new Error('Never encountered the end of the list')
    }
    let next_value = parse_value(data)
    list_value.push(next_value.value)
    length += next_value.length
    data = data.slice(next_value.length)
  }
  return {value: list_value, length: length}
}

And the tests that show that the code is working:

console.log(parse_list(`le`))
//> { value: [], length: 2 }
console.log(parse_list(`li1ei2ei3ee`))
//> { value: [ 1, 2, 3 ], length: 11 }
console.log(parse_list(`l6:foobare`))
//> { value: [ 'foobar' ], length: 10 }
console.log(parse_list(`li1ei2ei3e`))
//> Error: Never encountered the end of the list

Dictionary parser

Finally, the dictionary parser. The code for this one is very similar to the list parser code. The only difference is that we alternately set parsed values as keys and values.

function parse_dictionary(data) {
  let length = 2 // list is wrapped in `d/e`, so the length is at least 2
  let dict_value = {}
  data = data.slice(1) // discard the first 'd'
  let key
  while (data[0] !== 'e') {
    if (!data.length) {
      throw new Error('Never encountered the end of the dictionary')
    }
    let next_value = parse_value(data)
    length += next_value.length
    data = data.slice(next_value.length)
    if (key) {
      dict_value[key] = next_value.value
      key = null
    } else {
      key = next_value.value
    }
  }
  return {value: dict_value, length: length}
}

And the tests:

console.log(parse_dictionary('d3:foo3:bare'))
//> { value: { foo: 'bar' }, length: 12 }
console.log(parse_dictionary('de'))
//> { value: {}, length: 2 }
console.log(parse_dictionary('d'))
//> Error: Never encountered the end of the dictionary

Parsing the Torrent File

To refresh your memory, our program core looks like this:

let torrent_file_path = process.argv[2]
  if (!torrent_file_path) {
    console.error('Usage: bt-get <torrent-file>')
    process.exit(1)
  }
  if (!fs.existsSync(torrent_file_path)) {
    console.error(torrent_file_path+' does not exist')
    process.exit(1)
  }
  let bencoded_torrent_data = fs.readFileSync(torrent_file_path)
  let torrent_data = decode_torrent(bencoded_torrent_data)
  console.log('The torrent you are trying to download:')
  console.log(torrent_data)

According to the BitTorrent spec, a .torrent file will always be encoded as a dictionary. So in the end our decode_torrent function is quite small.

function decode_torrent(torrent) {
	let token = parse_dictionary(torrent)
	return token.value
}

And the result of running our program on sintel.torrent like this:

announce: <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 6c 65 65 63 68 65 72 73 2d 70 61 72 61 64 69 73 65 2e 6f 72 67 3a 36 39 36 39>
announce-list: [ [ <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 6c 65 65 63 68 65 72 73 2d 70 61 72 61 64 69 73 65 2e 6f 72 67 3a 36 39 36 39> ],
  [ <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 63 6f 70 70 65 72 73 75 72 66 65 72 2e 74 6b 3a 36 39 36 39> ],
  [ <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 6f 70 65 6e 74 72 61 63 6b 72 2e 6f 72 67 3a 31 33 33 37> ],
  [ <Buffer 75 64 70 3a 2f 2f 65 78 70 6c 6f 64 69 65 2e 6f 72 67 3a 36 39 36 39> ],
  [ <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 65 6d 70 69 72 65 2d 6a 73 2e 75 73 3a 31 33 33 37> ],
  [ <Buffer 77 73 73 3a 2f 2f 74 72 61 63 6b 65 72 2e 62 74 6f 72 72 65 6e 74 2e 78 79 7a> ],
  [ <Buffer 77 73 73 3a 2f 2f 74 72 61 63 6b 65 72 2e 6f 70 65 6e 77 65 62 74 6f 72 72 65 6e 74 2e 63 6f 6d> ],
  [ <Buffer 77 73 73 3a 2f 2f 74 72 61 63 6b 65 72 2e 66 61 73 74 63 61 73 74 2e 6e 7a> ] ]
comment: <Buffer 57 65 62 54 6f 72 72 65 6e 74 20 3c 68 74 74 70 73 3a 2f 2f 77 65 62 74 6f 72 72 65 6e 74 2e 69 6f 3e>
created by: <Buffer 57 65 62 54 6f 72 72 65 6e 74 20 3c 68 74 74 70 73 3a 2f 2f 77 65 62 74 6f 72 72 65 6e 74 2e 69 6f 3e>
creation date: 1490916637
encoding: <Buffer 55 54 46 2d 38>
info: { files:
   [ { length: 1652, path: [Array] },
     { length: 1514, path: [Array] },
     { length: 1554, path: [Array] },
     { length: 1618, path: [Array] },
     { length: 1546, path: [Array] },
     { length: 129241752, path: [Array] },
     { length: 1537, path: [Array] },
     { length: 1536, path: [Array] },
     { length: 1551, path: [Array] },
     { length: 2016, path: [Array] },
     { length: 46115, path: [Array] } ],
  name: <Buffer 53 69 6e 74 65 6c>,
  'piece length': 131072,
  pieces:
   <Buffer 21 48 c7 3c 5b f0 19 c2 f5 96 04 af a8 f5 be 2b c0 c6 07 0f e1 e9 be 98 b8 b9 9a 99 aa 44 e0 52 e8 c6 5e 30 09 8a 18 22 44 8e 73 25 2a 0a d6 fc 88 bb ... 19690 more bytes> }
url-list: [ <Buffer 68 74 74 70 73 3a 2f 2f 77 65 62 74 6f 72 72 65 6e 74 2e 69 6f 2f 74 6f 72 72 65 6e 74 73 2f> ]

That looks pretty good, but you'll notice that a lot of things that look like they should be strings are actually buffer arrays. So our decode_torrent file needs to be a bit more complicated after all. After we add some code to everything that looks like it should be a string, we have the following code:

function decode_torrent(data_string) {
  let torrent = parse_dictionary(data_string).value
  torrent.announce = torrent.announce.toString()
  torrent['announce-list'] = torrent['announce-list'].map(buffer => {
    return buffer.toString()
  })
  torrent['created by'] = torrent['announce-list'].toString()
  torrent['url-list'] = torrent['announce-list'].map(buffer => {
    return buffer.toString()
  })
  torrent.comment = torrent.comment.toString()
  torrent.encoding = torrent.encoding.toString()
  torrent.info.name = torrent.info.name.toString()
  torrent.info.files = torrent.info.files.map(file => {
    file.path = file.path.toString()
    return file
  })
  return torrent
}

and the following output:

{ announce: 'udp://tracker.leechers-paradise.org:6969',
  'announce-list':
   [ 'udp://tracker.leechers-paradise.org:6969',
     'udp://tracker.coppersurfer.tk:6969',
     'udp://tracker.opentrackr.org:1337',
     'udp://explodie.org:6969',
     'udp://tracker.empire-js.us:1337',
     'wss://tracker.btorrent.xyz',
     'wss://tracker.openwebtorrent.com',
     'wss://tracker.fastcast.nz' ],
  comment: 'WebTorrent <https://webtorrent.io>',
  'created by':
   'udp://tracker.leechers-paradise.org:6969,udp://tracker.coppersurfer.tk:6969,udp://tracker.opentrackr.org:1337,udp://explodie.org:6969,udp://tracker.empire-js.us:1337,wss://tracker.btorrent.xyz,wss://tracker.openwebtorrent.com,wss://tracker.fastcast.nz',
  'creation date': 1490916637,
  encoding: 'UTF-8',
  info:
   { files:
      [ [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object] ],
     name: 'Sintel',
     'piece length': 131072,
     pieces:
      <Buffer 21 48 c7 3c 5b f0 19 c2 f5 96 04 af a8 f5 be 2b c0 c6 07 0f e1 e9 be 98 b8 b9 9a 99 aa 44 e0 52 e8 c6 5e 30 09 8a 18 22 44 8e 73 25 2a 0a d6 fc 88 bb ... 19690 more bytes> },
  'url-list':
   [ 'udp://tracker.leechers-paradise.org:6969',
     'udp://tracker.coppersurfer.tk:6969',
     'udp://tracker.opentrackr.org:1337',
     'udp://explodie.org:6969',
     'udp://tracker.empire-js.us:1337',
     'wss://tracker.btorrent.xyz',
     'wss://tracker.openwebtorrent.com',
     'wss://tracker.fastcast.nz' ] }

The final version of this code can be found at the companion Gitlab repository.

The end bit

Congratulations! We now have a working torrent file parser written from scratch. The next thing we want to work on is probably connecting to a tracker server and asking for some peers. I'm pretty sure that's how torrents work. Either way, that's research to be done next time.

Once again, if there are any problems understanding the content or if you find any bugs, please let me know.

Post index