The starting point of a BitTorrent client is the torrent file. This file tells the client about the file it is about to download. It has a description of where the client can find peers, how large the file is expected to be, as well as chunk signatures so that the download can be verified. For this post, I am going to be using sintel.torrent as an example.

A torrent file is bencoded (pronounced b-encoded), which means we will need to write a small parser in order to understand the torrent file.

But first we need to write some CLI program boilerplate.

CLI program shell

In order to create a command line program, you need to read the program arguments. Without arguments you don't know what the program is supposed to do. In our case, we want to locate the file that the user wants to download. To refresh your memory, our command line interface is going to be used as snag fetch <torrent-file>.

In node.js, the program arguments are stored in a default global called process.argv. In our case, if we run snag fetch example.torrent, the process.argv will look like this:

[ '/usr/bin/node',
  '/path/to/project/snag.js',
  'fetch',
  'example.torrent' ]

So to get the torrent file path we need to check for the 4th entry in argv, or process.argv[3]. If the user didn't enter a torrent file path, we should show a good error.

let torrent_file_path = process.argv[3]
if (!torrent_file_path) {
	console.error('Usage: snag fetch <torrent-file>')
	process.exit(1)
}

Once we have the path, we need to check if it exists so we can show a good error if the path doesn't exist.

if (!fs.existsSync(torrent_file_path)) {
    console.error(torrent_file_path+' does not exist')
    process.exit(1)
}

Then we need to load the torrent file and parse the data out of it.

let bencoded_torrent_data = fs.readFileSync(torrent_file_path)
let torrent_data = decode_torrent(bencoded_torrent_data)
console.log('The torrent you are trying to download:')
console.log(torrent_data)

Bencoded basics

In order to make our decode_torrent function, we first need to discuss the basics of bencoded data. The format is covered in the spec and in more thorough detail in the meta spec, but here is a brief summary:

  • Byte strings are encoded as <length>:<text>, so 6:foobar is "foobar". An empty string is encoded 0:.
  • Number literals are wrapped between i and e (e.g. i42e is 42).
  • A list begins with l and ends with e. Everything between are the list items.
    For example: li1ei2ei3ee is [1, 2, 3]
  • Dictionaries begin with d and end with e. Everything between is alternating key/value sets. d3:foo3:bare is {foo: 'bar'}.
  • Note that both dictionaries and lists can be empty. de and le respectively.

If we peek at the beginning of our sintel.torrent, we will see this text: d8:announce40:udp.... The d at the beginning marks the start of a dictionary, so we will interpret the following values as alternating key-values until we encounter the closing e. Next, the text 8:announce means that the first key in the dictionary is "announce". The next value we see is 40:udp://tracker.leechers-paradise.org:6969 (the first value in the dictionary). So our dictionary at this point in the parsing is {announce: 'udp://tracker.leechers-paradise.org:6969'}.

Bencode parser

We're going to implement the bencoding parser as a standalone module.

To write a parser that handles a bencoded document, we first need to write functions that can decode each of the value types. Instead of directly returning the parsed value, each parsing function will return a "token" object of which looks like this: {value: ..., length: ...}. "value" will be the parsed value, and "length" will be the number of characters that were consumed to produce the final parsed value. Keeping the length like this will make it easier to chop up the input string in the more complex data type parsers.

One of the simplest values to parse is a number, let's handle that first.

Number parser

To parse a bencoded number we need to pop the i off the front and the e from the end. The remaining result can be converted directly to a number. The location of i is easy, since we expect it always to be the beginning of the value. The location of e is also easy since we can just search for the first e in the string. A number cannot have nested values, so the first e we find will always be the end of the current number.

To find the length of the integer token, we need to take the index of the e and add 1 (because strings are 0 indexed).

So the final function looks like this:

exports.parse_number = data => {
	let start = 1
	let end = data.indexOf('e')
	if (end === -1) {
		throw new Error('Invalid number: '+data.slice(0, 30))
	}
	let length = end + 1
	let value = parseInt(data.slice(start, end))
	return {value: value, length: length}
}

We can test that our function is working correctly by adding some tests to our test file (using ava as the testing framework).

import test from 'ava'
import bencoding from './bencoding.js'

test('parsing numbers', t => {
	t.deepEqual(bencoding.parse_number('i234e'),
		{value: 234, length: 5})
	t.deepEqual(bencoding.parse_number('i-5500e'),
		{value: -5500, length: 7})
	t.throws(() => {
		bencoding.parse_number('i-5500')
	})
})

Byte string parser

Byte strings are also quite easy. We find the location of the :, then parse everything up to it as a number. Then we take that many bytes from the data buffer and return it as the value.

exports.parse_byte_string = data => {
	let number_end = data.indexOf(':')
	let string_start = number_end + 1
	let string_length = parseInt(data.slice(0, number_end))
	let value = data.slice(string_start,
		string_start + string_length)
	let length = string_start + string_length
	return {value: value, length: length}
}

Once again, we can run some tests to show that the code is correct:

test('parsing byte strings', t => {
	t.deepEqual(bencoding.parse_byte_string('6:foobar'),
		{value: 'foobar', length: 8})
	t.deepEqual(bencoding.parse_byte_string('12:foobarfoobari50e'),
		{value: 'foobarfoobar', length: 15})
})

Generic value parser

Lists and dictionaries can both contain sub-values (including more dictionaries and lists). In order to support these datatypes, we need a generic function that can parse an unknown value out of a data string.

Luckily, it's easy to see what the datatype is going to be. For dictionaries, lists, and numbers, the first character will start with d, l, or i respectively. To check for a byte string, just walk through the data string until you encounter a non-number or a :.

Let's first make some helper functions:

function char_at(buffer, index) {
	return buffer.slice(index, index+1).toString('utf8')
}

function is_byte_string(data) {
	for (let i = 0; i < data.length; i++) {
		let char = char_at(data, i)
		// ':' cannot be the first character in data
		// If this ever becomes true, then we know that every character
		// that we found up to this point was a number.
		// If we encountered a non-number, the function would have returned
		if (i > 0 && char === ':') {
			return true
		}
		if (!'0123456789'.includes(char)) {
			return false
		}
	}
	throw new Error('Found only digits: '+data.slice(0, 50))
}

And then our generic type parser:

exports.parse_any = data => {
	switch (char_at(data, 0)) {
		case 'd': return exports.parse_dictionary(data)
		case 'l': return exports.parse_list(data)
		case 'i': return exports.parse_number(data)
	}
	if (is_byte_string(data)) {
		return exports.parse_byte_string(data)
	}
	throw new Error('Encountered unknown token: '+data.slice(0, 30).toString())
}

We can only write limited tests for this right now, since our parse_list and parse_dictionary functions don't exist yet.

test('generic parsing', t => {
	t.deepEqual(bencoding.parse_any('6:foobar'),
		{value: 'foobar', length: 8})
	t.deepEqual(bencoding.parse_any('i234e'),
		{value: 234, length: 5})
})

List parser

With this generic value parser in place, we can now write our list parser. Lists can contain sub-values, so now we need to start referencing the token lengths we've been storing so far.

To parse the list, we need to first skip the l that marks the beginning of the list. Then we keep consuming unknown values from the data string until the first unprocessed character is an e, which marks the end of the list. For every newly parsed child value, we discard it from the data string and add its length to the length of the list.

exports.parse_list = data => {
	let length = 2 // list is wrapped in `l/e`, so the length is at least 2
	let list_value = []
	data = data.slice(1) // discard the first 'l'
	while (char_at(data, 0) !== 'e') {
		if (!data.length) {
			throw new Error('Unterminated list')
		}
		let next_value = exports.parse_any(data)
		list_value.push(next_value.value)
		length += next_value.length
		data = data.slice(next_value.length)
	}
	return {value: list_value, length: length}
}

And the tests that show that the code is working:

test('parsing lists', t => {
	t.deepEqual(bencoding.parse_list(`le`),
		{value: [], length: 2})
	t.deepEqual(bencoding.parse_list(`li1ei2ei3ee`),
		{value: [1, 2, 3], length: 11})
	t.deepEqual(bencoding.parse_list(`l6:foobare`),
		{value: ['foobar'], length: 10})
	t.throws(() => {
		// this list is incorrectly terminated
		bencoding.parse_list(`li1ei2ei3e`)
	})
})

Dictionary parser

Finally, the dictionary parser. The code for this one is very similar to the list parser code. The only difference is that we alternately set parsed values as keys and values.

exports.parse_dictionary = data => {
	let length = 2 // list is wrapped in `d/e`, so the length is at least 2
	let dict_value = {}
	data = data.slice(1) // discard the first 'd'
	let key
	while (char_at(data, 0) !== 'e') {
		if (!data.length) {
			throw new Error('Unterminated dictionary')
		}
		let next_value = exports.parse_any(data)
		length += next_value.length
		data = data.slice(next_value.length)
		if (key) {
			dict_value[key] = next_value.value
			key = null
		} else {
			key = next_value.value
		}
	}
	return {value: dict_value, length: length}
}

And the tests:

test('parsing dictionaries',  t => {
	t.deepEqual(bencoding.parse_dictionary('d3:foo3:bare'),
		{value: {foo: 'bar'}, length: 12})
	t.deepEqual(bencoding.parse_dictionary('de'),
		{value: {}, length: 2})
	t.throws(() => {
		bencoding.parse_dictionary('d')
	})
})

And finally we'll make a parse function to give the module a clean external interface:

exports.parse = data => {
	return exports.parse_any(data).value
}

Parsing the Torrent File

So now the bones of our program look like this.

const bencoding = require('./bencoding.js')
const fs = require('fs')

let torrent_file_path = process.argv[2]
if (!torrent_file_path) {
	console.error('Usage: snag fetch <torrent-file>')
	process.exit(1)
}
if (!fs.existsSync(torrent_file_path)) {
	console.error(torrent_file_path+' does not exist')
	process.exit(1)
}
let bencoded_torrent_data = fs.readFileSync(torrent_file_path)
let torrent_data = bencoding.parse(bencoded_torrent_data)
console.log('The torrent you are trying to download:')
console.log(torrent_data)

And the result of running our program on sintel.torrent like this:

The torrent you are trying to download:
{ announce:
   <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 6c 65 65 63 68 65 72 73 2d 70 61 72 61 64 69 73 65 2e 6f 72 67 3a 36 39 36 39>,
  'announce-list':
   [ [ <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 6c 65 65 63 68 65 72 73 2d 70 61 72 61 64 69 73 65 2e 6f 72 67 3a 36 39 36 39> ],
     [ <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 63 6f 70 70 65 72 73 75 72 66 65 72 2e 74 6b 3a 36 39 36 39> ],
     [ <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 6f 70 65 6e 74 72 61 63 6b 72 2e 6f 72 67 3a 31 33 33 37> ],
     [ <Buffer 75 64 70 3a 2f 2f 65 78 70 6c 6f 64 69 65 2e 6f 72 67 3a 36 39 36 39> ],
     [ <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 65 6d 70 69 72 65 2d 6a 73 2e 75 73 3a 31 33 33 37> ],
     [ <Buffer 77 73 73 3a 2f 2f 74 72 61 63 6b 65 72 2e 62 74 6f 72 72 65 6e 74 2e 78 79 7a> ],
     [ <Buffer 77 73 73 3a 2f 2f 74 72 61 63 6b 65 72 2e 6f 70 65 6e 77 65 62 74 6f 72 72 65 6e 74 2e 63 6f 6d> ],
     [ <Buffer 77 73 73 3a 2f 2f 74 72 61 63 6b 65 72 2e 66 61 73 74 63 61 73 74 2e 6e 7a> ] ],
  comment:
   <Buffer 57 65 62 54 6f 72 72 65 6e 74 20 3c 68 74 74 70 73 3a 2f 2f 77 65 62 74 6f 72 72 65 6e 74 2e 69 6f 3e>,
  'created by':
   <Buffer 57 65 62 54 6f 72 72 65 6e 74 20 3c 68 74 74 70 73 3a 2f 2f 77 65 62 74 6f 72 72 65 6e 74 2e 69 6f 3e>,
  'creation date': 1490916637,
  encoding: <Buffer 55 54 46 2d 38>,
  info:
   { files:
      [ [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object] ],
     name: <Buffer 53 69 6e 74 65 6c>,
     'piece length': 131072,
     pieces:
      <Buffer 21 48 c7 3c 5b f0 19 c2 f5 96 04 af a8 f5 be 2b c0 c6 07 0f e1 e9 be 98 b8 b9 9a 99 aa 44 e0 52 e8 c6 5e 30 09 8a 18 22 44 8e 73 25 2a 0a d6 fc 88 bb ... > },
  'url-list':
   [ <Buffer 68 74 74 70 73 3a 2f 2f 77 65 62 74 6f 72 72 65 6e 74 2e 69 6f 2f 74 6f 72 72 65 6e 74 73 2f> ] }

That looks pretty good, but you'll notice that there are a lot of <Buffer ...>s in the output. This is because bencoded data doesn't make any distinction between human readable text and data buffers. In order to make this a little more workable, we need to make a  decode_torrent function to convert our data types a bit. After we convert everything that should be a string, we have the following code:

function decode_torrent(data) {
	let torrent = bencoding.parse(data)
	torrent.announce = torrent.announce.toString()
	torrent['announce-list'] = torrent['announce-list'].map(buffer => {
		return buffer.toString()
	})
	torrent['created by'] = torrent['announce-list'].toString()
	torrent['url-list'] = torrent['announce-list'].map(buffer => {
		return buffer.toString()
	})
	torrent.comment = torrent.comment.toString()
	torrent.encoding = torrent.encoding.toString()
	torrent.info.name = torrent.info.name.toString()
	torrent.info.files = torrent.info.files.map(file => {
		file.path = file.path.toString()
		return file
	})
	return torrent
}

and the following program output:

The torrent you are trying to download:
{ announce: 'udp://tracker.leechers-paradise.org:6969',
  'announce-list':
   [ 'udp://tracker.leechers-paradise.org:6969',
     'udp://tracker.coppersurfer.tk:6969',
     'udp://tracker.opentrackr.org:1337',
     'udp://explodie.org:6969',
     'udp://tracker.empire-js.us:1337',
     'wss://tracker.btorrent.xyz',
     'wss://tracker.openwebtorrent.com',
     'wss://tracker.fastcast.nz' ],
  comment: 'WebTorrent <https://webtorrent.io>',
  'created by':
   'udp://tracker.leechers-paradise.org:6969,udp://tracker.coppersurfer.tk:6969,udp://tracker.opentrackr.org:1337,udp://explodie.org:6969,udp://tracker.empire-js.us:1337,wss://tracker.btorrent.xyz,wss://tracker.openwebtorrent.com,wss://tracker.fastcast.nz',
  'creation date': 1490916637,
  encoding: 'UTF-8',
  info:
   { files:
      [ [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object] ],
     name: 'Sintel',
     'piece length': 131072,
     pieces:
      <Buffer 21 48 c7 3c 5b f0 19 c2 f5 96 04 af a8 f5 be 2b c0 c6 07 0f e1 e9 be 98 b8 b9 9a 99 aa 44 e0 52 e8 c6 5e 30 09 8a 18 22 44 8e 73 25 2a 0a d6 fc 88 bb ... > },
  'url-list':
   [ 'udp://tracker.leechers-paradise.org:6969',
     'udp://tracker.coppersurfer.tk:6969',
     'udp://tracker.opentrackr.org:1337',
     'udp://explodie.org:6969',
     'udp://tracker.empire-js.us:1337',
     'wss://tracker.btorrent.xyz',
     'wss://tracker.openwebtorrent.com',
     'wss://tracker.fastcast.nz' ] }

So our bencoding parser seems to be working!

Re-encoding data

When we talk to a torrent tracker later on, we're going to need to calculate the signatures of parts of the torrent file, which means that we also need the ability to convert data back to bencoded text. This looks like:

Numbers

exports.encode_number = value => {
	return `i${value}e`
}

and the tests:

test('encoding numbers', t => {
	t.deepEqual(bencoding.encode_number(42), 'i42e')
	t.deepEqual(bencoding.encode_number(-42), 'i-42e')
})

Byte strings

exports.encode_byte_string = buffer => {
	return `${buffer.length}:${buffer}`
}

and the tests:

test('encoding byte strings', t => {
	t.deepEqual(bencoding.encode_byte_string('hi there'),
		'8:hi there')
})

Generic

Just like with decoding, we need the ability to encode a generic value in order to deal with lists and dictionaries which can have children of any type.

exports.encode = data => {
	if (Array.isArray(data)) {
		return exports.encode_list(data)
	}
	if (typeof data === 'object') {
		return exports.encode_dictionary(data)
	}
	if (typeof data === 'string' || data instanceof Buffer) {
		return exports.encode_byte_string(data)
	}
	if (typeof data === 'number') {
		return exports.encode_number(data)
	}
	throw new Error(`Unknown data type: ${data}`)
}

Lists

exports.encode_list = list => {
	let children = list.map(child => {
		return exports.encode(child)
	}).join('')
	return `l${children}e`
}

and tests:

test('encoding lists', t => {
	t.deepEqual(bencoding.encode_list([1, 2, 'hi']), 'li1ei2e2:hie')
})

Dictionaries

exports.encode_dictionary = dict => {
	let result = 'd'
	for (let [key, value] of Object.entries(dict)) {
		result += exports.encode(key)+exports.encode(value)
	}
	return result+'e'
}

and tests:

test('encoding dictionaries', t => {
	t.deepEqual(bencoding.encode_dictionary({foobar: [1, 2, 3]}),
		'd6:foobarli1ei2ei3eee')
})

The end bit

Congratulations! We now have a working torrent file parser written from scratch, including the ability to parse and re-encode bencoded data.

The final version of this code can be found in the companion Gitlab repository.

The next post will focus on how to to talk to trackers.

Once again, if there are any problems understanding the content or if you find any bugs, please let me know.

Post index