The starting point of a BitTorrent client is the torrent file. This file tells the client about the file it is about to download. It has a description of where the client can find peers, how large the final file is expected to be, as well as chunk signatures so that the download can be verified as correct. For this post, I am going to be using sintel.torrent as an example.
A torrent file is bencoded (pronounced b-encoded), which means we will need to write a small parser in order to understand the torrent file.
But first we need to write some CLI program boilerplate.
CLI program shell
In order to create a command line program, you need to read the program arguments. Without arguments you don't know what the program is supposed to do. In our case, we want to locate the file that the user wants to download. To refresh your memory, our command line interface is going to be used as snag fetch <torrent-file>
.
In node.js, the program arguments are stored in a default global called process.argv
. In our case, if we run snag fetch example.torrent
, the process.argv
will look like this:
[ '/usr/bin/node',
'/path/to/project/snag.js',
'fetch',
'example.torrent' ]
So to get the torrent file path we need to check for the 4th entry in argv, or process.argv[3]
. If the user didn't enter a torrent file path, we should show a good error.
let torrent_file_path = process.argv[3]
if (!torrent_file_path) {
console.error('Usage: snag fetch <torrent-file>')
process.exit(1)
}
Once we have the path, we need to check if it exists so we can show a good error if the path doesn't exist.
if (!fs.existsSync(torrent_file_path)) {
console.error(torrent_file_path+' does not exist')
process.exit(1)
}
Then we need to load the torrent file and parse the data out of it.
let bencoded_torrent_data = fs.readFileSync(torrent_file_path)
let torrent_data = decode_torrent(bencoded_torrent_data)
console.log('The torrent you are trying to download:')
console.log(torrent_data)
Bencoded basics
In order to make our decode_torrent
function, we first need to discuss the basics of bencoded data. The format is covered in the spec and in more thorough detail in the meta spec, but here is a brief summary:
- Byte strings are encoded as
<length>:<text>
, so6:foobar
is "foobar". An empty string is encoded0:
. - Number literals are wrapped between
i
ande
(e.g.i42e
is 42). - A list begins with
l
and ends withe
. Everything between are the list items.
For example:li1ei2ei3ee
is[1, 2, 3]
- Dictionaries begin with
d
and end withe
. Everything between is alternating key/value sets.d3:foo3:bare
is{foo: 'bar'}
. - Note that both dictionaries and lists can be empty.
de
andle
respectively.
If we peek at the beginning of our sintel.torrent, we will see this text: d8:announce40:udp...
. The d
at the beginning marks the start of a dictionary, so we will interpret the following values as alternating key-values until we encounter the closing e
. Next, the text 8:announce
means that the first key in the dictionary is "announce". The next value we see is 40:udp://tracker.leechers-paradise.org:6969
(the first value in the dictionary). So our dictionary at this point in the parsing is {announce: 'udp://tracker.leechers-paradise.org:6969'}
.
Bencode parser
We're going to implement the bencoding parser as a standalone module.
To write a parser that handles a bencoded document, we first need to write functions that can decode each of the value types (string, number, list, dictionary). Instead of directly returning the parsed value, each individual parser function will return a "token" object of which looks like this: {value: ..., length: ...}
. "value" will be the parsed value, and "length" will be the total number of characters that were read from the original string to produce the final parsed value. Keeping the length like this will make it easier to keep track of where we are in the parse.
One of the simplest values to parse is a number, let's handle that first.
Number parser
To parse a bencoded number we need to pop the i
off the front and the e
from the end. The remaining result can be converted directly to a number. The location of i
is easy, since we expect it always to be the beginning of the value. The location of e
is also easy since we can just search for the first e
in the string. A number cannot have any child values, so the first e
we find will always be the end of the current number.
To find the length of the integer token, we need to take the index of the e
and add 1 (because strings are 0 indexed).
So the final function looks like this:
exports.parse_number = data => {
let start = 1
let end = data.indexOf('e')
if (end === -1) {
throw new Error('Invalid number: '+data.slice(0, 30))
}
let length = end + 1 // because we're including the final 'e'
let value = parseInt(data.slice(start, end))
return {value: value, length: length}
}
We can test that our function is working correctly by adding some tests to our test file (using ava as the testing framework).
import test from 'ava'
import bencoding from './bencoding.js'
test('parsing numbers', t => {
// normal numbers
t.deepEqual(bencoding.parse_number('i234e'),
{value: 234, length: 5})
// negative numbers
t.deepEqual(bencoding.parse_number('i-5500e'),
{value: -5500, length: 7})
// ensure invalid numbers throw an error
t.throws(() => {
bencoding.parse_number('i-5500')
})
})
Byte string parser
Byte strings are also quite easy. We find the location of the :
, then parse everything up to it as a number. Then we take that many bytes from the data buffer and return it as the value.
exports.parse_byte_string = data => {
let number_end = data.indexOf(':')
let string_start = number_end + 1
let string_length = parseInt(data.slice(0, number_end))
let value = data.slice(string_start,
string_start + string_length)
let length = string_start + string_length
return {value: value, length: length}
}
Once again, we can run some tests to show that the code is correct:
test('parsing byte strings', t => {
t.deepEqual(bencoding.parse_byte_string('6:foobar'),
{value: 'foobar', length: 8})
t.deepEqual(bencoding.parse_byte_string('12:foobarfoobari50e'),
{value: 'foobarfoobar', length: 15})
})
Generic value parser
Lists and dictionaries can both contain sub-values (including more dictionaries and lists). In order to support these datatypes, we need a generic function that can parse an unknown value out of a data string.
Luckily, it's easy to see what the datatype is going to be. For dictionaries, lists, and numbers, the first character will start with d
, l
, or i
respectively. To check for a byte string, just walk through the data string until you encounter a non-number or a :
.
Let's first make some helper functions:
function char_at(buffer, index) {
return buffer.slice(index, index+1).toString('utf8')
}
function is_byte_string(data) {
for (let i = 0; i < data.length; i++) {
let char = char_at(data, i)
// ':' cannot be the first character in data
// If this ever becomes true, then we know that every character
// that we found up to this point was a number.
// If we encountered a non-number, the function would have returned
if (i > 0 && char === ':') {
return true
}
if (!'0123456789'.includes(char)) {
return false
}
}
throw new Error('Found only digits: '+data.slice(0, 50))
}
And then our generic type parser:
exports.parse_any = data => {
switch (char_at(data, 0)) {
case 'd': return exports.parse_dictionary(data)
case 'l': return exports.parse_list(data)
case 'i': return exports.parse_number(data)
}
if (is_byte_string(data)) {
return exports.parse_byte_string(data)
}
throw new Error('Encountered unknown token: '+data.slice(0, 30).toString())
}
We can only write limited tests for this right now, since our parse_list
and parse_dictionary
functions don't exist yet.
test('generic parsing', t => {
t.deepEqual(bencoding.parse_any('6:foobar'),
{value: 'foobar', length: 8})
t.deepEqual(bencoding.parse_any('i234e'),
{value: 234, length: 5})
})
List parser
With this generic value parser in place, we can now write our list parser. Lists can contain sub-values, so now we need to start referencing the token lengths we've been storing so far.
To parse the list, we need to first skip the l
that marks the beginning of the list. Then we keep consuming unknown values from the data string until the first unprocessed character is an e
, which marks the end of the list. For every child value we encounter, we discard it's source characters from the data string and add the child length to the length of the list.
exports.parse_list = data => {
let length = 2 // list is wrapped in `l/e`, so the length is at least 2
let list_value = []
data = data.slice(1) // discard the first 'l'
while (char_at(data, 0) !== 'e') {
if (!data.length) {
throw new Error('Unterminated list')
}
let next_value = exports.parse_any(data)
list_value.push(next_value.value)
length += next_value.length
data = data.slice(next_value.length)
}
return {value: list_value, length: length}
}
And our ava tests should that show that the code is working:
test('parsing lists', t => {
t.deepEqual(bencoding.parse_list(`le`),
{value: [], length: 2})
t.deepEqual(bencoding.parse_list(`li1ei2ei3ee`),
{value: [1, 2, 3], length: 11})
t.deepEqual(bencoding.parse_list(`l6:foobare`),
{value: ['foobar'], length: 10})
t.throws(() => {
// this list is incorrectly terminated
bencoding.parse_list(`li1ei2ei3e`)
})
})
Dictionary parser
Finally, the dictionary parser. The code for this one is very similar to the list parser code. The only difference is that we need to alternately read the child items as keys and values.
exports.parse_dictionary = data => {
let length = 2 // list is wrapped in `d/e`, so the length is at least 2
let dict_value = {}
data = data.slice(1) // discard the first 'd'
let key
while (char_at(data, 0) !== 'e') {
if (!data.length) {
throw new Error('Unterminated dictionary')
}
let next_value = exports.parse_any(data)
length += next_value.length
data = data.slice(next_value.length)
if (key) {
dict_value[key] = next_value.value
key = null
} else {
key = next_value.value
}
}
return {value: dict_value, length: length}
}
And the tests:
test('parsing dictionaries', t => {
t.deepEqual(bencoding.parse_dictionary('d3:foo3:bare'),
{value: {foo: 'bar'}, length: 12})
t.deepEqual(bencoding.parse_dictionary('de'),
{value: {}, length: 2})
t.throws(() => {
bencoding.parse_dictionary('d')
})
})
And finally we'll make a parse
function to give the module a clean interface for other code to call:
// so we can call bencoding.parse(data)
exports.parse = data => {
return exports.parse_any(data).value
}
Parsing the Torrent File
So now the bones of our program look like this.
const bencoding = require('./bencoding.js')
const fs = require('fs')
let torrent_file_path = process.argv[2]
if (!torrent_file_path) {
console.error('Usage: snag fetch <torrent-file>')
process.exit(1)
}
if (!fs.existsSync(torrent_file_path)) {
console.error(torrent_file_path+' does not exist')
process.exit(1)
}
let bencoded_torrent_data = fs.readFileSync(torrent_file_path)
let torrent_data = bencoding.parse(bencoded_torrent_data)
console.log('The torrent you are trying to download:')
console.log(torrent_data)
And the result of running our program on sintel.torrent
like this:
The torrent you are trying to download:
{ announce:
<Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 6c 65 65 63 68 65 72 73 2d 70 61 72 61 64 69 73 65 2e 6f 72 67 3a 36 39 36 39>,
'announce-list':
[ [ <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 6c 65 65 63 68 65 72 73 2d 70 61 72 61 64 69 73 65 2e 6f 72 67 3a 36 39 36 39> ],
[ <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 63 6f 70 70 65 72 73 75 72 66 65 72 2e 74 6b 3a 36 39 36 39> ],
[ <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 6f 70 65 6e 74 72 61 63 6b 72 2e 6f 72 67 3a 31 33 33 37> ],
[ <Buffer 75 64 70 3a 2f 2f 65 78 70 6c 6f 64 69 65 2e 6f 72 67 3a 36 39 36 39> ],
[ <Buffer 75 64 70 3a 2f 2f 74 72 61 63 6b 65 72 2e 65 6d 70 69 72 65 2d 6a 73 2e 75 73 3a 31 33 33 37> ],
[ <Buffer 77 73 73 3a 2f 2f 74 72 61 63 6b 65 72 2e 62 74 6f 72 72 65 6e 74 2e 78 79 7a> ],
[ <Buffer 77 73 73 3a 2f 2f 74 72 61 63 6b 65 72 2e 6f 70 65 6e 77 65 62 74 6f 72 72 65 6e 74 2e 63 6f 6d> ],
[ <Buffer 77 73 73 3a 2f 2f 74 72 61 63 6b 65 72 2e 66 61 73 74 63 61 73 74 2e 6e 7a> ] ],
comment:
<Buffer 57 65 62 54 6f 72 72 65 6e 74 20 3c 68 74 74 70 73 3a 2f 2f 77 65 62 74 6f 72 72 65 6e 74 2e 69 6f 3e>,
'created by':
<Buffer 57 65 62 54 6f 72 72 65 6e 74 20 3c 68 74 74 70 73 3a 2f 2f 77 65 62 74 6f 72 72 65 6e 74 2e 69 6f 3e>,
'creation date': 1490916637,
encoding: <Buffer 55 54 46 2d 38>,
info:
{ files:
[ [Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object] ],
name: <Buffer 53 69 6e 74 65 6c>,
'piece length': 131072,
pieces:
<Buffer 21 48 c7 3c 5b f0 19 c2 f5 96 04 af a8 f5 be 2b c0 c6 07 0f e1 e9 be 98 b8 b9 9a 99 aa 44 e0 52 e8 c6 5e 30 09 8a 18 22 44 8e 73 25 2a 0a d6 fc 88 bb ... > },
'url-list':
[ <Buffer 68 74 74 70 73 3a 2f 2f 77 65 62 74 6f 72 72 65 6e 74 2e 69 6f 2f 74 6f 72 72 65 6e 74 73 2f> ] }
That looks pretty good, but you'll notice that there are a lot of <Buffer ...>
s in the output. This is because bencoded data doesn't make any distinction between human readable text and data buffers. In order to make this a little more workable, we need to make a decode_torrent
function to clean up our data types a bit. There are some fields that we know should be a string, so we can write the following code to convert everything:
function decode_torrent(data) {
let torrent = bencoding.parse(data)
torrent.announce = torrent.announce.toString()
torrent['announce-list'] = torrent['announce-list'].map(buffer => {
return buffer.toString()
})
torrent['created by'] = torrent['created by'].toString()
torrent['url-list'] = torrent['url-list'].map(buffer => {
return buffer.toString()
})
torrent.comment = torrent.comment.toString()
torrent.encoding = torrent.encoding.toString()
torrent.info.name = torrent.info.name.toString()
torrent.info.files = torrent.info.files.map(file => {
file.path = file.path.toString()
return file
})
return torrent
}
and the following program output:
The torrent you are trying to download:
{ announce: 'udp://tracker.leechers-paradise.org:6969',
'announce-list':
[ 'udp://tracker.leechers-paradise.org:6969',
'udp://tracker.coppersurfer.tk:6969',
'udp://tracker.opentrackr.org:1337',
'udp://explodie.org:6969',
'udp://tracker.empire-js.us:1337',
'wss://tracker.btorrent.xyz',
'wss://tracker.openwebtorrent.com',
'wss://tracker.fastcast.nz' ],
comment: 'WebTorrent <https://webtorrent.io>',
'created by':
'udp://tracker.leechers-paradise.org:6969,udp://tracker.coppersurfer.tk:6969,udp://tracker.opentrackr.org:1337,udp://explodie.org:6969,udp://tracker.empire-js.us:1337,wss://tracker.btorrent.xyz,wss://tracker.openwebtorrent.com,wss://tracker.fastcast.nz',
'creation date': 1490916637,
encoding: 'UTF-8',
info:
{ files:
[ [Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object] ],
name: 'Sintel',
'piece length': 131072,
pieces:
<Buffer 21 48 c7 3c 5b f0 19 c2 f5 96 04 af a8 f5 be 2b c0 c6 07 0f e1 e9 be 98 b8 b9 9a 99 aa 44 e0 52 e8 c6 5e 30 09 8a 18 22 44 8e 73 25 2a 0a d6 fc 88 bb ... > },
'url-list':
[ 'udp://tracker.leechers-paradise.org:6969',
'udp://tracker.coppersurfer.tk:6969',
'udp://tracker.opentrackr.org:1337',
'udp://explodie.org:6969',
'udp://tracker.empire-js.us:1337',
'wss://tracker.btorrent.xyz',
'wss://tracker.openwebtorrent.com',
'wss://tracker.fastcast.nz' ] }
Re-encoding data
When we talk to a torrent tracker later on, we're going to need to calculate the signature (hash) of only parts of the torrent file, which means that we need the ability to re-encode values back to bencoded text before we hash it.
The encoding code is simpler than the decoding code:
Numbers
exports.encode_number = value => {
return `i${value}e`
}
and the tests:
test('encoding numbers', t => {
t.deepEqual(bencoding.encode_number(42), 'i42e')
t.deepEqual(bencoding.encode_number(-42), 'i-42e')
})
Byte strings
exports.encode_byte_string = buffer => {
return `${buffer.length}:${buffer}`
}
and the tests:
test('encoding byte strings', t => {
t.deepEqual(bencoding.encode_byte_string('hi there'),
'8:hi there')
})
Generic
Just like with decoding, we need the ability to encode a generic value in order to deal with lists and dictionaries which can have children of any type.
exports.encode = data => {
if (Array.isArray(data)) {
return exports.encode_list(data)
}
if (typeof data === 'object') {
return exports.encode_dictionary(data)
}
if (typeof data === 'string' || data instanceof Buffer) {
return exports.encode_byte_string(data)
}
if (typeof data === 'number') {
return exports.encode_number(data)
}
throw new Error(`Unknown data type: ${data}`)
}
Lists
exports.encode_list = list => {
let children = list.map(child => {
return exports.encode(child)
}).join('')
return `l${children}e`
}
and tests:
test('encoding lists', t => {
t.deepEqual(bencoding.encode_list([1, 2, 'hi']), 'li1ei2e2:hie')
})
Dictionaries
exports.encode_dictionary = dict => {
let result = 'd'
for (let [key, value] of Object.entries(dict)) {
result += exports.encode(key)+exports.encode(value)
}
return result+'e'
}
and tests:
test('encoding dictionaries', t => {
t.deepEqual(bencoding.encode_dictionary({foobar: [1, 2, 3]}),
'd6:foobarli1ei2ei3eee')
})
The end bit
Congratulations! We now have a working torrent file parser written from scratch, including the ability to parse and re-encode bencoded data.
The final version of this code can be found in the companion Gitlab repository.
The next post will focus on how to to talk to trackers.
Once again, if there are any problems understanding the content or if you find any bugs, please let me know.