Mirroring Wikimedia project XML dumps/Torrents

rTorrent is quite efficient at handling thousands of torrents and can be useful for spreading toollabs:dump-torrents on trackers and the DHT network, so that they're indexed by various BitTorrent search engines.

Interface

It's easy to download all torrents at once:

wget -r -np -nH -nd -A torrent https://tools.wmflabs.org/dump-torrents/

This can take about an hour from Labs, or several hours on another server:

Total wall clock time: 4h 26m 15s
Downloaded: 128558 files, 325M in 41s (7.86 MB/s)

Then, in rTorrent, just press enter, type the pattern of the files to load (e.g. *pages-meta*torrent or *7z*torrent) and press enter again. See https://github.com/rakshasa/rtorrent/wiki/User-Guide for more.

You can then start all torrents at once.

You can also use a watch directory and copy or move the torrents you want to add into it.
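
For example, assuming the watch directory from the configuration below (~/rtorrent/autodownload), the 7z torrents downloaded above could be dropped in with something like:

mkdir -p ~/rtorrent/autodownload
cp *7z.torrent ~/rtorrent/autodownload/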

Performance

Adding several thousand torrents at once is likely to "freeze" the rtorrent interface for a while and make it work at 100 % CPU for several minutes, but it usually recovers eventually. At startup, "loading" all the previously added torrents may take about a minute for every couple of thousand torrents.

As of rTorrent 0.9.2/0.13.2, webseeds are not supported, so it will just leech any added torrent and keep a 0-byte file open for each.

[Throttle off/ 20 KB] [Rate   5.3/  5.4 KB] [Port: 6922]                         [U 0/0] [D 0/14] [H 0/5000] [S 1615/1617/15000] [F 9435/50000]

With 3 trackers per torrent (including a broken one) and 10k torrents, quite a few connections are generated: about 30k at startup and around 10k most of the time.

$ lsof -i -c rtorrent | wc -l
12864

Most connections (most trackers and DHT) are UDP:

$ lsof -i -c rtorrent | grep -c UDP
15849

Changing the IP of broken trackers in /etc/hosts or reducing the curl timeout might help. By default rtorrent doesn't time out connections aggressively (perhaps because many private trackers are quite slow).
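
For instance, a broken tracker's hostname (tracker.example.org is just a placeholder here) can be pointed at localhost in /etc/hosts so that connections to it fail quickly:

127.0.0.1 tracker.example.org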

When "idling" as above with around 10k torrents, rtorrent uses about 1 GB RAM (700 MB RES) and about 15 CPU-minutes per hour on a Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz.

When adding all the *xml*torrent files (about 34k), with 3 trackers each, rtorrent consumes about 1.8 GB RAM (1.5 GB RES) and seems to spend 100 % CPU sending announcements to trackers, without actually succeeding.

[Throttle off/ 20 KB] [Rate   0.0/  0.0 KB] [Port: 6900]                        [U 0/0] [D 0/14] [H 0/5000] [S 1305/1307/15000] [F 33681/50000]

DHT

DHT, unlike trackers, requires rtorrent to be connectable (public IP or port mapping, port open in the firewall).

To check that DHT is working, look at the tracker.log file. If it is not, DHT may need to be bootstrapped: press ctrl-x, type dht.add_node=dht.transmissionbt.com and press enter.

With the tested version of rtorrent, however, having thousands of torrents in DHT is likely to result in segmentation faults: https://github.com/rakshasa/rtorrent/wiki/Using-DHT#segmentation-faults

To verify that DHT is working and rtorrent can be reached to fetch metadata, add the info_hash in your torrent client at home and see if you get the torrent name etc. (In Transmission: ctrl-U, paste the hash, press enter.)
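
The same check can be done from a shell with transmission-remote, assuming a local Transmission daemon is running; <info_hash> is a placeholder for the actual hash:

transmission-remote -a "magnet:?xt=urn:btih:<info_hash>"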

Configuration

In /etc/security/limits.conf (Debian), have something like

torrent-user-name soft nofile 50000
torrent-user-name hard nofile 100000
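
The limits are applied at login, so after logging in again they can be verified with:

$ ulimit -Sn
$ ulimit -Hn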

The ~/.rtorrent.rc can be something like:

directory = ~/rtorrent
session = ~/.rtorrent.session/
dht = auto
# We don't actually want to fill our disk
throttle.global_down.max_rate.set_kb = 20

schedule = watch_dir, 20, 10, "load.start=~/rtorrent/autodownload/*.torrent"
network.max_open_files.set = 50000
network.max_open_sockets.set = 15000
network.http.max_open.set = 5000
# No point waiting multiple seconds for DNS
network.http.dns_cache_timeout.set = 2

log.open_file = "rtorrent.log", "/var/log/rtorrent/rtorrent.log"
log.open_file = "tracker.log", "/var/log/rtorrent/tracker.log"
log.add_output = "info", "rtorrent.log"
# These are very spammy; useful to see every single connection to trackers with tail -F /var/log/rtorrent/tracker.log
#log.add_output = "dht_debug", "tracker.log"
#log.add_output = "tracker_info", "tracker.log"

Deluge alternative

Deluge is quite easy to use from the command line (see some advice) and probably harder to crash: it should be OK at least to seed the 7z torrents, which number about 3 thousand, though it struggles a bit. It keeps fewer connections open and manages to publish torrents via DHT without a public IP or port forwarding.

sudo apt install deluge-console deluged
# Make sure the torrent/download dirs and limits in ~/.config/deluge/core.conf make sense, e.g. don't use NFS
deluged
screen -d -m deluge-console
wget -r -np -nH -A torrent https://tools.wmflabs.org/dump-torrents/
cd dump-torrents/
for torrent in $(find * -name "*7z.torrent") ; do
    DIR=$(dirname "$torrent")
    deluge-console "add -p /public/dumps/public/$DIR $torrent"
    sleep 10s
done

Note that even just adding torrents consumes quite a bit of I/O (probably from the ~/.config/deluge/state files, which are accessed frequently): make sure your config directory is on a fast mount.
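
The per-thread figures below come from iotop; with iotop installed, similar output can be obtained with something like:

sudo iotop -o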

  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 9451 be/4 nemobis     5.96 M/s    0.00 B/s  0.00 %  9.50 % python /usr/bin/deluged
 9450 be/4 nemobis     0.00 B/s  890.21 K/s  0.00 %  0.23 % python /usr/bin/deluged

The daemon also tends to become unresponsive after a few hundred commands: just kill it nicely and restart it, then resume your deluge-console or commands. When restarting the daemon with thousands of torrents, over an hour may be needed to fully resume.
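
"Killing it nicely" can be a plain SIGTERM, which gives deluged a chance to save its state before it is started again, for instance:

pkill -TERM deluged
# wait for the process to exit, then start it again
deluged
screen -d -m deluge-console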

The daemon sometimes opens far fewer connections than expected and may need to retry the announcements to trackers a few times. With about 8000 torrents, deluged may consume 150 % CPU just for "idling"; in such a case, reducing max_active_seeding and rotating torrents more quickly in the queue ([1]) may counterintuitively increase the number of announcements sent.
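
Queue settings such as max_active_seeding can also be changed from deluge-console; the exact syntax may vary between Deluge versions, and the value here is just an example:

deluge-console "config --set max_active_seeding 50"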

To see if there is any problem, you don't even need to fire up an interface; you can just query Deluge with commands like

deluge-console "info -s Error"

Some torrents may be stuck in an error state just because of a failure to check the local data, so we can recheck them all with a command like

for torrent in $(deluge-console "info -s Error" | grep -B 5 "Progress: 0.00" | grep ID | sed "s/ID: //g") ; do
    deluge-console "recheck $torrent"
    sleep 1s
done

Fancier stuff is possible with Deluge RPC etc.