
Talk:Mirroring Wikimedia project XML dumps

From Meta, a Wikimedia project coordination wiki

Email template


This is a template for emailing organizations that may be interested in mirroring the dumps.

Subject: Request for a mirror of the Wikimedia public datasets

Dear Sir/Madam,

I am <YOUR NAME>, a volunteer contributor to Wikimedia Foundation projects. I recently visited your website, and I hope that your organization might be willing to help by mirroring the Wikimedia public datasets on your site so that they can benefit researchers.

The Wikimedia public datasets contain database dumps of many Wikimedia wikis, such as the English Wikipedia. These dumps are freely available for download, and researchers use them to carry out research projects, often in ways that benefit the open source community and the many volunteer editors who help out on Wikipedia and its sister projects.

I sincerely hope that your organization can support free knowledge and benefit everyone who uses the dumps. If you would like more details, please see:

https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps

That page provides more details on what is being mirrored and how much space is needed. If you require more assistance, please feel free to contact me or the official liaison, Ariel T. Glenn, at ariel@wikimedia.org.

I look forward to a favorable reply. Thank you, and have a nice day!

Requirements


Availability


What level of availability is required for a mirror: is low availability acceptable, such as 85% uptime?

Organization acceptances and rejections


Accepted organizations

Ongoing
  • Archive.org (through third-party volunteers)
Expressed interest

Rejected organizations


Ideas


Amazon


This is already supposed to be happening ;). We also shouldn't forget about all the media hosted at Commons; they seem rather vulnerable to software or human error. Eug 15:39, 8 October 2011 (UTC)

They now have a cheap option ("Amazon Glacier") designed for backups; it costs $0.01/GB. If they make the base data a Public Data Set (free for Wikimedia, I suppose), Glacier can be used for incrementals. They also have a portable hard-disk import option for large amounts of data.
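
For illustration, a minimal sketch of what pushing a dump directory into Glacier-class storage could look like with today's AWS CLI; the bucket name is hypothetical and GLACIER is only one of several storage classes:

  # Hypothetical bucket; assumes AWS credentials are already configured.
  aws s3 sync /data/dumps/enwiki/latest \
      s3://wikimedia-dumps-mirror/enwiki/latest \
      --storage-class GLACIER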

Netfirms


I have hosting on Netfirms for $10/year with unlimited storage and 100,000 GB/month bandwidth. Could Wikimedia try hosting on a commercial provider? — The preceding unsigned comment was added by 69.249.211.198 (talk) 15:04, 6 August 2011 (UTC)

Generally, you get what you pay for with cheap web hosting. I'm not saying it's a bad idea, but if we start storing terabytes of data and consuming terabytes of traffic, I'd bet that some sort of "acceptable use policy" will present itself. Maybe for some of the smaller projects, this could be a good idea. LobStoR 15:20, 6 August 2011 (UTC)Reply

Library of Congress


Library of Congress is going to save every public tweet. Why don't they save a copy of Wikipedia? emijrp (talk) 16:47, 10 September 2010 (UTC)Reply

iBiblio


I have contacted iBiblio about hosting a copy of the latest dumps, to work as a mirror of download.wikimedia.org. No response yet. emijrp (talk) 13:12, 15 November 2010 (UTC)

Their response: "Unfortunately, we do not have the resources to provide a mirror of wikipedia. Best of luck!" See this message in the Xmldatadumps-l mailing list.

archive.org


The Internet Archive is a 501(c)(3) non-profit that was founded to build an Internet library (from http://www.archive.org/about/about.php). Phauly 11:20, 16 November 2010 (UTC)Reply

BitTorrent


Are torrents useful? Does someone currently have enough disk space and willingness to actually seed the existing ones? I don't like torrents for this because:

  • torrents are useful to save bandwidth, which is not our problem,
  • it's impossible to find seeders for every dump,
  • even if you download a dump, have a torrent client, and have some bandwidth to share it, I'm not sure you'll want to keep the compressed dumps in addition to the uncompressed ones.

--Nemo 12:44, 16 November 2010 (UTC)Reply

I'm not sure if torrents are a good solution for this problem. Torrents depend on how many people seed them. These are _huge_ files. Also, the dumps change quickly (OK, the English Wikipedia dump is from January 2010, but other Wikipedias such as German or French get new dumps every month or so). I think we could release a compilation torrent every year containing all the 7z dumps (about 100 GB); perhaps with the future X anniversary of Wikipedia it could be a success. Emijrp 11:32, 29 November 2010 (UTC)
Torrents can use Wikimedia's http server as a "web seed" (example torrent with web seed) so at the very least they'll stay seeded for as long as Wikimedia keeps a particular dump on the site... but could potentially stay seeded for much longer. Torrents definitely won't make anything *worse*... LobStoR 18:10, 21 December 2010 (UTC)Reply
Torrents seem like a good idea to me. Rjwilmsi 14:58, 23 December 2010 (UTC)Reply
I have created a Wikipedia Torrent so that other people can download the dump without wasting Wikipedia's bandwidth.
I will seed this for a while, but I am really hoping that Wikipedia could publish an official torrent itself every year containing validated articles (checked for vandalism and quality), and then seed that torrent indefinitely. This would reduce strain on the servers and make the process of obtaining a database dump much simpler and easier. It would also serve as a snapshot in time, so that users could browse the 2010 or 2011 Wikipedia.
Torrent Link — The preceding unsigned comment was added by 71.194.190.179 (talk) 01:52, 13 March 2011 (UTC)Reply
This talk page is not really the most appropriate place to list out all of our wiki-dump torrents. We don't need Wikimedia to create torrents for us, anyway. See Data dumps#What about bittorrent? for a more complete list of existing enwiki (and more) data dumps - the torrent you just mentioned has already been listed on here since January 2011 (again, see Data dumps). Perhaps we should move that list from "Data dumps" and create an actual article which is specifically for listing wiki-dump torrents. LobStoR 23:10, 14 March 2011 (UTC)Reply
I was mistaken, your torrent is different from our existing torrent for 2011-01-15. Our currently-existing torrent uses Wikimedia's http servers as a web seed, too, for an accelerated download. Also,
  1. Wikimedia cannot reasonably "validate" these dumps (checked for vandalism and "quality") in a large-scale fashion.
  2. Wikimedia is already helping seed unofficial torrents, via their HTTP web servers (web seed).
  3. Wikimedia is the organization that runs the website Wikipedia. Just to clarify terminology.
Just a few notes in response to your post. Please feel free to contribute to our listing of torrents, and help seed our existing torrents :-) LobStoR 23:36, 14 March 2011 (UTC)Reply
Is it possible to use a service like BitTorrent Sync? This would allow torrent-like behaviour and setting the share to read-only.
This would cut strain on the main server and spread the sharing load to everyone sharing the dump.

To pick this discussion up, now that 9 years have passed:

  • The BitTorrent v2 specification has now been implemented in the libtorrent 2.0 library, which is used by several of the main torrent clients. One of the features of BitTorrent v2 is torrent mutability. This means that a torrent publisher (say, Wikimedia) could publish a torrent of the entirety of Wikipedia, and then publish updates to that same torrent on a daily basis. Anyone downloading and subsequently seeding the torrent would always be seeding the most up-to-date version without having to fetch a new torrent, and this is possible with software that is already in use in the wild.
  • I think that one of the main problems in the past with distributing Wikipedia over BitTorrent has been that there is not a big desire on the part of a downloader to seed the torrent that is inevitably outdated by the time they get it. There are lots of people who have the hard drive space and the bandwidth (see https://www.reddit.com/r/DataHoarder/ for examples of the likely hundreds of people who would get in on this), and would love to be a part of such an effort--but only if they knew that what they were seeding was current. And no one wants to start over on a multi-TB file every time someone updates Wikipedia.
  • I'm not sure if this kind of torrent could be started by anyone other than Wikimedia. It would have to be updated daily, and it would have to be signed by someone trusted by the community (so that updates are accepted as they are posted to the torrent). But if Wikimedia could set it up, this could easily become the de facto way of obtaining not only Wikipedia, but an always up-to-date Wikipedia.

--Fivestones (talk) 02:11, 21 October 2020 (UTC)Reply


BitTorrent on Burnbit


I've been experimenting with creating .torrents of the database dump files using an automated web service called Burnbit. Please see the user sandbox I've created for testing, at User:LobStoR/data dump table/enwiki-20110115. Previously, I had been downloading and manually creating the web-seeded .torrents listed at data dump torrents, but this service simplifies the process to a single click.

Shortcomings:

  • When a dump is recreated (which happens occasionally), it is difficult to remove and recreate the .torrent on Burnbit (must "Report" a broken link and wait for Burnbit to respond)
  • Extremely large/small files cannot be used

I think this can encourage users to help create and seed torrents, by making it easy for everyone. Reference links:

I am hoping to talk to Ariel about possibly adding something like this as "official" Wikimedia torrents (since the md5sums are displayed on Burnbit). Burnbit is great for this because additional HTTP web seeds can also be added later. Please provide feedback here, and feel free to make changes to the sandbox if you see any problems or improvements. LobStoR 13:50, 26 June 2011 (UTC)
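
For anyone curious what the manual web-seeded approach looks like, here is a minimal sketch using mktorrent; the tracker URL and file names are only examples, not an official recommendation:

  # Build a torrent whose web seed points at the file on dumps.wikimedia.org,
  # so downloads fall back to HTTP whenever no peers are available.
  mktorrent \
    -a udp://tracker.opentrackr.org:1337/announce \
    -w https://dumps.wikimedia.org/enwiki/20110115/enwiki-20110115-pages-articles.xml.bz2 \
    -o enwiki-20110115-pages-articles.xml.bz2.torrent \
    enwiki-20110115-pages-articles.xml.bz2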

RedIRIS


en:RedIRIS http://www.rediris.es/ Emijrp 11:33, 16 November 2010 (UTC)Reply

We already contacted them last year on this topic, so we could help ask them again. --GlimmerPhoenix 23:13, 17 December 2010 (UTC)

Academic computer network organizations


en:Category:Academic computer network organizations. Emijrp 11:34, 16 November 2010 (UTC)Reply

The Swedish University Network (SUNET), with its extensive file archive, was kindly asked to host on 14 September 2012.

Datto


Datto Inc. graciously offered to host a mirror, but seems to have disappeared. On 2 January, they listed themselves with an ETA of 8 days, to go live on 10 January; it is now 26 January (24 days elapsed), and I don't see anything listed at the link they provided (wikipedia.dattobackup.com). Would Austin McChord or anyone from Datto please drop a note here with some sort of status update? Thanks! LobStoR 23:24, 26 January 2011 (UTC)

There was a brief discussion on wikitech-l but it hasn't gotten anywhere since. You might try responding to that and asking for a status report. Cap'n Refsmmat 04:19, 28 January 2011 (UTC)Reply
I removed this entry from the list, since there has been no further action. LobStoR 12:45, 14 May 2011 (UTC)Reply

C3SL


I've just sent a message to C3SL, the Brazilian mirror for SourceForge. Cross your fingers! Lugusto 14:42, 3 June 2011 (UTC)

And we accepted. We're waiting for instructions. Contact carlos@fisica.ufpr.br — The preceding unsigned comment was added by 200.17.209.129 (talk) 00:36, 12 June 2011 (UTC)Reply
This mirror is now live. We might have a few glitches as we try to coordinate the exact current contents to be mirrored. :-) -- ArielGlenn 17:52, 13 October 2011 (UTC)Reply
The URL is wikipedia.c3sl.ufpr.br. Out of curiosity, how does C3SL get the files? Are they hashed during and/or after transfer to verify accuracy? LobStoR 19:38, 13 October 2011 (UTC)Reply
They get the files via rsync directly from http://dumps.wikimedia.org. --Hydriz (talk) 14:44, 8 May 2012 (UTC)
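
For other prospective mirrors, a rough sketch of what such an rsync pull plus checksum verification might look like; the rsync module path and checksum file name are assumptions and should be confirmed with the dumps operators:

  # Pull (or refresh) a copy of the dumps; --delete drops files removed upstream.
  # "dumps" is a placeholder module name, not necessarily the real one.
  rsync -av --delete dumps.wikimedia.org::dumps/ /srv/mirror/dumps/

  # Verify one dump directory against its published checksums after transfer.
  cd /srv/mirror/dumps/enwiki/20120501/
  md5sum -c enwiki-20120501-md5sums.txt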

I am really happy to see that over a decade later this mirror still works and it is updated! https://wikipedia.c3sl.ufpr.br/ Emijrp (talk) 19:12, 12 September 2024 (UTC)Reply

AARNet


I've sent an email to AARNet, a mirror that was set up in 1998. I can't guarantee that they will agree to host the XML dump content (since they focus only on open source software), but cross your fingers! --Hydriz 10:55, 24 November 2011 (UTC)

And they rejected, sigh. --Hydriz 03:54, 30 November 2011 (UTC)Reply

Sent another email to AARNET, let's hope for the best. Sha-256 (talk) 06:06, 9 January 2013 (UTC)

"I remember the last request for this, and unfortunately our reasoning this time will be the same. The archive is too big for the present available capacity, and we’ve had no requests from any researchers (identifying themselves as such) for the data to be mirrored. We may revisit the decision later this year once a new iteration of Mirror is deployed, but until then, sorry, we won’t mirror it." ,not happening it seems Sha-256 (talk) 06:57, 9 January 2013 (UTC)Reply
I can certainly gather the names of some Australian researchers who would be interested in this, but the size would make it a better target for an RDSI node than for AARNET. Researchers probably want more than just a mirrored dump; they would want it extracted and pre-processed in a number of ways for convenience in mining it. Most researchers who work with Wikipedia dumps have to do extensive preprocessing, so the desire to do it once and share is definitely there. I am in conversation with an RDSI node and the size doesn't seem to faze them, but we would need folks to volunteer to help with the preprocessing. Kerry Raymond (talk) 20:59, 10 January 2013 (UTC)


ipfs


Would it make sense to just mirror content into IPFS as it's created? There's a GitHub project to mirror ZIM files into it, but maybe we could just mirror the full data and incrementally write out all new versions? — The preceding unsigned comment was added by an unspecified user

This has been discussed several times, but that's only about the parsed HTML pages, isn't it? --Nemo 12:08, 26 April 2018 (UTC)
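
A rough sketch of what the basic workflow could look like with the IPFS command-line client (kubo); the paths are examples, and re-publishing an IPNS name is just one way to point users at the latest snapshot:

  # Add a dump directory to the local IPFS node; this prints a root CID.
  ipfs add -r /srv/mirror/dumps/enwiki/20180401/

  # Re-point a stable IPNS name at the newest root CID after each dump run.
  ipfs name publish /ipfs/<root-CID>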

Incremental updates


Just spitballing here: would it make sense for en.wiki to have yearly full dumps, and more frequent incremental updates as compressed diffs? This might be a solution in search of a problem, but it would make it much easier to keep an up-to-date dump (if people want that, provided they keep the compressed dump and successive dumps are fairly similar; I have no idea how well either of those assumptions holds up). 80.56.9.191 21:57, 15 December 2010 (UTC)

I'm not so technical, but as I understand it dumps are already "incremental", i.e. the previous dump is "prefetched" and used to build the next one, so your proposal wouldn't help to reduce times, but perhaps you were talking about bandwidth (you have to download less), and the answer here is that bandwidth is not a great issue AFAIK. --Nemo 11:10, 16 December 2010 (UTC)Reply
An incremental compressed format makes a lot of sense to me. SJ talk  23:45, 28 April 2012 (UTC)Reply
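
As a sketch of what compressed diffs could look like with an off-the-shelf tool: the file names below are hypothetical, and the deltas would have to be computed against uncompressed XML to be of any use:

  # Publisher side: encode the difference between two monthly dumps.
  xdelta3 -e -s enwiki-20101201-pages-articles.xml \
                enwiki-20110101-pages-articles.xml \
                enwiki-201012-to-201101.xdelta

  # Downloader side: apply the delta to the old dump to obtain the new one.
  xdelta3 -d -s enwiki-20101201-pages-articles.xml \
                enwiki-201012-to-201101.xdelta \
                enwiki-20110101-pages-articles.xml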

Commons dumps


Can you also set up a page about mirroring Wikimedia Commons binary dumps? That's what I am most interested in. The Internet Archive (talk to raj) is quite willing to host such a dump, along with the ones listed here. SJ talk  23:45, 28 April 2012 (UTC)Reply

By binary dumps, do you mean the textual dumps or the image dumps of Wikimedia Commons? --Hydriz (talk) 14:45, 8 May 2012 (UTC)
The image dumps! –SJ talk 

WickedWay


Hello,

We want to host a mirror, but I'm wondering how best to do it. We currently host a mirror that rsyncs with the masters of CentOS, Ubuntu and some others, but is the only way to use rsync now to rsync from another mirror?

And secondly, how often do we need to update?

Best, Huib

Our mirror server is located in Dronten, the Netherlands, and has a 1 Gbit uplink. We have 50 TB of storage with approximately 20 TB free. Huib talk Abigor 12:58, 21 May 2012 (UTC)
For help with setting up a mirror, get in touch with Ariel. See Mirroring Wikimedia project XML dumps#Who can we contact for hosting a mirror of the XML dumps? for contact info. 64.40.57.98 06:22, 22 May 2012 (UTC)Reply

all Wikipedia on torrent


Is it possible to make a torrent with all of Wikipedia (images and articles)? Maybe in multiple torrent parts.

How many TB ? --CortexA9 (talk) 15:42, 1 November 2012 (UTC)Reply

see: http://xowa.org/image_dbs.html

Masaryk University mirror


Fwiw, it looks like the Masaryk University mirror (http://ftp.fi.muni.cz/pub/wikimedia/): 1. Stopped pulling updates in November 2012, and 2. Is a partial mirror, excluding the 'enwiki' dumps. --Delirium (talk) 15:52, 28 July 2013 (UTC)Reply

Volunteer for 2 TB


Hi,

I'm an administrator of the French digital library en:Les Classiques des sciences sociales, based in Quebec, Canada, and we want to volunteer to help mirror Wikimedia projects. We have about 2 TB of space for this on our server. We wanted to know if this would be useful and, if so, how much upload/download traffic it would represent. I think there is about one dump per month, so I suppose it means at least 2 TB of upload per month. --Simon Villeneuve 21:45, 26 August 2015 (UTC)

  • If you want a lot of traffic with little disk space usage, you should mirror http://download.kiwix.org/ (the load is adjustable, just tell the Kiwix maintainer how much you want).
  • In 2 TB it's hard to include any meaningful subset of the XML dumps, but you could probably fit all the *-pages-meta-current.xml.bz2 files or perhaps even all the *pages-meta-history*7z files (a possible rsync filter for this is sketched after this list): these would probably be mostly unused, but some users may happen to be closer to your server and prefer it to the main one.
  • If you want something more stable that you don't need to update and babysit yourself, you can just seed the media tarballs; nobody else is doing that, so the contribution would be useful. --Nemo 12:52, 28 August 2015 (UTC)Reply
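
The rsync filter mentioned above might look roughly like this; the module path is a placeholder that would need to be confirmed with the dumps operators:

  # Mirror only the pages-meta-current files, skipping everything else.
  rsync -av --delete \
    --include='*/' \
    --include='*-pages-meta-current*.bz2' \
    --exclude='*' \
    dumps.wikimedia.org::dumps/ /srv/mirror/dumps/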
Ok, thank you very much for your answer. I'll see what I can do. Simon Villeneuve 19:46, 18 September 2015 (UTC)

Sad state of mirrors


It seems that dumps.wikimedia.org/mirrors.html and this page are not kept up to date and in sync with each other.

  • Sweden: has recent dumps, but no 'latest' link
  • United Kingdom: last dump from March 2017, HTTPS certificate expired
  • Brazil: FTP dead, last dump from July 2017
  • Czech Republic: no dumps
  • United States (Your.org): seems OK.

Why encourage people to use mirrors when that will inevitably lead to disappointment? --Bdijkstra (talk) 10:47, 8 August 2017 (UTC)Reply

I've never had any dissatisfaction with mirrors. I regularly use your.org and umu.se, they provide a service orders of magnitude better than the "central" WMF server. I agree some of the others are less useful. --Nemo 12:24, 3 September 2017 (UTC)Reply

geo_tags sql dump


I'm not sure this is the right place to talk about it, but the geo_tags dump does not contain the titles of the pages as far as I can see, and I think this is a basic requirement. For example, the files at https://dumps.wikimedia.org/hewiki/latest/ have a gt_name column, but it is empty for most entries. I basically want to use a dump to better geo-search wiki entries on my site that have a location. Not sure how to contact someone who can help me with it...
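
One possible workaround, sketched under the assumption of a local MySQL/MariaDB database named hewiki and the usual dump file names: import the geo_tags and page SQL dumps and join them on the page ID to recover titles for the coordinates.

  # Import the two SQL dumps (they are distributed gzip-compressed).
  zcat hewiki-latest-geo_tags.sql.gz | mysql hewiki
  zcat hewiki-latest-page.sql.gz | mysql hewiki

  # Join coordinates to page titles via the page ID.
  mysql hewiki -e "SELECT page_title, gt_lat, gt_lon FROM geo_tags JOIN page ON page_id = gt_page_id WHERE gt_primary = 1;"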

Masaryk Uni (CZ) still listed on the public mirror page


While it's not in the list here (nor does it work anymore), it should be removed from there, too. And this makes me feel like a complete crosscheck would be prudent. --grin 12:25, 20 November 2017 (UTC)

Re-examining Compression


More than bandwidth, storage space seems to be the largest obstacle when looking for mirrors.

Currently, gzip is employed in compression of the XML dumps. Perhaps we should look at using xz compression instead.

I've run a simple experiment to show the potential advantages and disadvantages of both compression algorithms, at various levels. Results:

Test Equipment and Parameters


The file used was enwiki-20180801-abstract.xml (available in the dumps). Its original size is 5.2 GB, which is reduced to 683 MB using the current compression method.

Server used was:

  • Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz
  • SATA 7200rpm HDD (6gbps)
  • 32G RAM
  • (currently a production server)

I suspect that Wikimedia has much higher processing capacity.
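
For anyone who wants to reproduce or extend the comparison, a rough sketch of the kind of loop that could be used; the file name is the one above, but the exact invocation (GNU time, writing each result to its own file) is my own assumption:

  FILE=enwiki-20180801-abstract.xml
  for c in "gzip -6" "gzip -9" "xz -1" "xz -2" "xz -3" "xz -6" "xz -9"; do
    out="$FILE.$(echo "$c" | tr ' ' '_')"
    # Time each compressor writing to its own output file, then report its size.
    /usr/bin/time -f "$c: %e s" sh -c "$c -c $FILE > $out"
    du -m "$out"
  done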

Compression Test Results

Compression Type Compression Level Original Size Compressed Size (M) Compression Time (s) Δ Size (M) Δ Time (s)
gzip current (6) 5.2G 683 91 0 0
gzip 9 5.2G 674 167 -9 76
xz 1 5.2G 580 42 -103 -49
xz 2 5.2G 552 63 -131 -28
xz 3 5.2G 536 101 -147 10
xz 4 5.2G 543 125 -140 34
xz 5 5.2G 521 183 -162 92
xz 6 5.2G 478 302 -205 211
xz 9 5.2G 456 415 -227 324

As seen in the table above, xz-2 is not only ~31% faster, but it also uses ~19% less space.

If we extrapolate that to the entire 17T dump, we could potentially see a space savings of ~3.25T, or an archive size of ~13.75T.

If we look at xz-3, we could consider accepting an ~11% performance hit in exchange for ~22% space savings, or a new archive size of ~13.5T.

Certainly, no performance hit is necessary at all if we choose a level that achieves a higher compression ratio in less time (such as xz-2).

Both tools are multi-platform, free, and open source. Most big distributions provide pre-compiled binaries allowing simple installation, e.g. apt install xz-utils on Debian-based systems.

Other Projects Using xz


Projects currently using xz:

  • Kernel.org
  • Several Linux distributions (Debian is just one example)
  • GNU archives
  • CPAN (perl) archives
  • imagemagick
  • GIMP
  • LibreOffice (The Document Foundation)

xz May be Greener


One could also make a "green" argument for higher compression. A study (https://aceee.org/files/proceedings/2012/data/papers/0193-000409.pdf) found that each GB of internet traffic uses approximately 5 kWh of electricity, or US$0.51. Whether those figures are accurate is irrelevant: 22% less transferred data is 22% less energy, whatever its cost or power consumption.

Compression occurs once, but the savings are realized every time the file is transferred thereafter.

Ease of Implementation


The scripts which currently compress the dumps could be very easily reconfigured to use xz.

Discussion


In the game of encouraging mirroring, I believe this could be a huge help.

It could reduce the "last-2-good" set from 4.5T to ~3.6T at xz-3. It saves time, and it saves bandwidth at the same time.

I see clear advantages without any disadvantages; however, I may be missing something.

Let's discuss.  :)

-Adam — The preceding unsigned comment was added by FreeMirrorOrg (talk) 18:39, 25 October 2018 (UTC)Reply

Another way to potentially reduce the compressed size is by adjusting the order of the pages, as normally the size of the sliding window is a bottleneck. Also, xz does have disadvantages, but I don't know the requirements of the Dumps Team. --bdijkstra (talk) 07:33, 27 October 2018 (UTC)Reply

Myanmar


Nothing happens. Mi wai lay (talk) 20:09, 13 February 2022 (UTC)

old space estimates


I suggest updating the space requirements numbers on this page. So9q (talk) 10:19, 11 September 2024 (UTC)Reply

Wikimedia Commons tarballs

Image: File:Garzweiler Tagebau-1230.jpg. The original file is 4,876 × 3,251 pixels (5.28 MB), but the preview is only 800 × 533 pixels (169 KB), over 30 times smaller.

I want to open again the discussion about generating a backup of Wikimedia Commons. I remember a tarball was generated around 2012-2013. Currently, the size of all images are over 379 TB. The "problem" has worsened. Could be possible to generate a backup of the previews only (typically ~800px)? On the right there is an example. The current 379 TB could be reduced to 10-20TB and split in yearly packs (by upload date). What do you think? Thanks. Emijrp (talk) 19:27, 12 September 2024 (UTC)Reply