Wikimedia CH/Grant apply/Scaling Wikidata through Benchmarking, Part 2


Infodata

  • Name of the project: Scaling Wikidata through Benchmarking, Part 2
  • Amount requested: 10,000 CHF
  • Type of grantee: Group
  • Name of the contact: Samuel Klein
  • Contact: meta.sj(_AT_)gmail.com and PFPS
In case of questions, please write to grant(_AT_)wikimedia.ch

The problem and the context


The first Scaling Wikidata project showed that QLever executes queries 3-7x faster than Blazegraph, with far fewer timeouts, and is also faster than Virtuoso and MillenniumDB in most contexts. It was also dramatically faster at loading a Wikidata dump, resolving one potential critical failure. (link to Phab bug) We want to extend that work to estimate how well a new backend could scale and to characterize the remaining obstacles to using it as a replacement backend.


What is the problem you're trying to solve?


We want to estimate the capacity of alternative Wikidata backends to scale to 2x or more the current size of Wikidata, and to 10x or more the current size of the scholarly graph that was recently split out. (link)

We also want to test cost-effective options for scaling up our test environment, or for running brief, targeted larger-scale tests, so that the results better approximate a production environment rather than a desktop one.

What is your solution to this problem (please explain the context and the solution)?


See below, under goals.

Project goals

Doubling Wikidata

Doubling Wikidata means creating a version of Wikidata that performs similarly to how Wikidata is expected to perform once it doubles in size. The doubling is done synthetically, so the result will not be exactly like a larger Wikidata; it is also not known how Wikidata will actually grow. The doubling can be done in multiple ways, producing versions that are harder or easier to query and that should bracket the difficulty of querying a larger Wikidata.

General doubling

One way to create a double Wikidata is to modify the copy so that the original and the copy share no identifiers, neither items nor properties. Nearly all queries against this double Wikidata will access only the original or only the copy, and will almost certainly be only slightly slower than now.

Another way to create a double Wikidata is by modifying the copy so that the original and the copy share no identifiers for items - identifiers for properties are the same in both the original and the copy. Queries that are not anchored at an item will then get information from both copies, and be slower than now. Most queries, however, are anchored at an item and will still see information from only one of the copies.

A third way to create a double Wikidata is by modifying the copy so that the original and the copy share no identifiers for properties - identifiers for items are the same in both the original and the copy. Most queries, however, use both item and property identifiers and most of them will still see information from only one of the copies.
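As a rough illustration, all three variants can be produced by a renaming pass over the N-Triples dump along the lines of the minimal sketch below. The offsets, the choice to anchor on "/Q…" and "/P…" inside Wikidata IRIs, and the function names are illustrative assumptions rather than existing tooling; a real pass would have to handle statement, reference, and value IRIs carefully (see "Details" below).

 import re
 import sys
 
 # Offsets chosen to lie far above any existing Q-/P-number (illustrative values).
 Q_OFFSET = 10_000_000_000
 P_OFFSET = 1_000_000
 
 # Match Q-/P-numbers only inside IRIs, e.g. .../entity/Q42 or .../prop/direct/P31.
 Q_RE = re.compile(r"/Q(\d+)")
 P_RE = re.compile(r"/P(\d+)")
 
 def rename_line(line, rename_items=True, rename_properties=False):
     """Return the renamed copy of one N-Triples line from the Wikidata dump.
 
     rename_items / rename_properties select the doubling variant:
     both True   -> fully disjoint copy (shares no identifiers),
     items only  -> item-separated copy (shared properties),
     props only  -> property-separated copy (shared items).
     """
     if rename_items:
         line = Q_RE.sub(lambda m: f"/Q{int(m.group(1)) + Q_OFFSET}", line)
     if rename_properties:
         line = P_RE.sub(lambda m: f"/P{int(m.group(1)) + P_OFFSET}", line)
     return line
 
 if __name__ == "__main__":
     # Stream the original dump on stdin and emit only the renamed copy;
     # concatenating the original dump with this output gives the double Wikidata.
     for line in sys.stdin:
         sys.stdout.write(rename_line(line))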

To prevent this effective division, the double Wikidata must not only share some identifiers between its two halves; there must also be links between them. One way to create these links is to start with the item-separated double Wikidata, pick one or more properties, and re-double the links for those properties so that they connect an original item to a copy item, a copy item to an original item, or both.

The ideal properties for this purpose are the ontology properties. Picking instance of (P31) results in all classes ending up with twice the number of instances, so any query that accesses instances of a class becomes harder. Picking subclass of (P279) results in a subclass hierarchy that is twice as large and also much more interconnected, so queries that look for subclasses of a class become harder. Picking both multiplies the effect for queries that access instances of a class.
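A minimal sketch of this re-doubling step, assuming the item-separated double built with the Q_OFFSET convention above and re-doubling only the truthy P31/P279 links (the names and offsets are again illustrative):

 import re
 import sys
 
 Q_OFFSET = 10_000_000_000  # same illustrative offset as in the renaming sketch
 
 # Truthy predicates whose links get re-doubled across the two halves.
 REDOUBLED = ("<http://www.wikidata.org/prop/direct/P31>",
              "<http://www.wikidata.org/prop/direct/P279>")
 
 SUBJ_RE = re.compile(r"^<http://www\.wikidata\.org/entity/Q(\d+)>")
 OBJ_RE = re.compile(r"/Q(\d+)(> \.\s*)$")
 
 def redouble(line):
     """For a truthy P31/P279 triple from the ORIGINAL dump, also emit two
     crossed triples: original item -> copy class and copy item -> original
     class.  Together with the copy's own (renamed) triples, every class then
     has instances/subclasses from both halves."""
     yield line
     if any(pred in line for pred in REDOUBLED):
         yield OBJ_RE.sub(
             lambda m: f"/Q{int(m.group(1)) + Q_OFFSET}{m.group(2)}", line)
         yield SUBJ_RE.sub(
             lambda m: f"<http://www.wikidata.org/entity/Q{int(m.group(1)) + Q_OFFSET}>", line)
 
 if __name__ == "__main__":
     for line in sys.stdin:
         for out in redouble(line):
             sys.stdout.write(out)

Re-doubling further properties would just be a matter of extending the predicate list in such a pass.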

As more and more properties are re-doubled in this manner, the double Wikidata becomes more interconnected, making more and more queries harder. Re-doubling only a few of the large and prominent properties should result in a double Wikidata that is harder to work with than what Wikidata is expected to grow into.

Targeted doubling

It is also possible to target the duplication. A targeted duplication, for example, might duplicate only the instances of a few related classes and the information connected to them, ideally in an area that is expected to grow rapidly. The information duplicated would be only a fraction of the total size of Wikidata, which would allow investigating versions of Wikidata that are only somewhat larger than the current one. It would also allow making multiple copies of the targeted information, perhaps as many as 10 or so, to investigate what could happen as this part of Wikidata becomes larger and larger while the rest of Wikidata remains more or less the same.

Targeted doubling can also take advantage of knowledge of the targeted classes to create doubled instances that are more like what would exist as Wikidata becomes larger. For example, some related classes could be considered complete and not be doubled, or the property values in instances of a class could be doubled or not, depending on the property involved.
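One possible shape for such a targeted duplication is sketched below, assuming a two-pass scan over the truthy dump and an example target class (Q13442814, scholarly article). The class choice, offsets, and function names are illustrative assumptions, not a fixed plan:

 import re
 
 Q_OFFSET = 10_000_000_000
 P31 = "<http://www.wikidata.org/prop/direct/P31>"
 TARGET_CLASSES = {"Q13442814"}  # e.g. scholarly article; purely an example
 SUBJ_RE = re.compile(r"^<http://www\.wikidata\.org/entity/(Q\d+)>")
 Q_RE = re.compile(r"/Q(\d+)")
 
 def collect_targets(dump_path):
     """Pass 1: collect the items that are instances of one of the target classes."""
     targets = set()
     with open(dump_path) as dump:
         for line in dump:
             if P31 in line and any(f"/{cls}>" in line for cls in TARGET_CLASSES):
                 match = SUBJ_RE.match(line)
                 if match:
                     targets.add(match.group(1))
     return targets
 
 def emit_copies(dump_path, targets, n_copies, out):
     """Pass 2: re-emit only the targeted items' lines, n_copies times, each
     copy shifted into its own identifier space.  (A refined version would
     leave class identifiers and 'complete' classes unshifted, and decide per
     property whether its values should be duplicated.)"""
     with open(dump_path) as dump:
         for line in dump:
             match = SUBJ_RE.match(line)
             if match and match.group(1) in targets:
                 for k in range(1, n_copies + 1):
                     out.write(Q_RE.sub(
                         lambda m, k=k: f"/Q{int(m.group(1)) + k * Q_OFFSET}", line))
 
 if __name__ == "__main__":
     import sys
     emit_copies(sys.argv[1], collect_targets(sys.argv[1]), n_copies=10, out=sys.stdout)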

Details

There are complications in creating the double Wikidatas that are not described above. For example, non-truthy statements have a complex structure. The precise way to correctly duplicate this structure needs to be determined.
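As an illustration of that structure, one full statement in the RDF dump is spread over a statement node plus shared reference and value nodes, roughly as in the snippet below (the entity numbers, GUID, and reference hash are purely illustrative). The relevant point for duplication is that the statement-node IRI embeds the item's Q-number and so must be renamed consistently with it, while the hash-named reference/value nodes contain no Q-number and could remain shared between the two halves.

 # Simplified shape of one full (non-truthy) statement, shown as N-Triples in a
 # Python string; the entity numbers, GUID, and hash are invented for illustration.
 FULL_STATEMENT_EXAMPLE = """\
 <http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/P69> <http://www.wikidata.org/entity/statement/Q42-0E9C4724-C954-4698-84A7-5CE0D296A6F2> .
 <http://www.wikidata.org/entity/statement/Q42-0E9C4724-C954-4698-84A7-5CE0D296A6F2> <http://www.wikidata.org/prop/statement/P69> <http://www.wikidata.org/entity/Q691283> .
 <http://www.wikidata.org/entity/statement/Q42-0E9C4724-C954-4698-84A7-5CE0D296A6F2> <http://www.wikidata.org/prop/qualifier/P582> "1974-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
 <http://www.wikidata.org/entity/statement/Q42-0E9C4724-C954-4698-84A7-5CE0D296A6F2> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://www.wikidata.org/reference/9a24f7c0158b3093a46286ce822af245111a8bd7> .
 """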

The computer time taken to create and load a larger Wikidata should not be excessive, at least for fast engines like QLever. Modifying a copy of the Wikidata dump in simple ways would take a few hours. Loading the double Wikidata into QLever should only take slightly more than twice as much time as is required for the current Wikidata. The longest part of the loading process is sorting the indices, which should scale as n log n. As the current Wikidata can be loaded into QLever in under five hours, loading a double Wikidata into QLever should take no more than a day. The amount of main memory and SSD space in the current benchmark machine should be adequate as well.
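A back-of-the-envelope check of that estimate, taking the current dump to be roughly 2x10^10 triples (an assumption) and the measured five-hour QLever load time:

 import math
 
 n = 2.0e10            # approximate number of triples in the current dump (assumption)
 t_single_hours = 5.0  # QLever load time measured in the previous project
 
 # Index sorting scales as n*log(n), so doubling n costs slightly more than 2x.
 factor = (2 * n * math.log(2 * n)) / (n * math.log(n))
 print(f"scaling factor: {factor:.2f}, "
       f"expected load time: about {t_single_hours * factor:.1f} hours")
 # -> scaling factor: 2.06, expected load time: about 10.3 hours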

The situation for MillenniumDB and Virtuoso will likely be similar. Their current load times are longer than QLever's, but they should be able to load a double Wikidata in one to three days. As MillenniumDB and Virtuoso are slower than QLever for querying, it might not be worthwhile to try them unless there is some good reason to also check their performance. Loading multiple versions of the double Wikidata into multiple engines will require extra storage for the engines' indexes, possibly requiring additional disk space.

The situation is likely to be different for Blazegraph, which already struggles to load the current Wikidata; any benchmarking of the double Wikidata would therefore exclude Blazegraph.


Benchmarking New Versions of QLever and MillenniumDB

Both QLever and MillenniumDB are under active development and new versions come out regularly. The benchmarking should be repeated as significant changes are made to the engines.

The benchmarking harness and statistical analysis tools are now capable of redoing the benchmarks with little human intervention. The effort to redo them is mostly in setting up the processes, waiting for them to complete, and analyzing the results.


Benchmarking With Sufficient Memory to Hold All the Data

One reason that engines slow down on complex queries is that they have to load data from secondary storage, possibly over and over again. The previous benchmarking was done on a machine with 192 GB of main memory, not to eliminate this problem but to allow plenty of memory for internal calculations. As a side effect, considerable memory was also available for Linux to use as a disk cache for part, but not all, of the indices and other data needed to answer queries.

If there were enough main memory to hold all the data and also enough for internal calculations, query answering might speed up significantly. Buying hardware for just this purpose is not reasonable until it has been determined that there can be a significant speedup. Instead, a few benchmarking runs could be performed on a cloud machine to see what the speedup is. Cloud machines with the required amount of main memory (about 1 TB) are not cheap to run, but a single run of the benchmarks from the previous project on QLever should cost about $300 on a 1 TB machine.
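That cost figure can be sanity-checked with a simple calculation; the hourly rate and benchmark duration below are assumptions chosen to be in the range of current memory-optimized on-demand instances, not quotes:

 # Rough cost estimate for one benchmark run on a ~1 TB-RAM cloud machine.
 hourly_rate_usd = 13.0   # assumed on-demand price of a ~1 TB memory instance
 load_hours = 5           # QLever load time from the previous project
 benchmark_hours = 18     # assumed time for one full pass over the query benchmark
 print(f"estimated cost per run: ${hourly_rate_usd * (load_hours + benchmark_hours):.0f}")
 # -> estimated cost per run: $299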


Benchmarking Updates

QLever now has an update mechanism, and there is now access to the update stream that feeds the WDQS. Benchmarking QLever on this stream would help determine whether QLever is suitable as a backend for the Wikidata Query Service.

Some code needs to be written to turn the update stream into something suitable for QLever, along with a harness to feed the converted stream into QLever. The behaviour of QLever under the update stream, and under a combined update and query stream, then needs to be analyzed.
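A minimal sketch of what such a harness could look like, assuming the public EventStreams recentchange feed as the source, Special:EntityData for fetching the updated entity, and a QLever endpoint that accepts SPARQL 1.1 Update requests. The endpoint URL is a placeholder, and the DELETE/INSERT conversion is deliberately naive: it ignores the full statement/reference structure, deletions, and batching.

 import json
 import requests
 
 STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"
 ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.nt?flavor=simple"
 QLEVER_UPDATE = "http://localhost:7019/update"   # placeholder for a local QLever instance
 
 def recent_wikidata_items():
     """Yield Q-ids of items edited on wikidatawiki, taken from the public
     server-sent-events recentchange feed (one JSON object per 'data:' line)."""
     with requests.get(STREAM_URL, stream=True) as resp:
         for raw in resp.iter_lines():
             if not raw.startswith(b"data: "):
                 continue
             event = json.loads(raw[len(b"data: "):])
             title = event.get("title") or ""
             if event.get("wiki") == "wikidatawiki" and title.startswith("Q"):
                 yield title
 
 def naive_update(qid):
     """Replace all direct triples of the item with its current truthy export
     (a real converter must handle full statements, references, and deletions)."""
     triples = requests.get(ENTITY_DATA.format(qid=qid)).text
     return (f"DELETE WHERE {{ <http://www.wikidata.org/entity/{qid}> ?p ?o }} ;\n"
             f"INSERT DATA {{\n{triples}\n}}")
 
 if __name__ == "__main__":
     for qid in recent_wikidata_items():
         requests.post(QLEVER_UPDATE, data=naive_update(qid),
                       headers={"Content-Type": "application/sparql-update"})

The same harness, replayed at different rates and combined with the existing query benchmark, would give the update-plus-query measurements described above.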


Project impact


How will you know if you have met your goals?


Do you have any goals or metrics around participation or content?


Project plan


Activities


Budget


       
  • 5,000 CHF – technical development of synthetic datasets for 2x Wikidata and 10x scholarly-graph corpora; installing newer versions of the top alternative backends; evaluation of benchmarks; setting up virtual test environments of different sizes that anyone could use to run such tests, incl. environments with enough memory to hold the full database
  • 2,000 CHF – testing how well each backend works with continuous updates from the stream of Wikidata edits
  • 1,500 CHF – outreach to Wikidata devs and alternate backends to see how this suits their own evaluations
  • 1,500 CHF – overhead + misc. costs, incl. travel to one event to present on this and the previous work

Community engagement
