Grants:Project/Future-proof WDQS
I will make WikiData future-proof and much easier to scale.
Many aspects of WikiData, WikiBase, WDQS and related components can be improved. In the following, I try to give a comprehensive analysis of the current situation, including some elements from the 2030 strategy. Some of the recommendations are inspired by Denny Vrandečić's essay Toward An Abstract Wikipedia.
At its scale, Wikidata has reached the limits of what can be done efficiently, and in a future-proof way, with (legacy?) off-the-shelf software.
Project idea
What is the problem you're trying to solve?
What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.
The following problem statement is split into 3 parts:
- Why and How WDQS does not scale?
- Why and How WikiData does not scale?
- Why and How WikiData is not future-proof?
This section ends with a summary.
WikiDataQueryService does not scale
Quoting Guillaume Lederrey (Operations Engineer, Search Platform, Wikimedia Foundation) in the Wikidata mailing list thread "Scaling Wikidata Query Service":
In an ideal world, WDQS should:
Scaling graph databases is a "known hard problem", and we are reaching a scale where there are no obvious easy solutions to address all the above constraints. At this point, just "throwing hardware at the problem" is not an option anymore. We need to go deeper into the details and potentially make major changes to the current architecture.
I want to add the requirement that it shall be easy for researchers and practitioners to set up their own instance. This would be another form of scaling, a social one, since it makes Wikidata easier to use.
The current solution adopted to support WDQS involves BlazeGraph. BlazeGraph is not really maintained because the developers were hired by Amazon. Wikimedia could just invest more in BlazeGraph maintenance (see the commits of Stas Malyshev, Software Engineer, Wikimedia Foundation, on the BlazeGraph repository). Sharding is not realistic because of the schema of Wikidata, and performance would not be good anyway, so Blazegraph scales only vertically, complemented by replicas (copies). Vertical scaling hits the limitations of the hardware, and eventually of physics: in the foreseeable future, there is only so much one can store inside a single machine.
Here is a breakdown of how the current Blazegraph-based solution tries to scale WDQS:
# | Requirement | Strategy | Limitation |
---|---|---|---|
1 | Scale in terms of data size | vertical scaling: bigger hard disks | Physical, due to available hardware technology. |
2 | Scale in terms of edits | vertical scaling: faster CPU (and larger network bandwidth) | Physical, due to available hardware technology. The limitations of the vertical scaling strategy lead to a "lag" between WikiBase and WDQS, see the following row. |
3 | Lag: have low update latency | vertical scaling: faster CPU (and larger network bandwidth) | Physical, due to available hardware technology. There should be no lag, so this is also a software problem. |
4 | Expose a SPARQL endpoint for queries | translation middleware in front of Blazegraph, see https://github.com/wikimedia/wikidata-query-rdf | Operations are made more difficult because there are a lot of services and moving parts. |
5 | Allow anyone to run any queries on the public WDQS | | |
6 | Provide great query performance | vertical scaling and replicas | Operations are more difficult. |
7 | Provide a high level of availability | replicas | Operations are more difficult. |
8 | Easy to set up and operate | docker-compose or kubernetes | Requires more skills. |
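To make row 4 concrete from a client's point of view, here is a minimal sketch that builds (without sending) an HTTP request against the public endpoint at https://query.wikidata.org/sparql. The helper name `build_wdqs_request`, the user agent string and the example query are assumptions made for this illustration; they are not part of the WDQS code base.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Public WDQS SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: ten items that are an instance of (P31) house cat (Q146).
EXAMPLE_QUERY = """
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q146 .
}
LIMIT 10
"""


def build_wdqs_request(query, endpoint=WDQS_ENDPOINT):
    """Build, but do not send, a GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    headers = {
        # Ask for the standard SPARQL JSON results format.
        "Accept": "application/sparql-results+json",
        # Placeholder user agent for this sketch (hypothetical value).
        "User-Agent": "wdqs-sketch/0.1 (example)",
    }
    return Request(url, headers=headers)


request = build_wdqs_request(EXAMPLE_QUERY)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would actually run the query. Every such request goes through the middleware and then Blazegraph, which is why query performance and availability depend on the replicas listed in rows 6 and 7.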
WikiData does not scale
The previous section describes several reasons why a specific component of the Wikidata infrastructure is not future-proof. WikiDataQueryService relies on vertical scaling, hence on the availability of performant and efficient hardware that is possibly costly. As a consequence of the limitations of Blazegraph, and hence of WDQS, WikiData is difficult to:
- set up and reproduce,
- develop and maintain,
- operate and scale.
Along those three dimensions, a look at the bigger picture, that is the Wikidata project as a whole, reveals a situation that is worse:
# | Topic | Problem | Effect |
---|---|---|---|
0 | setup and reproducibility | Too many independent processes and code bases (microservices) | Fewer contributions |
1 | setup and reproducibility | Full-stack coding environment requires skills with Docker, docker-compose, Kubernetes | Fewer contributions |
2 | setup and reproducibility | Production environment setup requires skills with Kubernetes or Puppet | Fewer contributions |
3 | development and maintenance | MediaWiki: PHP and JavaScript code base with a lot of legacy code | Fewer contributions |
4 | development and maintenance | WikiBase: PHP and JavaScript code base | Fewer contributions |
5 | development and maintenance | Too many programming languages (PHP, JavaScript, Ruby, Lua, Go, Java, sh...) | Fewer contributions |
6 | operate and scale | Too many databases (MySQL, Redis, Blazegraph, Elasticsearch) | Fewer contributions |
7 | operate and scale | Impossible to do time-traveling queries | Fewer contributions |
8 | operate and scale | See section "WikiDataQueryService does not scale" | Fewer contributions |
9 | operate and scale | Edit that spans multiple items | Fewer contributions |
Because WikiData is difficult to scale, Wikimedia fails to fully enable and empower users, according to its mission:
"The mission of the Wikimedia Foundation is to empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally."
https://wikimediafoundation.org/about/mission/
Why and How WikiData is not future-proof?
In the two previous sections, two components were analyzed, shedding light on some existing problems. This section tries to extract from existing publications the problems that Wikidata will need to tackle in the future.
Toward an abstract wikipedia
http://simia.net/download/abstractwikipedia_whitepaper.pdf
Wikimedia movement strategy toward 2030
Strategy/Wikimedia movement/2018-20/Recommendations
Summary
# | Problem | Time scale | Why | Effect |
---|---|---|---|---|
1 | WDQS is not scalable | present | | |
2 | At WikiData scale, no usable versioned triple store | immediate future | | |
3 | WikiData is not scalable | immediate future | | |
4 | Trusted knowledge as a service is difficult | immediate future | | |
5 | No Abstract Wikipedia | future | | |
6 | Earth scale encyclopedic knowledge | future | | |
What is your solution?
For the problem you identified in the previous section, briefly describe how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.
Only the first three problems of the summary table above will be addressed:
- Scalable WikiDataQueryService
- Scalable Versioned Triple Store
- Scalable Wikidata
How to make WikiData scalable?
The summary of the solution is:
- drop legacy software to reduce the operational costs,
- reduce the learning curve to ease the onboarding of new developers,
- scale Wikidata, including SPARQL queries,
- add new features: time-traveling queries and change requests.
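To illustrate what a time-traveling query means in the list above, here is a toy, in-memory sketch of a versioned triple store. It is not the implementation proposed in this grant: the class `VersionedTripleStore`, its method names and the append-only log layout are assumptions chosen for brevity, and a real store would need persistent, indexed storage rather than a linear replay of the log.

```python
class VersionedTripleStore:
    """Toy append-only triple store that remembers every change."""

    def __init__(self):
        # Each log entry is (transaction id, triple, alive flag).
        self._log = []
        self._tx = 0

    def commit(self, add=(), remove=()):
        """Apply several additions and removals as one transaction."""
        self._tx += 1
        for triple in remove:
            self._log.append((self._tx, tuple(triple), False))
        for triple in add:
            self._log.append((self._tx, tuple(triple), True))
        return self._tx

    def triples(self, as_of=None):
        """Return the set of triples visible at transaction `as_of`."""
        if as_of is None:
            as_of = self._tx
        visible = set()
        for tx, triple, alive in self._log:
            if tx > as_of:
                break  # the log is ordered by transaction id
            if alive:
                visible.add(triple)
            else:
                visible.discard(triple)
        return visible

    def match(self, s=None, p=None, o=None, as_of=None):
        """Match a triple pattern; None acts as a wildcard."""
        pattern = (s, p, o)
        return sorted(
            triple
            for triple in self.triples(as_of)
            if all(want is None or want == got
                   for want, got in zip(pattern, triple))
        )


# Illustrative data using Wikidata-style identifiers.
store = VersionedTripleStore()
tx1 = store.commit(add=[("Q42", "P31", "Q5")])
tx2 = store.commit(add=[("Q42", "P106", "Q36180")])
tx3 = store.commit(remove=[("Q42", "P106", "Q36180")])

print(store.match(s="Q42", as_of=tx2))  # state before the removal
print(store.match(s="Q42"))             # latest state
```

Because nothing is ever overwritten, any past state can be queried by passing an older transaction id. The same `commit` call also groups several changes into one transaction, which is one possible way to think about an edit that spans multiple items.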
The following table describes proposed solutions to existing problems in WikiData:
# | Topic | Problem | Solution | Effect |
---|---|---|---|---|
0 | setup and reproducibility | Too many independent processes and code bases (microservices) | | |
1 | setup and reproducibility | Full-stack coding environment requires skills with Docker, docker-compose, Kubernetes | | |
2 | setup and reproducibility | Production environment setup requires skills with Kubernetes or Puppet | | |
3 | development and maintenance | MediaWiki: PHP and JavaScript code base with a lot of legacy code | | |
4 | development and maintenance | WikiBase: PHP and JavaScript code base | | |
5 | development and maintenance | Too many programming languages (PHP, JavaScript, Ruby, Lua, Go, Java, sh...) | | |
6 | operate and scale | Too many databases (MySQL, Redis, Blazegraph, Elasticsearch) | | |
7 | operate and scale | Impossible to do time-traveling queries | | |
8 | operate and scale | See section "WikiDataQueryService does not scale" | | |
9 | operate and scale | Edit that spans multiple items | | |
What are other solutions?
virtuoso-opensource
github: https://github.com/openlink/virtuoso-opensource/
Pros
- Similar existing deployment
- Supported by an experienced company
Cons
- monopoly
- vendor lock-in
- no support for time-traveling queries
- no support for change requests
- not a complete solution
- AS OF YET, no jepsen.io database harness tests?
- MAYBE not complete ACID guarantees?
Property graph databases
See https://github.com/jbmusso/awesome-graph/#awesome-graph
Pros
- MAYBE similar existing deployment but certainly not in the open
- Supported by established companies (neo4j, dgraph, arangodb); JanusGraph is supported by the Linux Foundation.
Cons
- does not map efficiently to RDF triples
- no support for time-traveling queries
- no support for change requests
- not a complete solution
- AS OF YET, no jepsen.io database harness tests (neo4j, dgraph, arangodb)
Other triple stores
github: https://github.com/semantalytics/awesome-semantic-web#databases
Pros
- ?
Cons
- no support for time-traveling queries
- no support for change requests
- not a complete solution
- AS OF YET, no jepsen.io database harness tests?
Project goals
What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.
The goal of the project is to support WikiData growth in terms of:
- code contributions,
- data contributions.
Toward that goal, the project must be:
- easy to set up, reproduce, code and maintain,
- faster, allow time-traveling queries and provide a way to visually edit triples,
- both vertically and horizontally scalable.
From this project will emerge a clear architecture toward a scalable wikidata.
Project impact
How will you know if you have met your goals?
For each of your goals, we’d like you to answer the following questions:
- During your project, what will you do to achieve this goal? (These are your outputs.)
- Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)
For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (e.g. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.
Outputs
- The github repository of the code,
- Two or three contributors to the project,
- Positive benchmark results based on https://iccl.inf.tu-dresden.de/web/Wissensbasierte_Systeme/WikidataSPARQL/en,
- One published paper in WikiJournal about the solution,
- One or two organizations outside wikimedia start using the project.
Outcomes
- More people outside wikimedia use the project to host wikidata or wikidata-like projects
- The current stack / architecture is replaced with the result of this project
- More people contribute to wikidata
- wikidata doubles the number of triples to reach 20 billion
Do you have any goals around participation or content?
Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable. Remember to review the tutorial for tips on how to answer this question.
The project will improve performance and availability of WikiData.
Project plan
Activities
Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow-up with people that are involved with your project?
Quarter | Title | Activity | Guesstimate | Output |
---|---|---|---|---|
1 | Arew | | 1 month | |
1 | Ruse | | 1 month | |
1 | nomunofu 0.2.0 | | 1 month | |
2 | nomunofu 0.3.0 | | 1 month | |
2 | nomunofu 0.4.0 | | 1 month | |
2 | nomunofu 0.5.0 | | 1 month | |
3 | nomunofu 0.6.0 | | 1 month | |
3 | nomunofu 0.7.0 | | 1 month | |
3 | nomunofu 0.8.0 | | 1 month | |
4 | nomunofu 0.9.0 | | 1 month | |
4 | nomunofu 0.9.9 | | 1 month | |
4 | nomunofu 1.0.0 | | 1 month | |
Budget
How will you use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!
Budget will be set when we agree on a plan. The rough estimate is between 2500 and 5500 euros per month depending on applicable taxes, possibly plus the cost of renting hardware to run the benchmarks (see https://phabricator.wikimedia.org/T206636).
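As a rough illustration only, the monthly estimate can be turned into a range for the whole plan. The sketch below assumes that the twelve one-month milestones of the activities table run one after the other and are each funded for a full month, and it leaves out hardware rental; the resulting range is an illustration, not an agreed budget.

```python
# Monthly estimate from the paragraph above, in euros (low, high).
MONTHLY_RATE_EUR = (2500, 5500)

# Assumption: the twelve one-month milestones of the plan run sequentially.
MILESTONE_MONTHS = 12


def total_range(monthly_rate, months):
    """Scale a (low, high) monthly rate to a (low, high) total."""
    low, high = monthly_rate
    return low * months, high * months


low_total, high_total = total_range(MONTHLY_RATE_EUR, MILESTONE_MONTHS)
print(f"{low_total} to {high_total} euros over {MILESTONE_MONTHS} months")
```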
Community engagement
Community input and participation helps make projects successful. How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve during your project?
- I will continue to blog about my project at https://hyper.dev (currently offline; I will probably move my blog to a mailing list at sourcehut) with weekly, bi-weekly and monthly reviews of my progress, and engage with the community on the wiki spaces, mailing lists, and IRC,
- I will publish a paper in WikiJournal,
- I expect input from the community regarding accessibility and usability, and help regarding localization,
- I am also waiting for more information regarding the availability of hardware, see https://phabricator.wikimedia.org/T206636
Get involved
Participants
Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.
I am amz3, also known as zig on freenode. I have been a software engineer in various domains for 10 years (bitbucket, github, sourcehut). I would like to join Wikimedia.
Community notification
Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Need notification tips?
- Draft pre-print at https://en.wikiversity.org/wiki/WikiJournal_Preprints/Generic_Tuple_Store
- wikidata-tech mailing list https://lists.wikimedia.org/pipermail/wikidata-tech/2019-December/001511.html
- another mail to wikidata: https://lists.wikimedia.org/pipermail/wikidata/2019-June/013124.html
- discuss-space @ wmflabs
- https://www.wikidata.org/wiki/Wikidata:Project_chat#Scaling_WDQS
- https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team#Scaling_WDQS_and_WikiData
Endorsements
Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).