Grants:Project/Future-proof WDQS

I will make a future-proof and much easier to scale WikiData.

Many aspects can be improved in WikiData, WikiBase, WDQS et al. In the following I try to make a comprehensive analysis of the current situation. Including some elements from the 2030 strategy, some of those recommendation are inspired from Denny Vrandečić essay called Toward An Abstract Wikipedia.

At its scale, Wikidata has reached the limits of what is possible to do with (legacy?) off-the-shelf software, efficiently in a future-proof way.

statuswithdrawn

Project Grants

Future-proof WikiData

summaryCreate a Minimum-Viable-Product for a future-proof WikiData

targetWikiData

type of granttools and software

amountplease add the amount you are requesting (USD)

type of applicantindividual

grantee• Iamamz3

contact• talk

this project needs...

volunteer

affiliate

grantee

give feedback

join

endorse

created on11:28, 23 December 2019 (UTC)

Friendly space expectations

Project idea

What is the problem you're trying to solve?

What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.

The following problem statements is split into 3 parts:

Why and How WDQS does not scale?
Why and How WikiData does not scale?
Why and How WikiData is not future-proof?

This section ends with a summary.

WikiDataQueryService does not scale

Quoting Guillaume Lederrey Operations Engineer — Search Platform, Wikimedia Foundation in the wikidata mailling thread "Scaling Wikidata Query Service":

In an ideal world, WDQS should: scale in terms of data size scale in terms of number of edits have low update latency expose a SPARQL endpoint for queries allow anyone to run any queries on the public WDQS endpoint provide great query performance provide a high level of availability Scaling graph databases is a "known hard problem", and we are reaching a scale where there are no obvious easy solutions to address all the above constraints. At this point, just "throwing hardware at the problem" is not an option anymore. We need to go deeper into the details and potentially make major changes to the current architecture.

I want to add the requirement that is shall be easy for researchers and practitioners to setup their own instance which would be another form of scaling that is social which entails making it easier to use wikidata.

The current solution adopted to support WDQS involves BlazeGraph. BlazeGraph is not really maintained because the developers were hired by Amazon. Wikimedia could just invest more in BlazeGraph maintenance (see the commits of Stas Malyshev Software Engineer, Wikimedia Foundation on BlazeGraph repository.). Since sharding is not realistic because of the schema of wikidata and because performance would not be good anyway: Blazegraph scale only using the vertical strategy using replicas (copies). Vertical scaling hits the limitations of the hardware, and eventually of hardware and physics: in the foreseeable future there is only so much one can store inside a single machine box.

Here is breakdown of the current solution involving blazegraph try to scale WDQS:

Blazegraph approach to scaling WDQS
#	requirements	strategy	limitation
1	Scale in terms of data size	vertical scaling: bigger hard disks	Physical, due to available hardware technology.
2	Scale in terms of edits	vertical scaling: faster cpu (and larger network bandwidth)	Physical, due to available hardware technology. The entailed limitations are linked to the vertical scaling strategy, lead to the existence of a "lag" between WikiBase and WDQS, see the following row.
3	Lag: have low update latency	vertical scaling: faster cpu (and larger network bandwidth)	Physical, due to available hardware technology. There shall be no lag: it is also a problem of software.
4	Expose a SPARQL endpoint for queries	translation middleware in front of blazegraph, see https://github.com/wikimedia/wikidata-query-rdf	Operations are made more difficult because there is a lot of services and moving parts.
5	Allow anyone to run any queries on the public ~~WDQS endpoint~~ wikidata triples	WDQS	No time-traveling queries, Lag (see row 2), Many queries timeout,
6	Provide great query performance	vertical scaling and replicas	Operations are more difficult.
7	Provide a high level of availability	replicas	Operations are more difficult.
8	Easy to setup and operate	docker-compose or kubernetes	Requires more skills.

WikiData does not scale

The previous section describes several reasons why a specific component of wikidata infrastructure is not future-proof. WikiDataQueryService rely on vertical scaling, hence the availability of performant and efficient hardware that is possibly costly. The consequence of the limitations of Blazegraph software, hence WDQS, is that WikiData is difficult to:

setup and reproduce,
develop and maintain,
operate and scale.

Along those three dimensions, taking a look at the bigger picture that is wikidata project, draws a situation that is worse:

WikiData Problems
#	Topic	Problem	Effect
0	setup and reproducibility	Too many independent processes and code bases (microservices)	Less contributions
1	setup and reproducibility	Full-stack coding environment requires skills with Docker, docker-compose, Kubernetes	Less contributions
2	setup and reproducibility	Production environment setup requires skills with Kubernets or Puppet	Less contributions
3	development and maintenance	MediaWiki: PHP and JavaScript code base with a lot of legacy code	Less contributions
4	development and maintenance	WikiBase: PHP and JavaScript code base	Less contributions
5	development and maintenance	Too many programming languages (PHP, JavaScript, Ruby, Lua, Go, Java, sh...)	Less contributions
6	operate and scale	Too many databases (MySQL, REDIS, Blazegraph, ElasticSearch)	Less contributions
7	operate and scale	Impossible to do time travelling queries	Less contributions
8	operate and scale	See section "WikiDataQueryService does not scale"	Less contributions
9	operate and scale	Edit than spans multiple items	Less contributions

Because WikiData is difficult to scale, Wikimedia fails to fully enable and empower users, according to its mission:

Wikimedia mission
"The mission of the Wikimedia Foundation is to empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally."
https://wikimediafoundation.org/about/mission/

Why and How WikiData is not future-proof?

In the two previous sections, two components were analyzed and shed light on some existing problems problems. This section will try to extract from existing publication, possible problems that wikidata shall need to tackle in the future.

Toward an abstract wikipedia

http://simia.net/download/abstractwikipedia_whitepaper.pdf

Wikimedia movement strategy toward 2030

Strategy/Wikimedia movement/2018-20/Recommendations

Summary

Big Picture
#	Problem	Time scale	Why	Effect
1	WDQS is not scalable	present	Legacy off-the-shelf software that lead the project to be neither maintainable, nor scalable.	WikiData is not scalable.
2	At WikiData scale, No Usable Versioned Triple Store.	immediate future	No use-case for such a software until now.	No time-traveling queries, No change request mechanic, ⇒ Cooperation around the creation and maintenance of structured data is painful.
3	WikiData is not scalable	immediate future	WikiData based on MediaWiki, Horizontal scaling is an essential complexity.	Less code and data contributions.
4	Trusted knowledge as a service is difficult	immediate future	Software and software development does not scale easily Communities do not scale easily Search and discovery is still difficult	Less knowledge equity
5	No Abstract Wikipedia	future	No existing open-source software Not enough interests in existing scientific contributions	Less knowledge equity Unrealistic goal for the foreseeable future.
6	Earth scale encyclopedic knowledge	future	Internet access is still not global Current architecture and cooperation paradigm is not scalable	Need to continue to rethink the current architecture Need to continue to explore cooperation mechanics Need to continue to explore distribution mechanics

What is your solution?

For the problem you identified in the previous section, briefly describe your how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.

Only the first three problems of the Big Picture will be addressed:

Scalable WikiDataQueryService
Scalable Versioned Triple Store
Scalable Wikidata

How to make WikiData scalable?

The summary of the solution is:

to drop legacy reduce the operational costs,
reduce the learning-curve to ease the on-boarding of new developers,
scale wikidata, including SPARQL queries,
add a new features: time traveling queries and change-request.

The following table describes proposed solutions to existing problems in WikiData:

Proposed Solution to WikiData Problems
#	Topic	Problem	Solution	Effect
0	setup and reproducibility	Too many independent processes and code bases (microservices)	A single executable A single process (no microservices, outside the possibly horizontally scalable distributed database) A single code base	Easy to setup, hence reproduce, Less operational work, Code is easier to make sense and navigate, ⇒ Easier to contribute code
1	setup and reproducibility	Full-stack coding environment requires skills with Docker, docker-compose, Kubernetes	Avoid docker, docker-compose and Kubernetes to setup the coding environment. This is made possible because the solution is not based on microservices.	Requires less coding and operational skills, ⇒ Easier to contribute code.
2	setup and reproducibility	Production environment setup requires skills with Kubernets or Puppet	Vertical scaling: the database is embedded in the "single process" Horizontal scaling: Multiple machine deployment require something like Kubernetes or Puppet, so in some sense this problem is minor and not addressed.	Easy single process production setup, Kubernetes or Puppet will still be required for production environment to support horizontal scalability.
3	development and maintenance	MediaWiki: PHP and JavaScript code base with a lot of legacy code	Do not rely on MediaWiki, Do not rely on PHP or JavaScript, Do not rely on the existing approach where "structured data" is an afterthought.	No legacy code, Rethink solution to match the problem, Clean solution, ⇒ Easier to contribute code.
4	development and maintenance	WikiBase: PHP and JavaScript code base	Rewrite WikiBase from scratch with Scheme to match the problem.	Easier to code and maintain, Faster code, ⇒ Easier to contribute code.
5	development and maintenance	Too many programming languages (PHP, JavaScript, Ruby, Lua, Go, Java, sh...)	Do not rely on PHP, Javascript, Lua, Go and Java, Rely on Scheme, Rely on project maintained by third parties when other programming languages are required.	Reduce the number of languages in the stack and under the responsability of wikimedia, Dependance on third-parties, Faster code, ⇒ Easier to contribute code.
6	operate and scale	Too many databases (MySQL, REDIS, Blazegraph, ElasticSearch)	Rely only on, possibly in-memory, possibly distributed, Ordered Key-Value Store.	Less database expert knowledge required, Easier operations, Faster, Scalable, ⇒ Easier to contribute code, ⇒ Easier to contribute data.
7	operate and scale	Impossible to do time travelling queries	Use Generic Tuple Store to implement a Versioned Generic Tuple Store.	⇒ Easier to contribute data.
8	operate and scale	See section "WikiDataQueryService does not scale"	Rely only on, possibly in-memory, possibly distributed, Ordered Key-Value Store.	⇒ Easier to contribute code, ⇒ Easier to contribute data.
9	operate and scale	Edit that spans multiple items	Change request mechanic allow to add, delete and undo changes	⇒ Easier to contribute data. ⇒ Allow one-click undo of merges and QuickStatement batches.

What are other solutions?

virtuoso-opensource

github: https://github.com/openlink/virtuoso-opensource/

Pros

Similar existing deployment
Supported by an experienced company

Cons

monopoly
vendor lock-in
no support for time-traveling queries
no support for change-request
not a complete solution
AS OF YET, no jespen.io database harness tests?
MAYBE not complete ACID guarantees?

Property graph databases

See https://github.com/jbmusso/awesome-graph/#awesome-graph

Pros

MAYBE similar existing deployment but certainly not in the open
Supported by established company (neo4j, dgraph, arangodb), in the case of JanuGraph, it is supported by the Linux Foundation.

Cons

does not map efficiently to RDF triples
no support time-traveling queries
no support for change-request
not a complete solution
AS OF YET, no jespen.io database harness tests (neo4j, dgraph, arangodb)

Other triple stores

github: https://github.com/semantalytics/awesome-semantic-web#databases

Pros

?

Cons

no support for time traveling queries
no support for change-request
not complete solution
AS OF YET, no jespen.io database harness tests?

Project goals

What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.

The goal of the project is to support WikiData growth in terms of:

code contributions,
data contributions.

Toward that goal, the project must be:

easy to setup, reproduce, code and maintain the code,
faster, allow time-traveling queries and provide a way to visual edition triples,
both vertically and horizontally scalable.

From this project will emerge a clear architecture toward a scalable wikidata.

Project impact

How will you know if you have met your goals?

For each of your goals, we’d like you to answer the following questions:

During your project, what will you do to achieve this goal? (These are your outputs.)
Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)

For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (e.g. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.

Outputs

The github repository of the code,
Two or three contributors to the project,
Positive benchmark results based on https://iccl.inf.tu-dresden.de/web/Wissensbasierte_Systeme/WikidataSPARQL/en,
One published paper on wikijournal about the solution,
One or two organizations outside wikimedia start using the project.

Outcomes

More people outside wikimedia use the project to host wikidata or wikidata-like projects
The current stack / architecture is replaced with the result of this project
More people contribute to wikidata
wikidata doubles the number of triples to reach 20 billions

Do you have any goals around participation or content?

Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable. Remember to review the tutorial for tips on how to answer this question.

The project will improve performance and availability of WikiData.

Project plan

Activities

Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow-up with people that are involved with your project?

Plan
Quarter	Title	Activitiy	Guesstimate	Output
1	Arew	Finish work on Arew 1.0.0 Tidy the standard library Finish SRFI-180 (JSON) Submit untangle to SRFI Submit HTTP/1.1 to SRFI	1 month	Efficient R7RS-large implementation on top of Chez Scheme
1	Ruse	Finish work on Ruse 1.0.0 Implement closure conversion Implement JQuery bindings Finish ReactJS bindings	1 month	R7RS-small based on nanopass framework that runs in the browsers
1	nomunofu 0.2.0	Finish work on nomunofu 0.2.0 Tidy nstore Add JSON over HTTP query service	1 month	JSON query via REST API WiredTiger micro-benchmarks
2	nomunofu 0.3.0	Adapt FoundationDB bindings to support SRFI-167	1 month	FoundationDB micro-benchmarks
2	nomunofu 0.4.0	Tidy versioned nstore Initial SPARQL REST with change-request support: select insert update delete	1 month	RDF conformance tests^[1] time-traveling queries support via REST API SPARQL via REST API SPARQL benchmarks
2	nomunofu 0.5.0	Users and permissions Visual edition of tuples of n items Visual change-request mechanic Create Apply Revert History of changes via REST API	1 month	Web-based graphical user interface Stream of changes via REST API
3	nomunofu 0.6.0	OAUTH 2.0 REST API improvements	1 month	Robot access
3	nomunofu 0.7.0	Autocomplete Search Spell-checking	1 month	Better usability of visual edition of tuple of n items
3	nomunofu 0.8.0	SPARQL optimizations, fine tunings, and benchmarks	1 month	SPARQL micro-benchmarks with WiredTiger SPARQL micro-benchmarks with FoundationDB
4	nomunofu 0.9.0	Visual editor I18n, a11y, and ui/ux review and improvements	1 month	I18n visual editor a11y visual editor
4	nomunofu 0.9.9	More SPARQL optimizations, fine tunings, and benchmarks	1 month	FoundationDB cluster tips, tricks and recommendations WiredTiger configuration tips, tricks and recommendations Full benchmarks^[2]
4	nomunofu 1.0.0	Bug fixes Tidy WikiJournal publication	1 month	Minimum Viable Product Code Tests Documentation WikiJournal Publication

Budget

How you will use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!

Budget will be set when we agree on a plan. The rough estimate is between 2500-5500 euros per month depending on applicable taxes, possibly plus the cost the rent hardware to do the benchmarks (see https://phabricator.wikimedia.org/T206636).

Community engagement

Community input and participation helps make projects successful. How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve during your project?

I will continue to blog about my project at https://hyper.dev (currently offline, I will prolly move my blog to a mailling list at source hut) with weekly, bi-weekly and monthly review of my progress and engage with the community on the wiki spaces, mailing lists, and IRC,
I will publish a paper on wiki journal,
I expect input from the community regarding accessibility, usability and help regarding localization
I also wait for more information regarding the availability of hardware, see https://phabricator.wikimedia.org/T206636

References List

Get involved

Participants

Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.

I am amz3 also known as zig on freenode. I have been a software engineer in various domain for 10 years (bitbucket, github, sourcehut). I would like to join wikimedia.

Community notification

Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Need notification tips?

Draft pre-print at https://en.wikiversity.org/wiki/WikiJournal_Preprints/Generic_Tuple_Store
wikidata-tech mailing list https://lists.wikimedia.org/pipermail/wikidata-tech/2019-December/001511.html
another mail to wikidata: https://lists.wikimedia.org/pipermail/wikidata/2019-June/013124.html
discuss-space @ wmflabs
https://www.wikidata.org/wiki/Wikidata:Project_chat#Scaling_WDQS
https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team#Scaling_WDQS_and_WikiData

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

[1] ttps://github.com/w3c/rdf-tests/

[2] ttps://iccl.inf.tu-dresden.de/web/Wissensbasierte_Systeme/WikidataSPARQL/en

[1]

[2]