Wikimedia monthly activities meetings/Quarterly reviews/Parsoid/October 2014
The following are notes from the Quarterly Review meeting with the Wikimedia Foundation's Parsoid, Services and OCG (Offline content generator) teams, October 3, 2014, 12:30PM - 2:00PM PDT.
- Attending
Present (in the office): Gabriel Wicke, Damon Sicore, Jared Zimmerman, Tomasz Finc, Tilman Bayer (taking minutes), James Forrester, Toby Negrin, Erik Moeller, Trevor Parscal, Terry Chay; participating remotely: Subramanya "Subbu" Sastry, Marc Ordinas i Llopis, C. Scott Ananian, Roan Kattouw, Arlo Breault, Hardik Juneja (from 1:40)
Please keep in mind that these minutes are mostly a rough transcript of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material.
Parsoid
Subbu:
Welcome

[slide 2]
Parsoid team: core and extended
help from others
teams that use Parsoid: also CX

[slide 3]
Agenda

[slide 4]
our objectives

[slide 5]
development context
came a long way since 2011-12

[slide 6]
why this is hard (skipped)

[slide 7]
Progress Q1
strive for similarity with [old] PHP parser's rendering
robustness: handle pathological cases
testing: compare HTML from PHP parser and Parsoid
visual diffing [compare rendered HTML pixel by pixel]
not implemented in Q1:
language variant support - didn't have Scott

[slide 8]
Continuous iteration
lots of edge cases

[slide 9]
Parser tests
tests run on every commit

[slide 10]
Round-trip test results
on 160k pages
now 0.16% - good progress
That's still 7k pages on ENWP though
Gabriel: but .. normally hides even these completely
with that, only 7 pages in that 160k have discrepancies, and only really small ones (an extraneous newline or such)
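
The round-trip idea discussed on this slide (wikitext to HTML and back, then compare with the original) can be sketched roughly as follows. This is only an illustration: `to_html`, `to_wikitext` and `round_trip_diff` are hypothetical toy functions that understand bold markup only, and they are not Parsoid's code or its test harness.

```python
import difflib
from typing import List


def to_html(wikitext: str) -> str:
    """Toy wikitext-to-HTML step (hypothetical): only understands '''bold'''."""
    parts = wikitext.split("'''")
    # Odd-numbered pieces sit after an opening ''' marker.
    # An unclosed ''' is treated as running to the end of the text.
    return "".join(
        part if i % 2 == 0 else f"<b>{part}</b>"
        for i, part in enumerate(parts)
    )


def to_wikitext(html: str) -> str:
    """Toy HTML-to-wikitext step (hypothetical): inverse of to_html."""
    return html.replace("<b>", "'''").replace("</b>", "'''")


def round_trip_diff(wikitext: str) -> List[str]:
    """Return diff lines between the original and the round-tripped wikitext.

    An empty list means the page round-trips cleanly.
    """
    result = to_wikitext(to_html(wikitext))
    return list(
        difflib.unified_diff(
            wikitext.splitlines(), result.splitlines(), lineterm=""
        )
    )


if __name__ == "__main__":
    for page in ["A '''bold''' word", "A '''bold"]:
        status = "clean" if not round_trip_diff(page) else "has a diff"
        print(repr(page), status)
```

Well-formed input round-trips cleanly in this sketch, while the unclosed bold marker comes back with an added closing marker, which is the kind of small discrepancy such a test run would report.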

[slide 11]
visual diffs

[slide 12]
visual diffs http://parsoid-tests.wikimedia.org/visualdiff/
repurposed testing infrastructure for visual diffs
on subset, because it's still quite expensive
Damon: why pixel-perfect accuracy?
(team: it's for detecting larger problems[?])
Trevor: differences come from use of, say, div vs. <image> tag[?]
Parsoid specific CSS to achieve that
Jared: template rendering?
Trevor: mostly images
[slide example: ...tournament (demos diff)]
Trevor: e.g. wrapping in paragraph vs. not
Damon: testing on different browsers? just Webkit
Gabriel: here, cross-browser differences aren't much of an issue usually
CScott: hoping I can reuse some of that for PDF rendering testing
Toby: scope of this testing?
Gabriel: currently only desktop, mobile might be next
Damon: (on relation with language support)
Damon: what would be the next thing you would do for testing if you had the resources to do something cool? (liked CScott's PDF idea)
Subbu: not sure
.. all langs
but visual diffing only for enwiki currently
CScott: this is most important to catch regressions
Damon: any HTML5 elements not covered by this?
Subbu: audio and video not supported yet
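
As an illustration of the pixel-by-pixel comparison discussed on this slide, here is a minimal sketch. It assumes the two screenshots are already decoded into equally sized grids of RGB tuples; `diff_ratio` is a hypothetical helper and does not describe the team's actual visual diff tooling.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (red, green, blue)
Image = List[List[Pixel]]      # rows of pixels


def diff_ratio(a: Image, b: Image) -> float:
    """Return the fraction of pixels that differ between two screenshots."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("screenshots must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    differing = sum(
        pa != pb
        for ra, rb in zip(a, b)
        for pa, pb in zip(ra, rb)
    )
    return differing / total


if __name__ == "__main__":
    white = (255, 255, 255)
    black = (0, 0, 0)
    php_render = [[white, white], [white, white]]
    parsoid_render = [[white, white], [white, black]]
    print(diff_ratio(php_render, parsoid_render))
```

A score like this is easy to rank pages by, so the largest rendering differences surface first, which fits the point above that the aim is to detect larger problems rather than to reach pixel-perfect output.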

[slide 14]
Preparing for HTML5 page views

[slide 15]
still need to fix site CSS
e.g. citations [on enwiki]
can use gadget (en:User:Jackmcbarn/parsoidview.js) to check what Parsoid HTML looks like in production

[slide 16]
remove mw-data...
also to enable asynchronous saves

[slide 17]
make it more robust

[slide 18]
more robust
fixes after huwiki table incident last month

[slide 19]
performance
(shows load/memory graphs)

[slide 20]
arwiki/enwiki/dewiki parsing times in seconds
Gabriel: other direction (HTML -> wiki conversion) is much faster, 100ms avg

[slide 21]
(Subbu:)
Other
Gabriel: collaboration with WikiProject Check Wikipedia [to fix corner cases]
Damon: looked into fuzzing?
Gabriel: basically we use production data for fuzzing, but not randomly generated input
Subbu: no
[someone else:] we have real people fuzzing us ;)
Damon: fuzzing could help find issues like the large table issue on huwiki
CScott: last crash issue was a while ago
a lot of the remaining issues are bugs from PHP parser [emulate or not]
Damon: any security issues?
JamesF: use it on some private wikis, but not an issue
Trevor: because we work on a DOM, escaping is more likely to be done right
moving away from string stuff with e.g. templates
Erik: things get scary with user-created templates
in wiki-markup, this is combination of wiki templates and Lua, both well sanitized[?]
looking into better ways for templates
Gabriel: (on sanitizing)
we tracked all PHP security issues
Damon: sounds good
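
Trevor's point that working on a DOM makes correct escaping more likely can be illustrated with a small sketch. It uses Python's standard `xml.etree.ElementTree` purely as a stand-in for a DOM library; the function names are hypothetical and this is not how Parsoid itself is implemented.

```python
import xml.etree.ElementTree as ET


def render_with_strings(user_text: str) -> str:
    """String concatenation: the caller must remember to escape (not done here)."""
    return "<p>" + user_text + "</p>"


def render_with_dom(user_text: str) -> str:
    """DOM construction: the serializer escapes text content automatically."""
    paragraph = ET.Element("p")
    paragraph.text = user_text
    return ET.tostring(paragraph, encoding="unicode")


if __name__ == "__main__":
    payload = "<script>alert(1)</script>"
    print(render_with_strings(payload))  # markup passes through unescaped
    print(render_with_dom(payload))      # markup is escaped as text
```

In the string version, every call site has to get escaping right; in the DOM version, the untrusted text is stored as a text node and the serializer escapes it in one place.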

[slide 22]
Subbu: Wikitext linter GSOC project by Hardik Juneja
[examples: [1]]
(slide 23: PDF renderer - see CScott's part below)
(slide 24: Things we didn’t get done in Q1)

[slide 25]
Broad areas for Q2

[slide 26]
Parsoid HTML views
cite CSS (Marc works on this)
HTML Tidy vs. HTML5 differences: matter in case of broken wikitext
mixed content style templates (e.g. one template opens a tag, another closes it - doesn't match with DOM)

[slide 27]
Q2 tasks: supporting clients
language variant support

[slide 28]
stable IDs (for elements, persisting across wikitext edits) - e.g. for authorship maps, inline comments, switching from VE to wikitext editor

[slide 29]
new applications
Linttrap...

[slide 30]
perf + maintenance
Thank you! (end of Parsoid slides)
Damon: do we have specific perf goals? like "<20millisec for ..."
Gabriel: has not been an issue so far because it's not [directly affecting users]
Damon: this looks great, I'm still learning... generally, I like to ask about: testing, defining what winning means
Erik: make Parsoid HTML view gadget into Beta Feature?
Subbu: in a few weeks
Erik: any really horrible examples left?
Subbu: some templates, infoboxes and such, would look bad
Erik: do we need cross lang visual diffing before that?
JamesF: cite extension's styling is a hack
Parsoid output is munged to look like enwiki for that
frwiki doesn't have [ ] around refs [but in Parsoid HTML they would look like they do on enwiki]
Erik: ...
CScott: currently using PhantomJS, old, might not translate well to e.g. non-Latin languages
Erik: so let's restrict Beta feature to Latin languages
Roan: Timo worked on unit tests
CScott: PhantomJS might have new release [which fixes that]
Services
Gabriel:
started at beginning of [Q1]

[slide 2]
had discussion in January, resolved to move to services model
scale storage - some parts like external store have been band-aid, hitting limits

[slide 3]
(perf graphs)
PHP API, uncached, we have to ask people not to do expensive things
can tie up one instance [for a longer time]
but is powerful for editing

[slide 4]
most of our app servers are busy processing cache misses
edit rate relatively low (~15/s avg, 50/s peak across all projects)
24 Parsoid boxes can keep up with updates (but use PHP API)
Damon: that's enwiki? no, all projects
Toby: 95% or 99%?
Gabriel: might be >95, yes
we have a long history of caching
it was a successful strategy
Trevor: but doesn't work well for people logging in
more strategy for personalization
Damon: ...
Erik: might shift some of that [personalization] to client, with async [modification]
[personalization is] hard with any caching
JS is a good mechanism for UI customization
example:...
Gabriel: also, stable IDs enable some customization
Roan: also, logged-in users always get data from Virginia instead of from caching center that's closest to them
Damon: every app server is on the same cluster? yes
Erik: Dallas is almost exact replica of Virginia, aim for failover of app server cluster
but there's interest in using it for load balancing etc. too
Damon: are we deploying with an eye to isolation and anonymization?
hearing Virginia is a bit distressing...
(discussion about hosting)
Gabriel:
distribution means some security challenges. don't trust the code; have had remote code execution & SQL injection bugs... e.g. should image scaling servers have full access to the user table?

[slide 5]
Resourcing: Matt left in July
Hardik
hiring

[slide 6]
Original plan for Q1

[slide 7]
RESTBase

[slide 8]
misc backend services
Erik: do we have citoid.wikimedia.org now? no, it's internal
Roan (on chat): Citoid isn't deployed yet, Alex -1ed my patch. But it is deployed in labs and getting it deployed in prod is just a question of me splitting up my patch into 3 parts to make things less scary for Alex
(Gabriel:)
Mathoid:
JamesF: had a bit written in OCaml, scary!
CScott: I actually removed it recently
Gabriel: working with MathJax people
Damon: so we use LaTeX how?
Gabriel: it's the source, within wikitext
Erik: has been supported in wikitext for a long time, used to generate PNGs
issue: client-side latency
they (volunteers) implemented server-side version of MathJax
Gabriel: open question: who owns MediaWiki integration?
Erik: how about you ;)
Gabriel: Need front-end expertise.
Damon: we should support the math folks
--break for hydrant breakage--

[slide 10]
Gabriel:
templating: Matt and myself, make DOM-based templating with context-sensitive escaping fast
idea is to complement existing reactive client-side solution
Knockoff: fast on server side; complements KnockoutJS on client
well suited for server-side pre-rendering followed by client-side dynamic updates
did benchmarks across languages, it's fast
front-end standardization group now working on this

[slide 11]
Q2
RESTBase, hiring

[slide 12]
Q2 RESTBase
Erik: is this the first use of Swagger? yes
did you consider this for RC stream too? yes
Erik: does Moiz know?
Gabriel: yes
Wikia uses it too
build generic & basic monitoring for backend services (latency, errors)
new entry points, not systematically exposed in API currently
variant production on edits (produce offline and store)
need reliable queueing / events
for purging, etc.
Damon: is there also unreliable ...
Gabriel:
we have a homegrown job queue system, tied closely to PHP
forces you to write your own job runner, doesn't let you do reliable pub/sub
was in bash, recently rewritten in PHP
Analytics team has a lot of experience with Kafka, might use that
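
To make concrete what "reliable" queueing means in the discussion above, here is a minimal in-memory sketch of at-least-once delivery with acknowledgements. `ReliableQueue` and its methods are hypothetical names used only for illustration; this is neither the existing MediaWiki job queue nor Kafka's API.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class ReliableQueue:
    """Toy queue: an event stays tracked until a consumer acknowledges it."""

    def __init__(self) -> None:
        self._pending: Deque[Tuple[int, str]] = deque()
        self._in_flight: Dict[int, str] = {}
        self._next_id = 0

    def publish(self, event: str) -> None:
        self._pending.append((self._next_id, event))
        self._next_id += 1

    def receive(self) -> Optional[Tuple[int, str]]:
        """Hand out the next event and remember it as unacknowledged."""
        if not self._pending:
            return None
        msg_id, event = self._pending.popleft()
        self._in_flight[msg_id] = event
        return msg_id, event

    def ack(self, msg_id: int) -> None:
        """Consumer confirms the event was fully processed."""
        self._in_flight.pop(msg_id, None)

    def requeue_unacked(self) -> None:
        """Called after a consumer crash or timeout: redeliver unacked events."""
        for msg_id, event in sorted(self._in_flight.items()):
            self._pending.append((msg_id, event))
        self._in_flight.clear()


if __name__ == "__main__":
    queue = ReliableQueue()
    queue.publish("purge:Example_page")
    print(queue.receive())   # consumer takes the event but crashes before ack
    queue.requeue_unacked()
    print(queue.receive())   # the same event is delivered again
```

The trade-off in this style is that an event may be delivered more than once, so consumers such as cache purgers need to tolerate duplicates.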

[slide 13]
Q2: perf
Toby: this is on SSDs?
Gabriel: will deploy to SSDs, when it matters
Cassandra
eliminate cache misses
for Parsoid, also e.g. old version
VE fast saves - that's mostly in VE land
JamesF: we have some ideas ;)
Gabriel: send HTML, but don't wait for rendering to come back, instead use HTML you already have
logged in page views faster
just some customization, e.g. links underlined

[slide 14]
security
for most read requests, don't need to go to database for auth
first phase implementation in cookies
also consumed by MediaWiki
Erik: has Wikia done any work on auth?
Gabriel: yes, recently talked with them
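
One common way to avoid a database lookup for authentication on read requests is a signed token carried in a cookie. The sketch below shows that general technique only; the token layout, the `sign` and `verify` helpers and the secret are all assumptions made for illustration, since the minutes do not describe the planned format.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical shared secret; a real deployment would load this from config.
SECRET = b"example-shared-secret"


def sign(payload: str, secret: bytes = SECRET) -> str:
    """Return 'payload.signature', where the signature is an HMAC-SHA256 hex digest."""
    mac = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{payload}.{mac}"


def verify(token: str, secret: bytes = SECRET) -> Optional[str]:
    """Return the payload if the signature is valid, else None. No database access."""
    payload, sep, mac = token.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking information through timing.
    if hmac.compare_digest(mac.encode("utf-8"), expected.encode("utf-8")):
        return payload
    return None


if __name__ == "__main__":
    cookie = sign("user=Example;rights=read")
    print(verify(cookie))                            # valid: payload returned
    print(verify(cookie.replace("read", "admin")))   # tampered: None
```

Any service that holds the secret can check such a token locally, which matches the note above that the cookie would also be consumed by MediaWiki. A real design would additionally need expiry and revocation, which this sketch leaves out.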

[slide 15]
Tooling
takes several days per service, lots of bugs in deployed versions of salt & trebuchet
duplication of init scripts etc. between packages and different puppet setups
should see with RelEng & Ops teams if we can streamline the process for services
OCG (PDF rendering)
CScott: OCG (Offline Content Generator)

[slide 3]
Global South

[slide 4]
history

[slide 5]
mwlib: not well maintained
a lot of bugs, especially with non-Latin scripts

[slide 6]
2014 so far
in production this Monday

[slide 7]
Q2 goals
bus factor 1.5
tables and infoboxes missing
Indic languages
had to turn off ePub and ZIM
but already have plaintext as beta

[slide 8]
next-gen PDF renderer
current version of PhantomJS outdated
Print CSS?

[slide 9]

[slide 10]
in production now, ~110k req / day

[slide 11]
cache hit rate still only 25%
Thursday: Indic languages turned on

[slide 12]
graph
cache clearing
Erik: let's sync up on roadmap, defining what we can provide and what will need to be taken up by community
suggest base commitment: reasonable rendering quality for all, extra effort for some large services like Kiwix, rely on volunteers for the rest
CScott: all my work recently was on production issues