Grants:Project/Ilya/ScalaWiki data processing toolbox
Moved to draft by the author --Ilya (talk) 15:35, 24 August 2016 (UTC)
Project idea
What is the problem you're trying to solve?
There is no universal toolbox to define and execute processing of Wikipedia-related data.
There are several basic ways to retrieve data:
- Using the MediaWiki API to query about 30 lists and 20 properties (plus extensions and other services like the Pageview API)
- Using SQL queries on database replicas (without actual article texts)
- Downloading and processing data dumps
- Processing Wikidata via its API, the SPARQL Wikidata query service or dumps
- Extracting data from parsed page wikitext, especially tables and templates
While theoretically all of these can provide access to any data present on Wikimedia projects, in practice users need to write scripts or tools for data access.
A tool developer going up the abstraction level from these data sources will find a large number of client libraries for the MediaWiki API, tools to import or process dumps, and so on (incomplete list).
Many of them implement only a subset of the lower-level API's functionality, lack documentation, and often their code coverage with automated tests is not enough to protect from various bugs. You may find that the features you'd like to use or adapt are only achievable with a combination of many tools, all in different programming languages, styles and quality levels.
On top of this, there are many different tools for end users. One example of a topical list of tools is the GLAM tools list. Many tools are no longer working or are unsupported. They usually provide fixed, hardcoded functionality, so there are many requests for more tools.
You cannot easily combine data from various tools: you usually need to prepare the input data, then process and combine the output data. Many tools hang when fed a large amount of data. If you measure some metric with several similar tools, you can get somewhat (or quite) different results without knowing which implementation differences in the metric caused them. And when you see the many reports and statistics published, you cannot rerun, check, improve or adapt them, as they are produced with a lot of manual work.
More detailed examples of problems
The Wikimetrics tool has been developed by the WMF since the end of 2012 and has been advertised and discussed at many conferences, yet it still cannot be used for most of Wikimedia Ukraine's needs:
- You need to gather lists of users yourself. Often, when you have found the relevant users with some tool, it is more convenient to keep calculating metrics with that same tool rather than feed them into another one.
- You calculate metrics based on a user list and a time range (or a Commons category, as an exception). This does not define the activities very well. It is mostly suitable for newcomers engaged in their first project; otherwise users contribute not only to the specific project but outside of it as well. Several contests/thematic weeks can run simultaneously, and users also contribute outside any project.
- It does not compare thematic activity in the project and outside the project.
- It does not analyze user activity outside the project beyond giving the user a label (active/not active)
- It hangs on large datasets[1]
- It does not understand content persistence. A user who inserts a copyvio 3 times and gets reverted each time will be counted as contributing (copyvio size) × 3 bytes
Existing reports
Reports like Grants:Evaluation/Evaluation reports, mw:Wikimedia Research/Research and Data and grant reports are created manually (with the help of some tools, but not automatically), with no data or algorithms published, so one cannot easily build on them to investigate, improve, adapt them to one's needs, or repeat them. From such reports you only get some very limited ratios/medians/summaries predefined by the report authors, and you cannot improve the algorithms or the ways the data is presented and investigated.
It looks like a huge effort goes into each of these reports, while it would be better to write a tool that gathers the statistics, configure it and reuse it. Other statistics are probably more interesting as well: what percentage of monuments are listed / have articles / have sources, how good the articles are, whether they are illustrated, in what regions they are, how far from populated cities, and in what architecture styles.
What is your solution?
Provide a toolbox that will support querying most of the available data sources.
- The MediaWiki API module will
- use metadata from API:Parameter information to generate and assure a full API implementation
- provide a UI similar to Special:ApiSandbox, but configurable, reusable and more user friendly
- There should be similar modules for SQL queries and XML dump processing
- There should be at least basic Wikidata support.
- Each execution command or step should be able to (see the sketch after this list)
- be executed independently
- be combined with other commands
- read input from various sources (external formats and other commands' output)
- produce output to various sinks (external formats and other commands' input)
- supported external formats should include CSV, JSON, XML, HTML tables, structured wikitext and Extension:Graph source data
- The toolbox should be configurable
- Execution commands should be configurable from configuration files or a web UI.
- It should be possible to restrict the configuration options available to the end user, so specific configurations can be used to achieve specific goals.
- Users should be able to create, adapt and share commands and configurations
- There will be automated unit and integration tests to make sure the implementation works correctly. I'll use the Travis configuration from Semantic MediaWiki, which already provides a setup for integration testing against different MediaWiki versions.
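As a rough illustration of this command model, here is a minimal Scala sketch; the Command trait and the two example commands are hypothetical, not ScalaWiki's actual API:

trait Command[In, Out] { self =>
  def run(input: Iterator[In]): Iterator[Out]

  // Commands combine into pipelines: the output of one feeds the next.
  def andThen[Next](next: Command[Out, Next]): Command[In, Next] =
    new Command[In, Next] {
      def run(input: Iterator[In]): Iterator[Next] = next.run(self.run(input))
    }
}

// Example commands: keep only category pages, then render each title as a JSON line.
object KeepCategories extends Command[String, String] {
  def run(input: Iterator[String]): Iterator[String] =
    input.filter(_.startsWith("Category:"))
}

object ToJsonLine extends Command[String, String] {
  def run(input: Iterator[String]): Iterator[String] =
    input.map(title => s"""{"title": "$title"}""")
}

object PipelineExample extends App {
  val pipeline = KeepCategories.andThen(ToJsonLine)
  pipeline.run(Iterator("Category:Churches", "Main Page")).foreach(println)
  // prints: {"title": "Category:Churches"}
}

Each command here could equally read from a CSV file or another command's output and write to any of the sinks listed above; the composition stays the same.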
Technical details
Here is more clarification on what I mean by a query "configuration".
MediaWiki API request parameters are just a collection of key = value pairs; in an HTTP GET request they look like ?key1=value1&key2=value2&key3=value3...
Not every key makes sense together with every other: many keys are only available when some other key is present, so the parameters form a tree, where child nodes provide more detail for their parent nodes. Tree data structures can be represented via JSON:
{
  "action" : {
    "module": "query",
    "props": {
      "module": "revisions"
    }
  }
}
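As a sketch of how such a tree could be flattened back into the key=value pairs the MediaWiki API expects, here is a toy Scala example (the Node model is illustrative; the real mapping of module names to prefixed parameters needs the metadata from API:Parameter information):

sealed trait Node
case class Leaf(value: String) extends Node
case class Branch(children: Map[String, Node]) extends Node

object TreeToParams extends App {
  // Mirrors the JSON above: a "query" action with a "revisions" prop.
  val tree = Branch(Map(
    "action" -> Branch(Map(
      "module" -> Leaf("query"),
      "props" -> Branch(Map("module" -> Leaf("revisions")))
    ))
  ))

  // Collect each leaf value together with the path leading to it.
  def flatten(node: Node, path: List[String]): List[(List[String], String)] =
    node match {
      case Leaf(value) => List(path -> value)
      case Branch(children) =>
        children.toList.flatMap { case (key, child) => flatten(child, path :+ key) }
    }

  flatten(tree, Nil).foreach { case (path, value) =>
    println(s"${path.mkString(".")} = $value")
  }
  // action.module = query
  // action.props.module = revisions
  // (a real implementation would render this as ?action=query&prop=revisions)
}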
SQL scripts can be represented as abstract syntax trees (ASTs) as well. The notion of Language Integrated Query (the article is about the best-known Microsoft implementation, but the notion is very general) shows the similarity between SQL scripts and functional programming. The AST classes from the Quill LINQ library show how most of SQL syntax is easily mapped: most of the AST classes and operators.
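As a toy illustration of the AST idea (these are not Quill's actual classes), a tiny subset of SQL can be modeled as a tree and rendered back:

sealed trait Ast
case class Table(name: String) extends Ast
case class SortBy(source: Ast, column: String, ascending: Boolean) extends Ast
case class Take(source: Ast, n: Int) extends Ast

object SqlAst extends App {
  // select * from category order by cat_title asc limit 10
  val query: Ast = Take(SortBy(Table("category"), "cat_title", ascending = true), 10)

  def render(ast: Ast): String = ast match {
    case Table(name)           => s"select * from $name"
    case SortBy(src, col, asc) => s"${render(src)} order by $col ${if (asc) "asc" else "desc"}"
    case Take(src, n)          => s"${render(src)} limit $n"
  }

  println(render(query)) // select * from category order by cat_title asc limit 10
}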
Data processing frameworks such as Spark or Flink also provide APIs that are very similar to Scala collections, and they can process various data sources. Both Spark[2] and Flink[3] allow running SQL scripts on their data. Below is an example of equivalent API parameters, an SQL script, and Scala code for Scala collections and the Flink Dataset API.
The two rightmost columns share these Scala definitions:

case class Category(id: Long, title: String, pages: Long, subcats: Long, files: Long)
val data = Seq(Category(1, "Cat1", 1, 2, 3))

API | SQL | Scala collections | Scala Flink Dataset API |
---|---|---|---|
list = allcategories acprop = size acdir = ascending aclimit = 10 [4] | select cat_title, cat_pages, cat_subcats, cat_files from category order by cat_title asc limit 10 [5] | data.sortBy(_.title).map(c => (c.title, c.pages, c.subcats, c.files)).take(10) [6] | ExecutionEnvironment.createLocalEnvironment().fromCollection(data).sortPartition(_.title, Order.ASCENDING).map(c => (c.title, c.pages, c.subcats, c.files)).first(10) |
Here the correspondence is broken down element by element:
Description | API | SQL | Scala collections | Scala Flink Dataset API |
---|---|---|---|---|
source | list = allcategories | from category | Seq(Category(1, "Cat1", 1, 2, 3)) | env.fromCollection(Seq(Category(1, "Cat1", 1, 2, 3))) |
fields | acprop = size | select cat_title, cat_pages, cat_subcats, cat_files | .map(c => (c.title, c.pages, c.subcats, c.files)) | .map(c => (c.title, c.pages, c.subcats, c.files)) |
sorting | acdir = ascending | order by cat_title asc | .sortBy(_.title) | .sortPartition(_.title, Order.ASCENDING) |
limiting | aclimit = 10 | limit 10 | .take(10) | .first(10) |
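To make the correspondence concrete, here is a small Scala sketch of the underlying idea: one abstract query description interpreted into both API parameters and SQL. The Query class and both renderers are illustrative, not an existing ScalaWiki API:

case class Query(
  source: String,          // "allcategories" / "category"
  fields: Seq[String],     // prop groups or columns
  sortBy: Option[String],
  limit: Option[Int]
)

object Interpreters extends App {
  val q = Query("allcategories", Seq("size"), sortBy = Some("ascending"), limit = Some(10))

  // MediaWiki API rendering; prefixes like "ac" would come from paraminfo metadata.
  def toApiParams(q: Query, prefix: String): Map[String, String] =
    Map("list" -> q.source, s"${prefix}prop" -> q.fields.mkString("|")) ++
      q.sortBy.map(dir => s"${prefix}dir" -> dir) ++
      q.limit.map(n => s"${prefix}limit" -> n.toString)

  // SQL rendering against the database replica schema.
  def toSql(q: Query, table: String, columns: Seq[String], orderBy: String): String =
    s"select ${columns.mkString(", ")} from $table" +
      q.sortBy.map(_ => s" order by $orderBy asc").getOrElse("") +
      q.limit.map(n => s" limit $n").getOrElse("")

  println(toApiParams(q, prefix = "ac"))
  // Map(list -> allcategories, acprop -> size, acdir -> ascending, aclimit -> 10)
  println(toSql(q, "category", Seq("cat_title", "cat_pages", "cat_subcats", "cat_files"), "cat_title"))
  // select cat_title, cat_pages, cat_subcats, cat_files from category order by cat_title asc limit 10
}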
Choice of workflow frameworks
I evaluated two open-source frameworks for executing workflows described via configuration files:
- Kite Morphlines, developed as part of Cloudera Search
- Apache Oozie
Kite Morphlines has a small core library (264 KB) with 5 small dependencies[7], totaling 2 MB. The Apache Oozie core is a little bigger (1.7 MB), but it can hardly be called a core with its 42 dependencies[8], which are mostly unneeded here and total 60 MB. Oozie could be justified if the tool only targeted distributed large-scale data processing on Labs instances, but I want it to be practical on local machines too, without the need to install and configure hundreds of megabytes of dependencies.
 | Kite Morphlines Core | Apache Oozie Core |
---|---|---|
version | 1.1.0[9] | 4.2.0[10] |
released | Jun 15, 2015 | Sep 01, 2015 |
jar size | 264 KB | 1.7 MB |
number of dependencies | 5 | 42[11] |
size with dependencies | 2 MB | 60 MB |
syntax | JSON/HOCON, using the popular Typesafe Config library[12] | XML |
main developer | Cloudera | Hortonworks? |
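For illustration, here is what a Morphlines-style pipeline declared in HOCON and read with Typesafe Config could look like. The pipeline/commands schema is hypothetical; only the Typesafe Config calls themselves are the library's real API:

import com.typesafe.config.ConfigFactory
import scala.collection.JavaConverters._

object PipelineConfig extends App {
  val hocon =
    """
      |pipeline {
      |  commands = [
      |    { readCsv { path = "monuments.csv" } }
      |    { filter { field = "region", value = "Kyiv" } }
      |    { writeJson { path = "out.json" } }
      |  ]
      |}
    """.stripMargin

  val config = ConfigFactory.parseString(hocon)
  val commands = config.getConfigList("pipeline.commands").asScala

  // Each command object has a single key naming the command type.
  commands.foreach { cmd =>
    val name = cmd.root().keySet().asScala.head
    println(s"command: $name, settings: ${cmd.getConfig(name)}")
  }
}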
Choice of data processing language and framework
Regardless of how the computation description is stored and executed, it can be represented internally using the data processing classes already defined in some existing framework.
The #Technical details section above already described how data manipulation can be represented as trees (or DAGs), with examples of the SQL AST classes in Quill.
Apache Flink has a small core[13], so it's a good candidate for reusing its classes. It has 8 dependencies; the only large one is Hadoop, which is required only for serialization interfaces and so can be substituted with a jar file containing just those interfaces. For comparison, Spark core has 45 dependencies[14].
While Spark is more mature and more widely used, Flink has some advantages over it.
- Flink has its own off-heap memory manager[15], while Spark may require thorough JVM heap configuration.
- Flink has lower latency, as it is streaming-first, while Spark uses micro-batching.
- Returning to library size: full Flink without Hadoop (which is not required to run Flink[16]) is relatively small.
- Spark is about to release version 2.0[17], so it is going through an API unification process and is also borrowing some concepts, such as streaming and optimizations, from Flink.
- The Flink API is conceptually aligned with Apache Beam[18]. Apache Beam was open-sourced from Google Dataflow; it allows different engines such as Flink and Spark[19] and provides advantages in its programming model over Spark[20].
- Both can work with Hadoop HDFS, which WMF uses and plans to use more.
UI
I'll evaluate and try to use existing data exploration notebooks such as Jupyter, Zeppelin and Beaker[21].
I'm going to provide Vega output, so graphs can be reused on-wiki via the Graph extension.
Also, Elasticsearch is one of the data sinks in Flink, and Elasticsearch can have Grafana or Kibana dashboards. However, as Grafana 3.0 with its revamped plugin model has only just been released[22] and Kibana is about to have a major 5.0 release as well[23], I'm not going to spend development time on them and will wait for these new versions to mature.
All of them - the Beaker notebook cell model[24], the Graph extension's Vega grammar[25] and Elasticsearch queries - are represented using JSON, so they align with the suggested configuration model.
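As a sketch of the Vega output idea, query results could be serialized into the data section of a Vega spec, which the Graph extension can then render on-wiki. The spec skeleton here is abbreviated, not a complete graph definition:

object VegaOutput extends App {
  // Example query result: uploads per year (illustrative numbers).
  val counts = Seq("2013" -> 12000, "2014" -> 15500, "2015" -> 17200)

  val values = counts
    .map { case (year, uploads) => s"""{"x": "$year", "y": $uploads}""" }
    .mkString("[", ", ", "]")

  // A full spec would add "scales", "axes" and "marks" around this data table.
  println(s"""{"data": [{"name": "uploads", "values": $values}]}""")
}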
Project goals
- Most reports can be run automatically from the workflow configuration.
- Workflow configurations can be easily edited
Risks
- Effort-consuming participation in other projects. Wikimedia Ukraine now has 3 paid workers, the WLX jury tool will soon have an admin UI, and many needed tools are already implemented, so this risk is much lower now.
- Spending a lot of time on learning and investigation instead of delivering. I'm leaving out a lot of things like linked data, meta-programming and functional programming techniques, language integrated queries, dashboards and some other tool/framework integrations. They can drain time and produce the risk of delivering little value to end users.
- Focusing on infrastructure tasks and not implementing the end-user features. To avoid this, I'm going to structure the work as a series of MVPs around specific usages (like WLM, grants etc.).
- Java 8 support on Tool Labs[26]. It's estimated to arrive by the end of May[27] with Kubernetes on Debian jessie. Many libraries require Java 8. I have a branch that supports Java 7, but the next release of Apache Flink (1.1) is going to require Java 8.
- WMF uses Spark instead of Flink and uses Oozie more than Kite, while this project is going to use Flink and a Kite alternative. I can add support for Spark via Apache Beam, and both frameworks integrate at the Hadoop HDFS data source level.
Project plan
Activities
- investigate and implement various input and output formats and command configurations by implementing the existing ScalaWiki bots and reports that the Ukrainian Wikipedia and Wikimedia Ukraine need in their projects
- advertise among, and gather feedback and suggestions from, people who run or take part in similar projects in other Wikimedia projects, chapters and communities. Examples:
- Wikimania sessions (WLM tools session)
- Grant reports (midterm, final)
- conferences (CEE conference)
- international events (WLM)
- generate and implement the UI and backend for most MediaWiki API functionality with the assistance of API:Parameter information metadata
- implement automated tests, including integration tests
Budget
- Software development: 6 months × 4,240 USD (the median salary for a Senior Software Engineer in Kyiv is $4,000 for Scala and $3,900 for Java with 10+ years of experience, plus about 6% tax)
- Total budget: 25,440 USD
Community engagement
After implementing each end-user-facing feature, I will contact the relevant communities and showcase it to them.
Examples:
- WLM/WLE statistics for WLM/WLE organizers
- Article statistics for WikiProject users and thematic week/article contest organizers
- Grant reports for those who mention specific metrics in their grants, and for grant committees/WMF grants staff
Sustainability
- Existing tools were investigated and will be used where appropriate. They are quite mature and backed by their respective communities and companies:
- Apache Flink is used by 10+ companies, 5 major open source projects and 10 universities/research institutes[28]. It has 262 contributors, 144 of whom contributed during the last year (about 50% more than the previous year)[29]
- The Kite SDK is developed by Cloudera and used by several companies and open source projects such as Apache Sqoop, Apache Solr and Stratio Sparta. It has 50 contributors, 20 of them during the last year[30]
- Typesafe Config is developed by Typesafe and used in projects like Akka and the Play Framework; more than 600 artifacts on Maven Central depend on it[31]. It has 40 contributors, including 10 during the last year[32].
- The UI part of the tool, interacting with Flink, Kite and Typesafe Config, can have two modules: a generic one and a Wikimedia-projects one. The generic part can be suggested for use in the respective communities.
- The project will be highly modular, so that
- it can attract users/developers interested only in specific parts, such as the MediaWiki API client or the XML dump processor
- modules can be reused for other tools
- The project is planned to have high unit and integration test coverage, so that
- functionality has a higher chance of working properly, so
- more users can be attracted and retained
- less maintenance is needed
- it's faster to detect when and what is broken
- Users can create different data processing configurations for their needs.
- The need for the tool emerged from tool development for the Wikimedia Ukraine chapter, so at least one chapter is interested in using, developing and promoting it.
- Effort is planned to get this tool used by several chapters, communities and WMF departments.
- If the tool becomes successful, it can be supported by the WMF.
- By linking the documentation and tests for commands to the very good MediaWiki API documentation, it's possible to give a more grounded and detailed evaluation of various client libraries, based on the features implemented/documented/tested, than API:Client code/Evaluations.
- While Scala has been successful in implementation and in attracting Big Data developers in projects like Spark and Akka, and its popularity is somewhat rising, APIs for other languages like Java, Python and R (the same languages Spark supports) will have to be implemented to reach more developer communities.
Measures of success
- Number of tools/configurations implemented with the toolbox. This can be measured very differently; let it be 20 tools/configurations.
- Number of projects/communities using the tools. Let it be 20 WikiProjects/communities across 20 Wikimedia sites/20 Wikimedia chapters. I'd like to count active users, not the mere availability of statistics for some projects, as it's not difficult to generate statistics for everything and count that.
- Number of pageviews for the tools. I don't know the numbers for the current tools; let it be 1-10% of tools traffic.
- Global metric: number of active editors involved. The tool can offer to store user-specific configurations and count the number of authenticated user accounts. Let it be 1000.
As for strategic goals, I hope the project falls into the Stabilize Infrastructure and Encourage Innovation categories. There is also a chance that the tooling can help users achieve other strategic goals: articles can be directly or indirectly improved as a result, but that is hard to measure.
Get involved
[edit]Participants
Ilya, a software developer with about 10 years of experience on the Java platform, including more than 4 years with the Scala programming language.
My software projects for the Wikimedia movement include:
- Commons:WLX Jury Tool, used by half of the participating countries in Wiki Loves Earth 2015 and Wiki Loves Monuments 2015. Anthere suggested implementing an admin UI for it as an IEG grant, but I'm going to implement it without IEG support this April.
- Open source at https://github.com/intracer/wlxjury.
- ScalaWiki framework and bots
- Bots include
- Gathering various WLE/WLM statistics and updating monument lists
- Article contests and thematic weeks statistics that includes more advanced contribution metric than article size increase
- Copyvio detection
- GLAM uploads
- Framework features include
- modules for the MediaWiki API, SQL and XML data dumps
- parallel and asynchronous execution using Scala Futures
- convenient Scala wrappers for Sweble to extract data from wikitext tables and templates
- configuration in JSON or its HOCON superset via Typesafe Config
- Wikimedia Ukraine's finance tool
I worked on several projects that are similar to what I suggest in this grant request.
- A data access service for UBS. It could read data from various sources, join it together, store it in a distributed cache and provide it to various clients. The data sources, entities, their joins and the output were configurable. I made the initial engine implementation, then the second version's proof of concept (in Scala, for faster development) and implementation. I also implemented an automation testing framework for it in Groovy.
- Several projects for Telcordia's Granite Inventory system:
- made the initial implementation of a workflow engine project
- created an automated testing tool that generated tests for services from database metadata
- participated in the creation of a testing tool that extended the data classes of the Granite Inventory system so that graphs of test objects could be created in the database from Spring beans XML configuration
Community Notification
Message to the following mailing lists: wikilovesearth, wikilovesmonuments, WMCEE-l
Endorsements
Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).
- As Ilya pointed out, the Wikimedia environment lacks a higher level of abstraction above low-level APIs for developers to efficiently reuse our massive data sets. I hope this grant will help to fill the gap. Ilya has a record of clever software developments (WLX has become our tool of choice for WLE / WLM jury work in France), and I trust him to get it right on this project. --EdouardHue (talk) 22:46, 12 April 2016 (UTC)
- Community member: add your name and rationale here.
References
- ↑ Uploading cohort or running a large report fails
- ↑ Spark SQL
- ↑ Flink Table API and SQL
- ↑ view in API:Sandbox
- ↑ view in Quarry
- ↑ paste into ScalaKata to run
- ↑ mvnrepository Kite Morphlines Core
- ↑ mvnrepository Apache Oozie Core
- ↑ mvnrepository Kite Morphlines Core
- ↑ mvnrepository Apache Oozie Core
- ↑ including Hadoop, an ActiveMQ client, an Apache HttpClient version from Aug 2007, an embedded HSQLDB database from Jun 2008 (which has been superseded by H2), another embedded database (Derby), and many other things such as the Jetty web server, JGit (a git implementation in Java) and OpenJPA (a Java Persistence API implementation)
- ↑ Typesafe Config
- ↑ mvnrepository flink-core
- ↑ mvnrepository spark-core
- ↑ Off-heap Memory in Apache Flink
- ↑ FAQ: Do I have to install Apache Hadoop to use Flink?
- ↑ Spark 2.0 Technical Preview
- ↑ Apache Beam and Flink
- ↑ Apache Beam Capability Matrix: http://beam.incubator.apache.org/capability-matrix/
- ↑ Dataflow/Beam & Spark: A Programming Model Comparison
- ↑ Jupyter, Zeppelin, Beaker: The Rise of the Notebooks
- ↑ Grafana 3.0 Stable Released
- ↑ Kibana 5.0.0-alpha2 released
- ↑ Beaker Notebook Create an OutputDisplay
- ↑ http://vega.github.io/
- ↑ run the bot on k8s/jessie, which has an openjdk-8 backport.
- ↑ Tentatively, I'd like us to have rolled out the new webservice code before end of April, and our kubernetes install pretty solid by end of May.
- ↑ Powered by Flink
- ↑ OpenHub on Flink
- ↑ OpenHub on Kite SDK
- ↑ mvnrepository Artifacts using Typesafe Config
- ↑ OpenHub on Typesafe Config