
Grants:IEG/ScalaWiki data processing toolbox

From Meta, a Wikimedia project coordination wiki
status: not selected
ScalaWiki data processing toolbox
summary: Universal Wikipedia-related data processing toolbox
target: All that have communities interested in statistics and tooling
strategic priority: increasing reach
amount: 25440 USD
grantee: Ilya
contact: intracer(_AT_)gmail.com
this project needs: volunteer, advisor
created on: 14:36, 12 April 2016 (UTC)
round 1 2016



Project idea


What is the problem you're trying to solve?


There is no universal toolbox for defining and executing the processing of Wikipedia-related data.

There are some basic ways to retrieve data

While theoretically all of these can provide access to any data present on Wikimedia projects, in practice users need to write scripts or tools for data access.

A tool developer moving up the abstraction levels from these data sources will find a large number of client libraries for the MediaWiki API, tools for importing or processing dumps, and so on (incomplete list).

Many of them provide only a subset of the functionality of the lower-level API, lack documentation, and often do not have enough automated test coverage to protect against bugs. You may find that the features you would like to use or adapt are achievable only with a combination of many tools, all in different programming languages, styles and levels of quality.

On top of this, there are many different tools for end users. One example of such a topical list is the GLAM tools list. Many tools are no longer working or are unsupported. They usually provide fixed, hardcoded functionality, so there are many requests for more tools.

You cannot easily combine data from various tools: you usually need to prepare the input data, process it, and combine the output data. Many tools can hang when fed a large amount of data. If you measure some metric with several similar tools, you can get somewhat (or quite) different results without knowing which differences in the implementation of the metric caused them. Many existing reports and statistics cannot be rerun, checked, improved or adapted, because they were produced with a lot of manual work.

More detailed examples of problems


The tool has been developed by WMF since the end of 2012 and has been advertised and discussed at many conferences. It still cannot be used for most of Wikimedia Ukraine's needs.

  • You need to gather the lists of users yourself. Often, when you find relevant users with some tool, it is more convenient to continue calculating metrics with the same tool rather than feed them to another tool.
  • You calculate metrics based on a user list and time range (or a Commons category as an exception). This does not define the activities very well. It is mostly suitable for newcomers engaged in their first project. Otherwise users contribute not only to the specific project but outside of it as well: several contests or thematic weeks can run simultaneously, and users also contribute outside any project.
  • It does not compare thematic activity inside the project with activity outside the project.
  • It does not analyze user activity outside the project beyond giving the user a label (active/not active).
  • It hangs on large datasets[1].
  • It does not understand content persistence. A user who inserts a copyvio 3 times and gets reverted each time will be counted as contributing (copyvio size) * 3 bytes.
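
The content persistence problem in the last item can be illustrated with a minimal sketch. The `Revision` record and both metric functions are hypothetical simplifications introduced here for illustration; they are not taken from any existing tool, and real persistence analysis would need to track surviving text rather than a single `reverted` flag.

```scala
// Hypothetical, simplified revision record (an assumption, not a MediaWiki type).
final case class Revision(user: String, bytesAdded: Int, reverted: Boolean)

object ContributionMetrics {
  // Naive metric: every positive byte delta counts, even if it was reverted later.
  def naiveBytesAdded(revs: Seq[Revision], user: String): Int =
    revs.filter(r => r.user == user && r.bytesAdded > 0).map(_.bytesAdded).sum

  // Persistence-aware metric: only additions that were not reverted are counted.
  def persistedBytesAdded(revs: Seq[Revision], user: String): Int =
    revs.filter(r => r.user == user && r.bytesAdded > 0 && !r.reverted).map(_.bytesAdded).sum
}
```

With three reverted insertions of 5000 bytes each, the naive metric reports 15000 bytes for the user, while the persistence-aware one reports 0.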
Existing reports

Examples: Grants:Evaluation/Evaluation reports, mw:Wikimedia Research/Research and Data, Grant reports.

These are created manually (with the help of some tools, but not automatically), and neither the data nor the algorithms are published, so one cannot easily build upon them to investigate, improve, apply to one's own needs, or repeat. Such reports provide only a very limited set of ratios, medians and summaries predefined by the report authors, and readers cannot improve the algorithms or the ways of presenting and investigating the data.

It looks like a huge effort goes into each of these reports, while it would be better to write a tool that gathers the statistics, configure it, and reuse it. Other statistics are probably more interesting, such as what percentage of monuments are listed, have articles, have sources, how good the articles are, whether they are illustrated, in what regions they are, how far they are from populated cities, and what architectural styles they have.

What is your solution?


Provide a toolbox that will support querying most of the available data sources.

  • Mediawiki API module will
  • There should be similar modules for SQL queries and XML dumps processing
  • There should be at least basic Wikidata support.
  • Each execution command or step should be able to
    • be executed independently
    • be combined with other commands
    • read input from various sources (external formats and the output of other commands)
    • produce output to various sinks (external formats and the input of other commands)
  • Supported external formats should include CSV, JSON, XML, HTML tables, structured wikitext, and Extension:Graph source data
  • Toolbox should be configurable
    • Execution commands should be configurable from configuration files or web UI.
    • It should be possible to restrict configuration options available to end user, so specific configurations can be used to achieve specific goals.
    • Users should be able to create, adapt and share commands and configurations
  • There will be automated unit and integration tests to make sure that the implementation works correctly. I'll use the Travis configuration from Semantic MediaWiki, which already provides a setup for integration testing against different MediaWiki versions.
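
The requirement that commands run independently, combine with each other, and read from sources and write to sinks could look roughly like the sketch below. It is only an illustration under stated assumptions: `Step`, `readCsv`, `limit` and `toWikiTable` are hypothetical names, the CSV parsing is deliberately naive (no quoting support), and none of this is the actual ScalaWiki API.

```scala
// Hypothetical composable processing step; all names are illustrative assumptions.
final case class Step[A, B](name: String, run: A => B) {
  // Combine with another step: the output of this step feeds the next one.
  def andThen[C](next: Step[B, C]): Step[A, C] =
    Step[A, C](s"$name | ${next.name}", run.andThen(next.run))
}

object Steps {
  type Rows = Seq[Seq[String]]

  // Source: parse simple comma-separated lines (no quoting support) into rows.
  val readCsv: Step[String, Rows] =
    Step[String, Rows]("readCsv", (text: String) =>
      text.split("\n").toSeq.filter(_.nonEmpty).map(line => line.split(",").toSeq.map(_.trim)))

  // Transformation: keep only the first n rows.
  def limit(n: Int): Step[Rows, Rows] =
    Step[Rows, Rows](s"limit($n)", (rows: Rows) => rows.take(n))

  // Sink: render rows as a wikitext table.
  val toWikiTable: Step[Rows, String] =
    Step[Rows, String]("toWikiTable", (rows: Rows) =>
      rows.map(r => "|-\n| " + r.mkString(" || "))
        .mkString("{| class=\"wikitable\"\n", "\n", "\n|}"))
}
```

Each step can be run on its own through `run`, or chained, for example `Steps.readCsv.andThen(Steps.limit(1)).andThen(Steps.toWikiTable)`. Keeping steps as plain values is one possible design that would let them be listed in a configuration file.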

Technical details

Clarification on how computations are presented as trees (or DAGs)

More clarification on what I mean by query "configuration".

MediaWiki API request parameters are just a collection of key = value pairs. In an HTTP GET request this is ?key1=value1&key2=value2&key3=value3...

Not all keys make sense together: many keys are only available when another key is present. So the parameters form a tree, where child nodes provide more detail for their parent nodes. Tree data structures can be represented in JSON.

{
  "action": {
    "module": "query",
    "props": {
      "module": "revisions"
    }
  }
}
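
A minimal sketch of how such a tree could be flattened back into the key = value pairs of a GET request is shown below. `ParamNode` and `ParamTree` are hypothetical names, and the tree shape (one pair per node, dependent parameters as children) is an assumption that differs slightly from the JSON layout above. The parameter names `action`, `prop` and `rvprop` are real MediaWiki API parameters.

```scala
import java.net.URLEncoder

// Hypothetical tree node: `param` and `value` form one key = value pair;
// `children` hold parameters that only make sense when this pair is present.
final case class ParamNode(param: String, value: String, children: Seq[ParamNode] = Nil)

object ParamTree {
  // Depth-first flattening of the tree into an ordered list of pairs.
  def flatten(node: ParamNode): Seq[(String, String)] =
    (node.param -> node.value) +: node.children.flatMap(child => flatten(child))

  // Render the pairs as an HTTP GET query string, URL-encoding keys and values.
  def toQueryString(node: ParamNode): String =
    flatten(node)
      .map { case (k, v) => URLEncoder.encode(k, "UTF-8") + "=" + URLEncoder.encode(v, "UTF-8") }
      .mkString("?", "&", "")
}
```

For example, a tree with `action=query` at the root, `prop=revisions` below it and `rvprop=ids|user` below that flattens to three pairs in that order, with the pipe character percent-encoded in the query string.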

SQL scripts can be represented as abstract syntax trees (ASTs) as well. The notion of Language Integrated Query (the article is about the best-known implementation, by Microsoft, but the notion is very general) shows the similarity between SQL scripts and functional programming. The AST classes from the Quill LINQ library show how easily most of the SQL syntax is mapped: most of the AST classes and operators.

Data processing frameworks such as Spark or Flink also provide APIs that are very similar to Scala collections and can process various data sources. Both Spark[2] and Flink[3] can run SQL scripts on their data. Below is an example of identical API parameters, an SQL script, and Scala code for Scala collections and the Flink Dataset API.

All categories

API [4]:
  list = allcategories
  acprop = size
  acdir = ascending
  aclimit = 10

SQL [5]:
  select cat_title, cat_pages, cat_subcats, cat_files
  from category
  order by cat_title asc
  limit 10

Scala collections [6]:
  case class Category(id: Long, title: String, pages: Long, subcats: Long, files: Long)
  val data = Seq(Category(1, "Cat1", 1, 2, 3))
  data.sortBy(_.title)
    .map(c => (c.title, c.pages, c.subcats, c.files))
    .take(10)

Scala Flink Dataset API:
  ExecutionEnvironment.createLocalEnvironment()
    .fromCollection(data)
    .sortPartition(_.title, Order.ASCENDING)
    .map(c => (c.title, c.pages, c.subcats, c.files))
    .first(10)

Here the correspondence is broken down by element:

Description | API | SQL | Scala collections | Scala Flink Dataset API
source | list = allcategories | from category | Seq(Category(1, "Cat1", 1, 2, 3)) | env.fromCollection(Seq(Category(1, "Cat1", 1, 2, 3)))
fields | acprop = size | select cat_title, cat_pages, cat_subcats, cat_files | .map(c => (c.title, c.pages, c.subcats, c.files)) | .map(c => (c.title, c.pages, c.subcats, c.files))
sorting | acdir = ascending | order by cat_title asc | .sortBy(_.title) | .sortPartition(_.title, Order.ASCENDING)
limiting | aclimit = 10 | limit 10 | .take(10) | .first(10)
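
The element-by-element correspondence suggests that one query description could be rendered to several backends. The sketch below is a hypothetical illustration only: `CategoryQuery` and its methods are names invented here, it covers just the "all categories" example, and it uses plain Scala collections instead of Flink to stay self-contained.

```scala
// In-memory model; field names follow the Category example in the proposal.
final case class Category(id: Long, title: String, pages: Long, subcats: Long, files: Long)

// Hypothetical query description for the "all categories" example.
final case class CategoryQuery(ascending: Boolean = true, limit: Int = 10) {
  // MediaWiki API parameters (list=allcategories).
  def toApiParams: Seq[(String, String)] = Seq(
    "list" -> "allcategories",
    "acprop" -> "size",
    "acdir" -> (if (ascending) "ascending" else "descending"),
    "aclimit" -> limit.toString
  )

  // Equivalent SQL against the `category` table.
  def toSql: String = {
    val dir = if (ascending) "asc" else "desc"
    s"select cat_title, cat_pages, cat_subcats, cat_files from category order by cat_title $dir limit $limit"
  }

  // The same query evaluated on a Scala collection.
  def run(data: Seq[Category]): Seq[(String, Long, Long, Long)] = {
    val sorted = data.sortBy(_.title)
    (if (ascending) sorted else sorted.reverse)
      .map(c => (c.title, c.pages, c.subcats, c.files))
      .take(limit)
  }
}
```

A description such as `CategoryQuery(limit = 2)` then yields API parameters, an SQL string, or an in-memory result from the same two settings.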
Choice of workflow frameworks

I evaluated two open-source frameworks for executing workflows described via configuration files: Kite Morphlines and Apache Oozie.

Kite Morphlines has a small core library (264 KB) with 5 small dependencies[7] totaling 2 MB. Apache Oozie core is a little bigger (1.7 MB), but it can hardly be called a core, with its 42 dependencies[8]. All of its largely unneeded dependencies total 60 MB. Oozie could be suitable if the toolbox targeted only distributed large-scale data processing on Labs instances, but I want it to be practical on local machines too, without the need to install and configure many hundreds of megabytes of dependencies.

Property | Kite Morphlines Core | Apache Oozie Core
version | 1.1.0[9] | 4.2.0[10]
released | Jun 15, 2015 | Sep 01, 2015
jar size | 264 KB | 1.7 MB
number of dependencies | 5 | 42[11]
size with dependencies | 2 MB | 60 MB
syntax | JSON/HOCON, using popular Typesafe Config library[12] | XML
main developer | Cloudera | Hortonworks?
Choice of data processing language and framework

Regardless of how the computation description is stored and executed, it can be represented internally using data processing classes that are already defined in an existing framework.

The section "Clarification on how computations are presented as trees (or DAGs)" already described how data manipulation can be represented, with examples of SQL AST classes in Quill.

Apache Flink has a small core[13], so its classes are a good candidate. It has 8 dependencies; the only large one is Hadoop, but it is required only for serialization interfaces, so it can be substituted with a jar file that contains just these interfaces. For comparison, Spark core has 45 dependencies[14].

While Spark is more mature and more widely used, Flink has some advantages over it.

  • Flink has its own off-heap memory manager[15], while Spark may require thorough JVM heap configuration.
  • Flink has lower latency, as it is streaming-first, while Spark uses micro-batching.
  • Regarding library size again, full Flink without Hadoop (which is not required to run Flink[16]) is relatively small.
  • Spark is about to release version 2.0[17], so it is going through an API unification process and is also borrowing some concepts, such as streaming and optimizations, from Flink.
  • The Flink API is aligned in its concepts with Apache Beam[18]. Apache Beam is the open-sourced Google Dataflow model; it supports different engines such as Flink and Spark[19] and has advantages in its programming model over Spark[20].
  • Both can work with Hadoop HDFS, which WMF uses and plans to use more.
UI

I'll evaluate and try to use existing data exploration notebooks such as Jupyter, Zeppelin and Beaker[21].

I'm going to provide Vega output, so graphs can be reused on-wiki via the Graph extension.

ElasticSearch is also one of the data sinks in Flink, and ElasticSearch can have Grafana or Kibana dashboards. However, as Grafana 3.0 with its revamped plugin model has only just been released[22] and Kibana is going to have a major 5.0 release as well[23], I'm not going to spend development time on them and will wait for these new versions to mature.

All of them (the Beaker notebook cell model[24], the Graph extension's Vega grammar[25] and ElasticSearch queries) are represented in JSON, so they are aligned with the suggested configuration model.

Project goals

  • Most reports can be run automatically from the workflow configuration.
  • Workflow configurations can be easily edited

Risks

  • Effort-consuming participation in other projects. Wikimedia Ukraine now has 3 paid workers, the WLX jury tool will soon have an admin UI, and many needed tools are already implemented, so this risk is much lower now.
  • Spending a lot of time on learning and investigation and not delivering. I'm leaving out a lot of things such as linked data, meta-programming or functional programming techniques, language integrated queries, dashboards, and integration with some other tools and frameworks. They can drain time and carry the risk of delivering little value to end users.
  • Focusing on infrastructure tasks and not implementing the end-user features. To avoid this, I'm going to organize the work as a series of MVPs around specific usages (like WLM, Grants etc.).
  • Java 8 support on ToolLabs[26]. It is estimated for the end of May[27], with k8s on Debian jessie. Many libraries require Java 8. I have a branch that supports Java 7, but the next release of Apache Flink (1.1) is going to require Java 8.
  • WMF uses Spark instead of Flink and uses Oozie more than Kite, while this project is going to use the Flink and Kite alternatives. I can add support for Spark via Apache Beam, and both frameworks integrate at the Hadoop HDFS data source level.

Project plan


Activities

  • Investigate and implement various input and output formats and command configurations by implementing the existing ScalaWiki bots and reports that Ukrainian Wikipedia and Wikimedia Ukraine need in their projects.
  • Advertise the toolbox among people who run or take part in similar projects in other Wikimedia projects, chapters and communities, and collect their feedback and suggestions. Examples:
    • Wikimania sessions (WLM tools session)
    • Grant reports (midterm, final)
    • conferences (CEE conference)
    • international events (WLM)
  • Generate and implement the UI and backend for most MediaWiki API functionality with the assistance of API:Parameter information metadata.
  • Implement automated tests, including integration tests.
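
As a rough illustration of how parameter metadata could drive a generated backend, the sketch below validates a user configuration against parameter descriptions. `ParamInfo` is a hypothetical, heavily simplified stand-in for what the API's parameter information returns (the real output has more fields), and `ConfigValidator` is a name invented for this example.

```scala
// Hypothetical, simplified parameter description (an assumption for illustration;
// real parameter information from the API contains more fields).
final case class ParamInfo(name: String, allowed: Set[String] = Set.empty, required: Boolean = false)

object ConfigValidator {
  // Returns human-readable problems; an empty result means the configuration is acceptable.
  def validate(meta: Seq[ParamInfo], config: Map[String, String]): Seq[String] = {
    val known = meta.map(_.name).toSet
    val unknown = config.keys.toSeq.sorted
      .filterNot(k => known.contains(k))
      .map(k => s"unknown parameter: $k")
    val missing = meta
      .filter(p => p.required && !config.contains(p.name))
      .map(p => s"missing parameter: ${p.name}")
    val invalid = meta.flatMap { p =>
      config.get(p.name).toSeq
        .filter(v => p.allowed.nonEmpty && !p.allowed.contains(v))
        .map(v => s"invalid value for ${p.name}: $v")
    }
    unknown ++ missing ++ invalid
  }
}
```

The same descriptions could also feed a generated form, for instance by showing a drop-down for parameters with a non-empty `allowed` set.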

Budget

  • Software development: 6 months * 4240 USD (the median salary for a Senior Software Engineer with 10+ years of experience in Kyiv is 4000 USD for Scala and 3900 USD for Java, plus about 6% tax)
  • Total Budget: 25440 USD

Community engagement


After implementing each end-user-facing feature, I will contact the relevant communities and showcase it to them.

Examples:

  • WLM/WLE statistics for WLM/WLE organizers
  • Article statistics for Wikiprojects users and thematic weeks/article contests organizers
  • Grant reports for those who mention specific metrics in their grants, Grant committees/WMF Grant staff

Sustainability

  • Existing tools were investigated and will be used where appropriate. They are quite mature and backed by their respective communities and companies.
    • Apache Flink is used by 10+ companies, 5 major open source projects and 10 universities/research institutes[28]. It has 262 contributors, 144 of whom contributed during the last year (about 50% higher than the previous year)[29].
    • Kite SDK is developed by Cloudera and used by several companies and open source projects such as Apache Sqoop, Apache Solr and Stratio Sparta. It has 50 contributors, 20 of whom contributed during the last year[30].
    • Typesafe Config is developed by Typesafe, is used in projects like Akka and Play Framework, and has more than 600 artifacts that depend on it on Maven Central[31]. It has 40 contributors, including 10 during the last year[32].
  • The UI part of the tool that interacts with Flink, Kite and Typesafe Config can have two modules: a generic one and one related to Wikimedia projects. The generic part can be offered to the respective communities.
  • The project will be highly modular, so that
    • it can attract users/developers interested only in specific parts, such as the MediaWiki API client or the XML dumps processor
    • modules can be reused for other tools
  • The project is planned to have high unit and integration test coverage, so that
    • functionality has a higher chance of working properly, so
      • more users can be attracted and retained
      • less maintenance is needed
    • it is faster to detect when and what is broken
  • Users can create different data processing configurations for their needs.
  • The need for the tool emerged from tool development for the Wikimedia Ukraine chapter, so at least one chapter is interested in using, developing and promoting it.
  • An effort is planned to get this tool used by several chapters, communities and WMF departments.
  • If the tool becomes successful, it can be supported by the WMF.
  • By linking documentation and tests for commands to the very good MediaWiki API documentation, it is possible to give a more grounded and detailed evaluation of various client libraries in terms of features implemented/documented/tested than API:Client code/Evaluations does.
  • While Scala has been successful in implementing projects like Spark and Akka and in attracting Big Data developers to them, and its popularity is rising somewhat, APIs for other languages such as Java, Python and R (the same languages Spark supports) will have to be implemented to reach more developer communities.

Measures of success

  • Number of tools/configurations implemented with the toolbox. It can be measured in very different ways. Let it be 20 tools/configurations.
  • Number of projects/communities using the tools. Let it be 20 WikiProjects/communities of 20 Wikimedia sites/20 Wikimedia chapters. I'd like to count active users rather than the availability of statistics for some projects, as it is not difficult to generate statistics for everything and count that.
  • Number of pageviews for the tools. I don't know the numbers for current tools. Let it be 1-10% of tools traffic.
  • Global metric: number of active editors involved. The tool can offer to store user-specific configurations and count the number of authenticated user accounts. Let it be 1000.

As for strategic goals I hope it can fall into Stabilize Infrastructure and Encourage Innovation categories. Also there is a chance that tooling can help users achieve other strategic goals. Articles can be directly or indirectly improved as the result, but it's hard to measure.

Get involved


Participants


Ilya, a software developer with about 10 years of experience on the Java platform, including more than 4 years with the Scala programming language.

My software projects for Wikimedia movement include

I worked on several projects that are similar to what I suggest in this grant request.

  • Data access service for UBS. It could read data from various data sources, join it together, store it in a distributed cache and provide it to various clients. The data sources, entities, their joins and the output were configurable. I made the initial engine implementation, then the proof of concept for the second version (in Scala, for faster development) and its implementation. I also implemented an automated testing framework for it in Groovy.
  • Several projects for Telcordia's Granite Inventory system.
    • Made the initial implementation of a workflow engine project.
    • Created an automated testing tool that generated tests for services from database metadata.
    • Participated in the creation of a testing tool that extended the data classes of the Granite Inventory system so that graphs of test objects could be created in the database from Spring beans XML configuration.

Community Notification


Message to the following mailing lists: wikilovesearth, wikilovesmonuments, WMCEE-l

Endorsements


Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).

  • As Ilya pointed out, the Wikimedia environment lacks a higher level of abstraction above low-level APIs for developers to efficiently reuse our massive data sets. I wish this grant will help to fill the gap. Ilya has a record of clever software developments (WLX has become our tool of choice for WLE / WLM jury in France), I trust him to get it right on this project. --EdouardHue (talk) 22:46, 12 April 2016 (UTC)

References

  1. Uploading cohort or running a large report fails
  2. Spark SQL
  3. Flink Table API and SQL
  4. view in API:Sandbox
  5. view in Quarry
  6. paste into ScalaKata to run
  7. mvnrepository Kite Morphlines Core
  8. mvnrepository Apache Oozie Core
  9. mvnrepository Kite Morphlines Core
  10. mvnrepository Apache Oozie Core
  11. including Hadoop, ActiveMQ client, Apache HttpClient version from Aug 2007, Embedded HSQLDB database from Jun 2008 which is superseded by H2, another embedded Derby database and many other things such as Jetty Web server, JGit - git implementation in Java and OpenJPA Java Persistence API implementation
  12. Typesafe Config
  13. mvnrepository flink-core
  14. mvnrepository spark-core
  15. Off-heap Memory in Apache Flink
  16. FAQ: Do I have to install Apache Hadoop to use Flink?
  17. Spark 2.0 Technical Preview
  18. Apache Beam and Flink
  19. Apache Beam Capability Matrix, http://beam.incubator.apache.org/capability-matrix/
  20. Dataflow/Beam & Spark: A Programming Model Comparison
  21. Jupyter, Zeppelin, Beaker: The Rise of the Notebooks
  22. Grafana 3.0 Stable Released
  23. Kibana 5.0.0-alpha2 released
  24. Beaker Notebook Create an OutputDisplay
  25. http://vega.github.io/
  26. run the bot on k8s/jessie, which has an openjdk-8 backport.
  27. Tentatively, I'd like us to have rolled out the new webservice code before end of April, and our kubernetes install pretty solid by end of May.
  28. Powered by Flink
  29. OpenHub on Flink
  30. OpenHub on Kite SDK
  31. mvnrepository Artifacts using Typesafe Config
  32. OpenHub on Typesafe Config