Grants talk:Project/ContentMine/ScienceSource
Wikidata community notification
It seems to be missing, at least for this proposal. --Jura1 (talk) 10:01, 3 February 2018 (UTC)
- Coming. I will do notifications tomorrow, over a wide range of WikiProjects on English Wikipedia, and also of course on Wikidata. Charles Matthews (talk) 09:58, 4 February 2018 (UTC)
Comments
This new proposal looks like a continuation of Grants:Project/ContentMine/WikiFactMine, doesn't it? WikiFactMine also involved mining biomedical literature, as I remember. Can you compare these two projects? What is similar? What is different? What are the results of the WikiFactMine project, and how are you going to build on them? Ruslik (talk) 12:26, 7 February 2018 (UTC)
- Grants:Project/ContentMine/WikiFactMine/Final is now posted, and it should answer some of your questions. To try to put it simply: the fact-mining mechanism would remain quite similar. No final implementation decisions have been taken, but the new project can be understood as changing the inputs and outputs of WikiFactMine.
- The inputs would be much more selective. The outputs would be handled in a different way, with human contributions held in a new kind of data structure. Behind the technical choices are ideas coming from SPARQL queries, such as "literature search" turned into queries run on Wikidata. The stated object is to write some code to make decisions about referencing, and that might be quite like SPARQL, too. Charles Matthews (talk) 13:50, 7 February 2018 (UTC)
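To make the idea of "literature search turned into queries" concrete, here is a minimal sketch, illustrative only and not project code, of a search for review articles on a given disease run against the public Wikidata SPARQL endpoint. P31 ("instance of"), P921 ("main subject"), Q7318358 ("review article") and Q12156 ("malaria") are real Wikidata identifiers; everything else is assumption.
<syntaxhighlight lang="python">
import requests

# A "literature search" expressed as a SPARQL query on Wikidata:
# find review articles whose main subject is malaria.
QUERY = """
SELECT ?article ?articleLabel WHERE {
  ?article wdt:P31 wd:Q7318358 .    # instance of: review article
  ?article wdt:P921 wd:Q12156 .     # main subject: malaria
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "ScienceSource-sketch/0.1 (illustration)"},
)
for row in response.json()["results"]["bindings"]:
    print(row["article"]["value"], row["articleLabel"]["value"])
</syntaxhighlight>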
The emphasis is also on the full text of the science/medical sources. WikiFactMine collected bibliography; ScienceSource will analyse the full text and annotate it. This is coupled with human curation. The output will be 30,000 top-quality biomedical articles. Petermr (talk) 08:37, 8 February 2018 (UTC)
A very interesting article [1] argues that Wikipedia is now the "go to" place for science articles. This is exactly what ScienceSource will provide: high-quality, annotated, curated biomedical science. There is not really any other site that provides this (perhaps F1000, but that is smaller and anyway could be subsumed in ScienceSource). Petermr (talk) 08:41, 8 February 2018 (UTC)
Questions about applicant and alternative funding sources
Hi
Three questions; these are not questions about the value of the work that is proposed:
- Is this a grant application by the for-profit limited company ContentMine Limited, with the directors of the company applying as participants?
- The scope of the project is to "develop machine-assisted human-reviewing software based on MEDRS guidelines"; is this a commercial product that ContentMine Ltd will sell services for? Is this a grant to fund research and development for a private company? If so, why isn't this being funded through a business development loan?
- Given the Wikimedia Foundation's limited grant funds and the significant grant size ($360 less than the $100,000 limit on project grants by WMF), have the applicants explored or applied for alternative sources of grant funding for the same work, e.g. through grants at the University of Cambridge (a university with a £6.3 billion endowment) or other larger grant sources?
Thanks
John Cummings (talk) 12:12, 8 February 2018 (UTC)
Short answers: the grant is for ContentMine. The resulting software and content would all be open, rather than a commercial product. No, the project is tailored to Wikimedia needs, and has been developed taking into account suggestions and feedback from Wikimedia communities, as well as individuals, and WMF staff.
The points of principle deserve longer answers, of course. Here are further points:
- ContentMine lost money on the first WikiFactMine project (see the final report link above). It defines itself as not-for-profit, which in UK terms is perhaps too loose for the lawyers, but in practical terms it is run to break even. It is a start-up, but is definitely not backed by any venture capital.
- ContentMine software is released under an open license and is available on GitHub. So there is no commercial product there.
- ContentMine is actively involved with finding work from the academic sector. It has a partner in UCL, for example. Any academic institution has its agenda, and obviously the company would be happy to work with any such institution that was interested in developing Wikidata and medical content on Wikipedia. If that has not happened so far, it is not for want of trying. The final report contains a list (actually not exhaustive) of contacts.
I'd be glad to amplify further, within my own knowledge of the company and funding efforts I have been involved in.
Charles Matthews (talk) 14:22, 8 February 2018 (UTC)
(adding to Charles Matthews)
(1) ContentMine has adopted a special "OpenLock" clause created by the Shuttleworth Foundation which prevents the company from being acquired and from doing work outside its mission statement. This is the closest in the UK that we can get to declaring as "non-profit" other than registering as a charity.
Petermr (talk) 15:07, 8 February 2018 (UTC)
- Thanks very much for the answers @Petermr: and @Charles Matthews:
- Could you pick up on the third question as well, about alternative grant sources? Given the Wikimedia Foundation's limited grant funds and the significant grant size ($360 less than the $100,000 limit on project grants by WMF), have the applicants explored or applied for alternative sources of grant funding for the same work, e.g. through grants at the University of Cambridge (a university with a £6.3 billion endowment) or other larger grant sources?
- Thanks again
Since we work, mostly, in central Cambridge, it would perhaps seem natural to look to the University for funds. On the other hand, universities are not usually grant-giving bodies. University departments do sometimes outsource work, and so that context is the rather different one of seeking contract work (which is the deal with UCL). Google for "grant from the University of Cambridge" and you'll not find much.
On the other hand, the WMF has funded this type of research before, in particular a couple of years ago for a group at Stanford University. That really is more normal, in that university departments look for grants from outside bodies. The WMF itself looks for grants, for example from the Sloan Foundation which has given several times. ContentMine has looked into this area, and for Sloan and similar organisations there are usually minimum sums (seven figures in USD, say). I don't feel at liberty to go into detail, but applications have been made by the company in this context of foundations and larger sums.
It's horses for courses, really. The WMF has a grants system with several tiers, and this application is at the same level as the one made at the end of 2016. It translates as saying the work involved would require two people for a year, or equivalent. What happened last time round is close enough, surely, for a like-for-like comparison to make sense. Grants mean detailed reporting. I should mention that currently 1 GBP is around 1.40 USD, meaning that the money wouldn't go as far as in 2016.
I hope this clarifies the context. Peter mentioned the Shuttleworth Foundation above, and it was one of their grants that saw ContentMine set up in the first place. Applying for grants is one of the things the company is expected to do, really. Software development outside the commercial mainstream is not fantastically easy to get done (as Wikimedia UK has found, for instance, about which I'm quite well informed). The WMF is forward-looking in this area, wants value for money, runs quite a tight ship as far as process is concerned, and I can see both sides of the story here. Charles Matthews (talk) 19:22, 8 February 2018 (UTC)
- Hi @Charles Matthews:, thanks very much for the explanation. I'm sorry for not being clear, I guess I'm looking for clarification on the business model of ContentMine and if this grant (which I assume is a significant portion of their overall project grants fund for this cycle) is funding a for-profit business to develop their commercial product. John Cummings (talk) 19:02, 11 February 2018 (UTC)
What was said above amounts to this: "for profit" in ContentMine's terms should probably be understood in this fashion, which is a "business model" if you will: (a) any profit on a contract it undertakes it retains to keep the company going; (b) there are costs, such as having shared space to work in, a manager, website design and maintenance, and conference attendance, which would be charged against the profits of such contracts; (c) there are no shareholders or backers withdrawing profits. You might note that the final report mechanism of WMF grants asks about unspent money, so that such grants, as opposed to contracts, are not a source of profit. Peter has pointed to the "open lock" applied to the company. To amplify the history: ContentMine was originally a grant-funded project, writing software and doing advocacy. It morphed into a company, not so much as a business strategy, but so as to be able to bid for bread-and-butter contract work. I read in The Guardian just recently that there are thought to be 4,700 companies in Cambridge, in the tech and life sciences sector. While ContentMine may not be quite unique, it is unlike most of those.
When you speak of "commercial product", it is still not clear to me what you mean. In supplying software services, the outcome of contract A may be expertise that allows the company to apply for contract B; the outcome of grant C might also allow the company to apply for contract B. In July last year I was writing SPARQL queries for ContentMine, which are now on Wikidata. I'm not a developer, in any sense, but that experience would allow me to contribute SPARQL to other projects. Charles Matthews (talk) 06:19, 12 February 2018 (UTC)
Branding, communication, using simpler introductions
I am writing to ask if you can talk through your plans to make this project and related projects more easily understandable for naive audiences who might understand science but not wiki, understand only Wikipedia, understand only Wikimedia projects and Wikidata, or understand neither science nor wiki. I do not necessarily expect you to succeed at this, but can you either show published attempts to briefly explain your project to these demographics or commit to do so if funded? I appreciate the attempts you have made on this grant request and wonder if you can go even further if this project is funded.
Also, can you commit to explain and document in an introductory way all projects which ContentMine does with Wikimedia funding? I see that just a few days ago you published a report at Grants:Project/ContentMine/WikiFactMine/Final for a similar grant of about $100k. It is not clear to me what team overlap there is between that project and this one. How would ContentMine feel about applying to be a Wikimedia user group then following the documentation and reporting procedures outlined for that sort of entity? I cannot say that the user group model will be a perfect fit, but as a Wikimedian, the user group model of engagement is familiar to me. If your team registered in this way then I could rely on other checks on your progress, like the extent to which you keep registration up to date, that you maintain a single place for communication where if any troubles arise then everyone would be sure to be able to find or report them, and that there is continuity in the management of your projects from year to year. I raise this issue because the relationship of ContentMine to this project versus any other project is not clear to me, and if there is a series of programs with relationships to each other, then the wiki user group model is a familiar way for me or other Wikimedians to understand what is happening.
To the extent that this project is related to the WikiFactMine project, I appreciate the newsletter reporting and outreach at en:User:Charles Matthews/Facto Post. This kind of updating is beyond the norm and I think it sets a new Wikimedia community standard for projects giving updates to the general Wikimedia community about their progress.
I know that outreach to completely uninformed audiences is outside the norm in the sciences, but in the Wikimedia ecosystem, all kinds of people examine projects totally outside their field of understanding, and it is very helpful when especially larger and better funded projects anticipate that all sorts of people will stumble upon their project pages and require simple, short explanations to leave with any understanding at all. I appreciate whatever you can do to establish and curate landing pages with the most simple introductions that you can present. We all depend on support from the whole community, and at a minimum, the whole community should have the simplest understanding of even rather technical projects and their significance. Are you able to make any commitments to produce a few sentences of simple introduction in a few more key places? Blue Rasberry (talk) 16:28, 10 February 2018 (UTC)
- To deal with your points, but not in order: for "team overlap", you are asking about personnel issues for ScienceSource (SS), and I would say that is outside the scope of the normal discussion of grants here.
- For comparison with WikiFactMine (WFM1): there are two pipelines to compare, and what you'd expect is that the earlier WikiFactMine pipeline has been reviewed, some decisions taken about what can be improved, things added at the beginning, and things added at the end. That is where we have got to, in fact, in ideas developed from last August. ScienceSource is certainly the sequel to the WikiFactMine project of 2016-7.
- That said, the best way to understand the relationship might well be to read just Grants:Project/ContentMine/WikiFactMine/Final#What didn't work. Dealing with the fourth of those bullets brings us to w:WP:MEDRS.
- The endgame of WFM1, from my point of view, was mostly a discussion of semi-automated routes into Wikidata (on-topic for the first bullet), in particular gamification and the Primary Sources Tool. The project actually did neither, and the annotations idea works out as a third choice, and a little unusual one, from a Wikimedia slant. But annotations in the sense of https://web.hypothes.is/ are flourishing; see also https://europepmc.org/AnnotationsApi. Our own ideas are a bit different, in terms of seeing annotations as a data structure (machine reading as well as writing). But if I start here on implementation, I'll never get done. It is significant for us that QuickStatements now has a "CSV import"; but there are other "export from annotations and upload to Wikidata" routes (a sketch of one such route follows this comment).
- So, there is a schematic of WFM1 in the final report, and a proposed schematic for SS in the grant application. The subpages of d:Wikidata:WikiFactMine were an extended exposition of WFM1 (including dictionaries, which would be used in much the same way in SS, and which are much slicker these days, which is bullet two from the list). There is actually d:Wikidata:WikiFactMine/ScienceSource, which is a completely unofficial expository draft of mine, of no standing in the proposal process. You might find it of interest.
- The grant proposal is supposed to deal with the "what is this for?" and "how much?" issues in due form. Concision is required. It is a Procrustean bed, for those good reasons.
- For outreach, I'd see a plan like that of 2017. It makes sense to me to develop a suite of pages on Wikidata. There would also be the ScienceSource wiki. There would be introductions in those places and others. As WiR for WikiFactMine, I was actually trying to do a 12 month job in five months, so the timescale was compressed. Much was done face-to-face.
- You make fair points. What is proposed is both ambitious and complex, in terms of both code and human inputs, with some serious data modelling in there. The "problem context" is going to make sense to medical editors. I hope I have begun to address your post. I can give more detailed thinking on specifics. Charles Matthews (talk) 07:11, 11 February 2018 (UTC)
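On the QuickStatements point in the comment above: a minimal sketch, illustrative only, of one "export from annotations and upload to Wikidata" route. It renders a human-validated annotation as a QuickStatements V1 command (tab-separated: item, property, value, then source pairs). The annotation layout and the item IDs are hypothetical placeholders; P2175 ("medical condition treated") and the S248 source prefix ("stated in") are real Wikidata usage.
<syntaxhighlight lang="python">
def to_quickstatements(annotation):
    """Render one validated annotation as a QuickStatements V1 line.

    Hypothetical annotation layout: the drug item, the disease item,
    and the Wikidata item of the paper in which the fact was found.
    """
    return "\t".join([
        annotation["drug_item"],
        "P2175",                      # medical condition treated
        annotation["disease_item"],
        "S248",                       # source: stated in
        annotation["paper_item"],
    ])

annotation = {
    "drug_item": "Q000001",     # placeholder: the drug's item
    "disease_item": "Q000002",  # placeholder: the disease's item
    "paper_item": "Q000003",    # placeholder: the source paper's item
}
print(to_quickstatements(annotation))  # Q000001  P2175  Q000002  S248  Q000003
</syntaxhighlight>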
- w:Wikipedia:Why MEDRS? says a great deal of what needs to be said, I see. Charles Matthews (talk) 07:54, 11 February 2018 (UTC)
- Replying to the first paragraph - this is very important to us. There are invisible barriers between academia and Wikimedia and for 10 years I have been challenging academia (who produce much of the research in publications) to embrace Wikipedia (and now Wikidata). I tweet, blog, and lecture to academia about the importance of WM.
- But we also want academia to use WM tools and resources because they are often *better* than what academia uses (e.g. for community knowledge and its management). Academia has (IMO) failed to create Open Access as an approach where citizens are involved and valued. If citizens want knowledge they come to Wikipedia and only then to academic sources (e.g. repositories). Although the primary purpose of SS is to support Wikimedian reviewers and editors, we're optimistic that it will result in a wider use - the most important 30,000 articles in biomedical science. (Altmetric have just published a list of the most "popular" scholarly publications [2] and ca 80% are biomedical. I suspect that many of the downloaders and tweeters are not academics.) So SS could spin off into a widely used collection of the most wanted science, and the papers would be better as they would be checked and semantically annotated.
- In the other direction we've recently put in an H2020 sub-grant application to develop WikiFactMine as the primary tool for semantic text mining. It's better than academic approaches (more comprehensive, better semantics, easier to use than conventional metadata resources). We are enhancing our Open content-mining software with WikiFactMine and stressing this when we submit non-WM grants and tenders - we see WikiFactMine as a large win. Petermr (talk) 11:21, 11 February 2018 (UTC)
- What I personally would hope is this. That another 12 months work in the text-mining direction could answer the question, which Wikimedians will see as in the tradition of w:Ward Cunningham, "what is the simplest thing that would actually work?", for the issues posed by the current state of w:scholarly communication. I am not one for the premature counting of chickens. A comms strategy cannot be developed aside from clear ideas of message, channels and stakeholders.
- The rough-cut ideas that currently circulate in this area can be drawn together, and I think Wikimedians can take the "if not us, who?" call to action seriously at this point: we know 2017 was a good year. All that said, there is some nitty-gritty, devil-in-the-detail work to get through. Charles Matthews (talk) 18:45, 11 February 2018 (UTC)
Comments of Glrx
I would decline this proposal.
From an effectiveness standpoint, it has the same problems as Grants:Project/ContentMine/WikiFactMine/Final#What_didn’t_work. In particular, the project relies on a faulty premise: that there are lots of competent editors who would edit WP if only they had more sources. That is not the Wikipedia that I see. For a technical article, there are a few editors contributing to it. They contribute because they know the field, know reasonable sources for it, and have the judgment to write a decent overview. There are not a lot of information-starved editors who would contribute more if only they had fresh sources. I also run into a lot of editors who are confused: they have sources, but don't understand how to tie everything together.
I'm not sure I see a good distinction between secondary sources (books and survey articles) and primary sources (30,000 journal papers).
More significantly, the proposal does not persuade. What are examples of "new referenced facts"? How are those facts obtained? I get the sense the proposal wants to be an IBM Watson-lite. A system will ingest 30,000 scientific papers and then provide competent medical advice. What is it going to do with those papers? Will there be some notion of disease models built into the system? In short, I don't see how the proposed system will display judgment. Mentioning dictionaries does not help me. Furthermore, Watson was not a two-man-year project.
The budget is troubling. A "Senior SW Developer" man-year for $43K? https://www.payscale.com/research/UK/Job=Senior_Software_Engineer/Salary/382761e9/Cambridge says average pay is £47,000 / $70,000. I'd expect someone doing research work to command a higher than median salary. The project is not building a spiffy website. Consequently, I view the budget as problematic, and that implies the project is unrealistic.
The proposal seems to be more about generic WP meetups and traveling to conferences rather than concrete project goals.
Glrx (talk) 21:21, 16 February 2018 (UTC)
- On the first point you raise: a couple of things from my perspective. Re "there are lots of competent editors who would edit WP if only they had more sources", there are editors on WP who would indeed edit more widely on topics given such sources: let's call these people "encyclopedists". These are the community members who do not fit into your class of people who "know the field" already, whom we could call "experts". The main thrust of the taming of the MEDRS guideline would be to allow such encyclopedists into the expert understanding of the sourcing issues inherent in the use of scientific research in articles on clinical medicine topics.
- This is not a simple-minded thing to convey. In fact it is essentially an expert system issue. If you don't see the "good distinction between secondary sources [...] and primary sources" it is not because such a distinction doesn't exist. There are formal ontologies for review articles. The criteria for systematic reviews in the medical field are serious pieces of work, reflecting their importance, ultimately, for hospital treatment. I think you are on the wrong track in talking about medical advice, by the way.
- A second point is that Wikidata information can arrive in Wikipedias without direct human intervention, through infoboxes. This will be true particularly on the smaller language Wikipedias. If the argument is that "with plenty of expert editors, we don't need this kind of support", then the problem with it is that the premise doesn't hold, across the movement. The recognised number of diseases in the major sources is about 10K, which is at least a couple of orders of magnitude larger than the group of medical editors with clinical expertise.
- On budget and recruiting, I'm not going to say much: the WMF will take a view. I said above it looks for value for money, and the comparison with commercial rates in Silicon Fen reflects that. Charles Matthews (talk) 06:46, 17 February 2018 (UTC)
- Hi and thank you for your clarification. I've seen that you did not answer the question about the software engineer; could you explain why this salary is so far under the market rate? Léna (talk) 21:42, 16 March 2018 (UTC)
- Not really. The discussion period closed a while ago, by the way. $70K applied to a project grant with upper limit $100K does not leave very much for the community side: and the point is to get human inputs into the annotation system, on a wiki, rather than just write code. Some good housekeeping will be required. Charles Matthews (talk) 08:46, 18 March 2018 (UTC)
- Oh, no :) The community review ended March 12; we are in the committee review phase, until March 26. As you can see on Grants:Project/Quarterly/Committee or on my user page, I'm a member of the committee. So, if I understand your answer correctly, the software developer will actually spend more time doing human inputs than actually writing code? Léna (talk) 21:00, 18 March 2018 (UTC)
So, thanks for the clarification. Let me explain then also my role. I have been working closely with the ContentMine management on this proposal, and am named as a participant. But I'm not currently working for the company: I'm writing here as a Wikimedian volunteer. I cannot comment on management decisions, because I'm not going to be taking those, in the end. I have avoided saying anything, I hope, that involves detail on personnel, recruitment, structuring and timeline of the work, beyond the indications on work packages.
That doesn't mean that I'm avoiding these questions, which are indeed on my mind. The contact at ContentMine, Cesar Gomez, who is the Operations Manager, would be able to tell you much more about what is intended.
Let me try to be as helpful as I can, though. The starting assumption that a developer will be hired, in the standard Cambridge UK job market, for 12 months, may be incorrect. It is quite normal for tech jobs to be done remotely - for example Wikimedia UK's developers are not in London. So the premise on how much the work could cost per month is questionable. Also, the work may be divided up between people: ContentMine does have a developer, who is working on other projects at present.
Secondly, there are actually three roles, developer, project management, and community (Wikimedian in Residence). The development work envisaged divides, in code terms, into MediaWiki and some general coding (let us say Python, because there is some consensus about that within ContentMine). The general coding breaks down again, some of it being bot work (pywikibot, we assume). Once the original configuration of the platform and bots is done, the bot work could mostly be handled by the Wikimedian.
Without going into further detail there, at this point, I would say that the key to understanding the intended implementation is to divide it first into "inputs", "processing" and "outputs". Scientific articles will be downloaded, and rendered into a common form of HTML. That process requires a number of types of code. Then the HTML is searched, using ContentMine dictionaries, which is a segmented, massively parallel form of "find" for phrases; and the results stored in an annotation structure. Lastly, once other annotations (from humans also) are created, there will be some export options, into an annotation file type, and RDF.
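A minimal sketch, illustrative only and not the project's design, of that middle "processing" stage, for one paper and one dictionary; the function and the annotation fields are assumptions:
<syntaxhighlight lang="python">
import json

def annotate(html_text, dictionary, dict_name):
    """Search one paper's HTML for dictionary phrases.

    Results are stored as annotations-as-data: each records what
    matched and where, readable by machines and by human curators.
    """
    annotations = []
    for term in dictionary:
        pos = html_text.find(term)
        while pos != -1:
            annotations.append({
                "term": term,
                "dictionary": dict_name,
                "offset": pos,
                "source": "machine",
            })
            pos = html_text.find(term, pos + 1)
    return annotations

paper_html = "<p>Artemisinin is a first-line treatment for malaria ...</p>"
print(json.dumps(annotate(paper_html, ["malaria", "dengue"], "diseases"), indent=2))
</syntaxhighlight>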
So the implementation issue, in terms of code and who writes it, is first of all to set up the ScienceSource platform; then begin with inputs (WikiJournal articles could be pasted, more or less), moving on to downloads that require processing. The annotation process can start "by hand" but should ultimately be run through a web form. The outputs really mean extracting some kinds of JSON from the annotation structure.
This, anyway, is an overview of my understanding of the implementation. The requirement at the level of code is to put software together, starting from a base of code that ContentMine already has; on the bots and web form, probably most of it would come from standard libraries.
I hope this helps. Over 12 months the system will become more complex, as parts are added. The financial statement indicates that a team of two will be needed, but anyone can see that some management of roles will probably be required. And start-up companies have to be flexible, versatile. Charles Matthews (talk) 08:09, 19 March 2018 (UTC)
- A further comment on senior software engineer salary. A few weeks ago, I asked a NASA manager what NASA was paying, and his response was $180K and up. Even with NASA's cachet and interesting research projects, it must pay premium rates for programmers. There are too many interesting, well paid, projects around for those with great skills. Glrx (talk) 00:14, 20 March 2018 (UTC)
Sure, there is a skills shortage: it's not just a form of words. Those who simply want to maximise their salary are not going to end up working for a non-profit, one can guess. Charles Matthews (talk) 07:17, 20 March 2018 (UTC)
Eligibility confirmed, round 1 2018
We've confirmed your proposal is eligible for round 1 2018 review. Please feel free to ask questions and make changes to this proposal as discussions continue during the community comments period, through March 12, 2018.
The committee's formal review for round 1 2018 will occur March 13-March 26, 2018. New grants will be announced April 27, 2018. See the schedule for more details.
Questions? Contact us.--Marti (WMF) (talk) 02:19, 17 February 2018 (UTC)
Perspectives: scalability and maintainability
Hi and thank you for this very interesting grant proposal. I have some questions regarding the perspectives of this project.
- Your aim is Wikidata, but you never talk about multilingualism. Will annotations and journals all be in English, or in other languages as well?
- Is the software developed going to be usable for topics other than the medical field?
- How will bugs in the software be corrected once the grant is over?
Thank you ! Léna (talk) 21:51, 16 March 2018 (UTC)
- On multilingualism, there are several aspects. Anasuya Sengupta referred to this area in endorsing the project.
- There is multilingualism in the Wikidata sense, meaning that the site can be read in any language for which enough labels and descriptions have been added, and its content can be localised on any Wikipedia under similar conditions (per content area).
- There is multilingualism in terms of inputs, namely the papers downloaded to the site. English is perhaps the default language for science, a statement which I hope will not offend non-anglophones, but certainly not the only language used in the biomedical publications relevant to the project. Let me explain the technical side some more, to give the right picture.
- Searching the papers for terms is done with ContentMine dictionaries, which are now generated by SPARQL queries (possibly with other methods), but in any case entirely based on Wikidata. If we wanted a corresponding dictionary in German or French, the same query could be used to generate the Wikidata items, and to get the dictionary in, say, German, the language code "en" would need to be replaced, in the dictionary-making tool, by "de". In other words this is a minimal change, by software standards (a sketch follows this comment).
- On the site, and in the community, one expects most interchanges to be in English (as on Commons, Meta, Wikidata ...). As for annotations, those will be created by machine and by humans, and the humans will be contributing both in machine-readable form and in natural language. Having humans contribute in some sort of code, under protocols to be worked out by the community, is what will set ScienceSource apart from other annotation sites. The existence of natural-language annotations in various languages would mostly raise flags in the system.
- On the software: the answer is "yes", there are other possible applications, and they may be the subject of other grant proposals. For practical reasons, nothing was mentioned about those, this time.
- Once the grant is over? ContentMine places its software, as much as makes sense, on GitHub under an open license. So the future maintenance of the software depends on further uses being found, by ContentMine or others. I have to say that the downloading of papers addresses a large question about the scientific literature (online and open access) around which there is going to be continuing interest anyway. Obviously the hope is to make at least some incremental progress in this area of downloading; it would be good to think so, and I have been discussing this area with two of the advisers recently. Charles Matthews (talk) 18:04, 19 March 2018 (UTC)
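Picking up the label-swap point a few paragraphs above: a minimal sketch, illustrative only, of a dictionary generator in which the language code is the only moving part. Q12136 ("disease") is a real Wikidata item; the function itself is hypothetical.
<syntaxhighlight lang="python">
import requests

def disease_dictionary(lang="en", limit=50):
    """Fetch disease labels and aliases from Wikidata in one language.

    Swapping lang="en" for lang="de" is the only change needed to
    produce a German rather than an English dictionary.
    """
    query = """
    SELECT ?term WHERE {
      ?disease wdt:P31 wd:Q12136 .        # instance of: disease
      { ?disease rdfs:label ?term . }
      UNION
      { ?disease skos:altLabel ?term . }  # aliases too
      FILTER(LANG(?term) = "%s")
    }
    LIMIT %d
    """ % (lang, limit)
    r = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "ScienceSource-sketch/0.1 (illustration)"},
    )
    return [b["term"]["value"] for b in r.json()["results"]["bindings"]]

print(disease_dictionary("en")[:5])
print(disease_dictionary("de")[:5])
</syntaxhighlight>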
Aggregated feedback from the committee for ScienceSource
[edit]Scoring rubric | Score | |
(A) Impact potential
|
6.4 | |
(B) Community engagement
|
6.0 | |
(C) Ability to execute
|
6.0 | |
(D) Measures of success
|
5.6 | |
Additional comments from the Committee:
|
This proposal has been recommended for due diligence review.
The Project Grants Committee has conducted a preliminary assessment of your proposal and recommended it for due diligence review. This means that a majority of the committee reviewers favorably assessed this proposal and have requested further investigation by Wikimedia Foundation staff.
Next steps:
- Aggregated comments from the committee are posted above. Note that these comments may vary, or even contradict each other, since they reflect the conclusions of multiple individual committee members who independently reviewed this proposal. We recommend that you review all the feedback and post any responses, clarifications or questions on this talk page.
- Following due diligence review, a final funding decision will be announced on April 27, 2018.
Replies to additional comments
Aggregated by Charles Matthews (talk)
Comment 1
- The past projects have proven the importance of a reliable Wikipedia; this kind of project helps to create a great synergy between the new discoveries and the updated sources in our content.
Thanks for your supportive comment: we also believe the reliability of Wikipedia can be improved by projects like ScienceSource. In this project, the medical content on Wikidata will be added to, or have better references given. This content will be available, via infoboxes for example, on all Wikipedias.
The criteria for referencing will be those developed by the medical editors on the English Wikipedia.
We also believe ScienceSource can help to increase the synergy between new discoveries and updated sources in Wikipedia content. For example, new scientific discoveries, in the medical field, will be added to Wikidata by the ScienceSource project when they are available in good quality secondary sources.
Comment 2
- I doubt the scalability and sustainability of this activity. A huge number of "facts" have already been mined and have been largely unused.
Thanks for your comments. We believe ScienceSource technology has great potential for scalability, into other languages and disciplines (those outside the medical field), by adapting the techniques developed during this project. In fact, we believe the potential scope of this project is quite large. For now, it will concentrate on biomedical literature, in English. However, there is a great deal more to do in the area of science, and in other languages.
Further, scientific papers contain not just text but images, and tabular data, and both of these kinds of content could be tackled in an expansion of ScienceSource. For example, our platform can be repurposed for table extraction to Commons. Not just by our team - by anyone, since ScienceSource software will be made available on GitHub, for others to benefit from.
We clearly understand that sustainability in non-funded or post-termination projects comes from individuals or groups who know they need the resource and put voluntary effort into it. According to our collective experience, sustainability comes from finding communities of practice who are keen to be early adopters. We think the project is scalable, and it follows we think the further potential applications add to its sustainability.
Additionally, ContentMine commits resources to maintain and update its tools: it has done so with Quickscrape and Getpapers, software developed by ContentMine in 2015 that we have supported ever since, allowing people to use it today, with over 200 downloads every month. ScienceSource software will be supported after the project finishes.
We also believe our auxiliary tools can be of great use. For example, according to Daniel Mietchen, the Fatameh tool from the previous WikiFactMine project was essential for the further development of the Scholia project. And as a concrete example of fact mining, we have the Zika corpus, which has made it into a dedicated WikiProject.
Comment 3
- The impact potential is high but *only* if the project is done right. As far as I can see there is limited added value in just adding article items to Wikidata as long as these items are not used. The real perfect fit with strategic priorities would be a focus on adding references to existing items, adding statements etc., in which case the project will be more sustainable and more impactful.
Thanks for the positive comment on ScienceSource's potential impact: we also believe it could be high. The project will concentrate on referenced statements, which will indeed mostly be on existing items by the nature of the way the statements are found.
There are other features that will bring value for the community, and they will have a say in those. We shall be looking for indications on which papers should be processed, by general area. The figure of 30,000 papers was arrived at in this way: the number of Cochrane reviews (not open access) in the medical field is about 7,000. The intention is to try to match that coverage, with open access papers, by downloading and processing about four times as many articles. The facts extracted from those papers should show the basic scientific evidence for some major areas of treatment. The project is indeed interested in relevant facts, rather than volume measures. We will ask for steers from our ScienceSource wiki community as we go.
Further, the metric proposed for metadata addition will count the way Wikidata's science bibliography items are expanded with relevant information. The two go together: the basic thrust is to make it easier to check, by automation, that biomedical facts on Wikidata are appropriately referenced. The project will consciously work to build up the data on which algorithmic checking depends.
Comment 4
- The approach is iterative - it is a direct continuation of the ContentMine project. The potential real world impacts are likely to be modest as has already happened with ContentMine. I doubt that a sufficient number of human editors will ever participate.
Thanks for your comments. We would like to express our own view by starting with a couple of distinctions. Firstly, WMF grant applications, as is common practice, are structured with metrics for outputs, but also assessment of impact in terms of the original proposed aim, which was to make Wikidata a go-to place for science. Secondly, WikiFactMine, the previous project, is only part of ContentMine. It is just not accurate to say that WikiFactMine, or ContentMine, have had low impact, on the basis of the measured outputs of WikiFactMine.
All that said, we believe that the ScienceSource proposal is not an iterative continuation of the WikiFactMine project. ScienceSource will indeed use part of the technology developed by ContentMine during the WikiFactMine project, but applied in a medical review context. It closely targets quality of information, in the area where good information matters most to almost everyone.
Here are details of how we intend to engage Wikipedian medical editors. We want to build on:
- Wikipedia:Cochrane, a project already engaged in the use of systematic reviews on Wikipedia
- Discussion on w:Wikipedia talk:WikiProject Medicine of the extensive talk page archives for the project, which is a major source for edge cases in medical referencing
- A simplistic "three-dimensional model" of w:WP:MEDRS (restrict to reviews, papers no more than five years old, a whitelist of acceptable publishers), which would give an easy algorithm but would miss edge cases (a sketch of this simple filter follows the list)
- Getting behind "review", as used for example on PubMed Central, to work that's been done on the formal ontology of reviews.
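A minimal sketch, illustrative only, of the "easy algorithm" that the three-dimensional model in point three would give. The field names and the publisher whitelist are hypothetical placeholders; the real algorithm would be built with the WikiMed community precisely to catch the edge cases this version misses.
<syntaxhighlight lang="python">
from datetime import date

# Hypothetical whitelist; the real one would be community-curated.
ACCEPTABLE_PUBLISHERS = {"Cochrane", "BMJ", "The Lancet"}

def passes_simple_medrs(paper, today=None):
    """Simplistic three-dimensional MEDRS filter:
    1. the paper must be a review,
    2. no more than five years old,
    3. from a whitelisted publisher.
    Edge cases (e.g. fringe-medicine sourcing) are deliberately ignored.
    """
    today = today or date.today()
    return (
        paper["type"] == "review"
        and today.year - paper["year"] <= 5
        and paper["publisher"] in ACCEPTABLE_PUBLISHERS
    )

# A recent Cochrane review passes; an older primary study does not.
print(passes_simple_medrs({"type": "review", "year": 2017, "publisher": "Cochrane"},
                          today=date(2018, 3, 1)))   # True
print(passes_simple_medrs({"type": "primary", "year": 2010, "publisher": "BMJ"},
                          today=date(2018, 3, 1)))   # False
</syntaxhighlight>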
These matters are all mainstream for the WikiMed project: they are reflections of everyday discussions. In short, we will start to engage medics by gaining an appreciation of their everyday concerns about good referencing. Among the endorsers of the proposal are some leading figures in this area.
We also believe Wikidatans will engage and participate with ScienceSource, since it will operate, as a Wikibase site, in a way familiar to them. At an early stage we will want to apply for "federation" status with the Wikidata SPARQL endpoint (this was mentioned in the proposal, but for "continuing impact"); being able to query ScienceSource in combination with Wikidata would be helpful operationally, and open up new applications.
The next step would be documentation of a provisional framework for ScienceSource outputs, followed by the programming of face-to-face and possible online training events.
Comment 5
- This is very much an iterative project following the previous WikiFactMine project which honestly was not the most impactful one, unfortunately.
Thanks for your comments. We think there are distinctive features of ScienceSource, and these include:
- getting outside the "black box" of a virtual machine on Wikimedia Labs, to a wiki context; and
- making the whole text of papers available in a uniform type of HTML, in line with the strategy discussion of 2017, and contributing to the wider project of moving open access content to homogeneous "scholarly HTML".
Those points are actually of rather different kinds. The first is completely familiar to Wikimedians, in the terms that wiki collaboration is a good thing to have around if you want volunteers. The second is a known major issue in w:scholarly communication. It should be interesting to put side-by-side two mission-statement level aspirations:
- "To work with partners to create better tools and support their knowledge extraction" (ContentMine)
- "We will build tools for allies and partners to organize and exchange free knowledge beyond Wikimedia" (from Strategy/Wikimedia movement/2017/Direction#Implications: Our destination by 2030)
These fine phrases indicate a large degree of convergence, but no one should think these potential hook-ups are going to be plain sailing. ContentMine, in brief, was set up to resolve scholarly communication issues, not to develop Wikidata. The paragraph below the strategy quote is this:
We will continue to build the infrastructure for free knowledge for our communities. We will go further by offering it as a service to others in the network of knowledge. We will continue to build the partnerships that enable us to develop knowledge we can't create ourselves.
That seems, to us, to argue for Wikimedia to make the most of the potential of fact mining, but in any case the term "infrastructure for free knowledge" would apply both to Wikidata and to ContentMine tools.
The perceived weaknesses of WikiFactMine, for example in terms of usability, were in the final report, and have been addressed. Technically speaking, the text mining part of the project is still there, but will be migrated to Python code (from ElasticSearch), which is easily reusable. The kind of co-occurrence used will be more suited to Wikidata triples (i.e. adapted to data mining). Overall, lessons have been learned.
- I don't understand how you're superseding the "black box" mode. In the outputs I don't see anything that goes directly to the wikis, only «Corpus of 30,000 WFM-annotated articles (Parsoid format) on Wikimedia Labs». So, if I understand correctly, nearly all the curation work will happen outside the wikis.
- Is this project essentially about creating an automated dictionary or summary of the literature? A bit like some recent efforts by proprietary publishers? If so, perhaps this set of pages should be published as a Wikibooks book. On Wikibooks, it will be editable and then users will be able to easily copy text to Wikipedia if needed. In the proposal I don't see a process for the outputs to actually be put into use, just a hope that they will magically happen to be. --Nemo 13:43, 29 April 2018 (UTC)
Comment 6
- The participants have the skills to develop the project, but there is a lack of explanation of the responsibilities of everyone involved in the project (the role of staff, participants, community, volunteers, etc.). The budget seems reasonable, but I didn't see what the benefits of participating in Wikimania are.
Thanks for your comments. We have assembled a mix of talent and experience on the staff. Also, our advisory board includes: prominent figures in two of the key communities; software experts; and an experienced former Wikimedian in Residence in the medical area. They will be able to guide the project, provide further contacts, and suggest technical solutions.
In terms of project team roles, we will have:
- Jo Brook, Software development and UX and work package owner
- Peter Murray-Rust, Technical Director
- Charles Matthews, Wikimedian in Residence and work package owner
- Guilia Arsuffi, community and outreach support
- Cesar Gomez, project management and work package owner
- Jenny Molloy, Science Director.
Some of the team participants will be resourced at zero cost to the project.
The following concrete benefits were gained from Wikimania 2017: at the Hackathon, adaptation by Magnus Manske of the PagePile tool for ContentMine dictionaries, a logo, and a workshop given by Tom Arrow; presentations at the Med Day by Charles Matthews and in the conference by Tom Arrow; contact with users Vaclav Dostal and JenOttawa (Cochrane Canada). A stall in the Community Village was manned by Charles for three days.
Comment 7
- No problems here although there may be some questions over the budget.
Thanks for your comment. For this project, our plan is to provide a monthly expenditure report to the grant officer. All the team members' costings are based on the actual cost to ContentMine. For travel expenses and dissemination activities, the figures are estimates.
Comment 8
- The project is feasible and the participants are skilful. My major concern is the efficiency of the budget: that's not an efficient budget for the mere creation of article items on Wikidata, but it would be a good budget in case the focus moved to improving content items.
Thanks for the comment. The budget does need to be applied to improving content. The metrics proposed for the project of course can be revisited.
It is not intended to count items created for scientific articles (now running at around 15M on Wikidata), but "missing data" from such items will be registered by the attempt to apply the underlying algorithm. In other words, when a potential reference is being checked automatically to see if it is suitable, it may be rejected because some basic piece of data such as the year of publication is absent. When the software raises a flag because the check fails, action can be taken to fill the gap.
The algorithm itself should actually be regarded as one of the key deliverables from the project. It should include the "edge cases" from current practice, e.g. including information on how acupuncture can be referenced to reliable sources (this is an old chestnut for WikiProject Medicine). That example shows that alternative and fringe medicine can be challenging, but at the same time Wikipedia is expected to provide information on those treatments that is scrupulously referenced. Current practice excludes nearly all the sources in the area that lay people are likely to encounter.
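A minimal sketch, illustrative only, of the automated "missing data" check described above. P577 is Wikidata's real "publication date" property and wbgetclaims is a real API module; the flagging logic and the item ID are placeholders.
<syntaxhighlight lang="python">
import requests

def has_publication_date(qid):
    """Check whether a Wikidata item carries a publication date (P577).

    A candidate reference whose item lacks this basic datum would be
    rejected by the checking algorithm and flagged, so the gap can be
    filled before the source is reused.
    """
    r = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetclaims",
            "entity": qid,
            "property": "P577",
            "format": "json",
        },
        headers={"User-Agent": "ScienceSource-sketch/0.1 (illustration)"},
    )
    return bool(r.json().get("claims", {}).get("P577"))

qid = "Q000001"  # placeholder: substitute a real scientific-article item
if not has_publication_date(qid):
    print(qid, "is missing a publication date; flag for metadata completion")
</syntaxhighlight>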
Comment 9
- The project focus is well defined and it has a lot of support; past experience accounts for much of this. I don't see whether there are plans to expand to other languages or to create collaborations with other kinds of projects.
Thanks for your comment. We believe we have had broad community support, in part, because the proposal was developed by consultation at the end of 2017, and reflects views in key editing communities. Our proposal lays out a project that is manageable, over 12 months, as a combination of software development, communications and community-building, selecting a set of key papers from the biomedical literature, and formalising the good practice of medical referencing on Wikipedia. For obvious reasons, some practical limitations have to be set to the ambitions to do more, but the wider community will have a chance to give their input into the scope of what is attempted.
For expansion to other languages, the effectiveness of the text mining techniques in a language other than English depends firstly on the completeness of the Wikidata labels, and aliases, for items such as diseases. There is no particular reason why ScienceSource technology should not then be applied, once that fundamental translation task is addressed.
Comment 10
- The community engagement is significant, but will it be enough for success?
Thanks for your comment: community engagement is key for us and the success of ScienceSource. Based on the consultation work already done, we believe that the underlying topic is of major interest to medical editors. The topic of annotations, with which the project innovates within Wikimedia, also has numerous enthusiasts. Here are some details of our approach.
The project will get started with a pilot on WikiJournal articles, which begin life in a wiki format. By adding value to WikiJournal text, including annotations, we can help build a community around ScienceSource.
In simple terms, facts mined from those WikiJournal texts will already be very close to facts to be found in Wikipedia, and will be referenced to reliable sources. Those sources can be used as references on Wikidata (while using Wikipedia to reference Wikidata statements is deprecated, because circular referencing can become a problem). Annotations of WikiJournal add value (and the static pages avoid the issues with annotating unstable Wikipedia articles). But in the other direction, the automated checking of the referencing will provide, in some cases, feedback to authors, leading to actions that replace unsuitable references.
- The WikiJournal model shows great promise as a future, more reasonable replacement in the science area for the "featured article". The WikiJournal community engages both authors and peer reviewers, in quite large numbers, and it grew fast in the past year. The reviewers are largely non-Wikimedians, and we see working with WikiJournal as a route to interesting scientists in what we are doing. Figures provided to us show that this community is drawing in numerous authors (over half of them) who were not previously Wikimedians, and the medical topic areas include: neurology, radiology, virology, parasitology, pharmacology. For these reasons, we are going, from the start, to try to engage these medical authors in ScienceSource, as part of our plan.
Comment 11
- There is community interest in this project, but I am not sure there is enough community participation, as that was a weak point of the previous WikiFactMine iteration (it fell short of usage targets).
Thanks for the comment; we think the community interest will start from the involvement of two community champions (WikiMed and WikiJournals). To build on that, we will take from the recommendations of our advisory board, inputs from volunteers, the engagement from our communication channels and the lessons learned from previous projects. ScienceSource will offer a much broader range of options for participation, compared to WikiFactMine, and we see that as a recipe for diversity. In particular annotation is a versatile tool.
In more detail, we look for a community that starts with:
- Wikipedian medical editors;
- WikiJournal community members, particularly medical authors; and
- Wikidatans.
On the ScienceSource wiki, people will be able to discuss, for example, which open access papers to download. The annotations will be handled by Wikibase, so Wikidatans will be able to propose properties, not available on Wikidata, that are rather more specialised to scientific needs (for example, for experimental methodology). What types of annotations are permitted would be a community decision. "Placeholder" items for closed access papers would be permitted, allowing easier mapping of Wikipedia referencing into ScienceSource.
There are areas where dictionaries for fact mining are not so easily created (e.g. research funders) that could be built up via custom forms of annotation. Custom anchor points in papers, certainly, can be useful, if you want to link into them. It should also be mentioned that ScienceSource is deliberately named in a way that suggests Wikisource, where the texts of open access papers have had difficulty gaining acceptance, as have guidelines for annotation.
The software requirements for (i) downloading papers, and (ii) translating PDFs into HTML, allow for collaboration, and given the right architecture, coders would be able to contribute in JavaScript and PHP. Summing up, a much broader range of wiki-style activity can be envisaged.
Additionally, we will try to bring editors from other communities, such as SPARC. Setting up a green-field site for annotation of science papers appears to us to have attractive features.
- Why do you plan such discussions on which papers to download etc. to happen on a separate wiki? If those discussions have an importance to the final goals (improving the content of Wikimedia wikis), they should happen on a Wikimedia wiki. Otherwise this sounds like we're just funding yet another wiki out there. --Nemo 13:45, 29 April 2018 (UTC)
Comment 12
- I am slightly in favour of supporting it, but in my opinion the travel costs must be cut. There is no sense in proposing this project to a wide audience (like Wikimania) when the project could find more acceptance at specific conferences. The concepts behind this project are very technical, and a trip to Wikimania doesn't generate a real impact.
Thanks for your support. We can contemplate cutting down the travel cost, or making the company assume part of that cost, to make this project work. The grant rules do not allow for claims for travel to conferences outside the Wikimedia family. As we presented in detail at the grant interview, ContentMine attends 10-15 conferences every year, some of them directly related to the medical community.
As it turned out, Wikimania proved fruitful, in 2017, for important contacts in the medical and molecular biology areas. A talk proposal has been made for Wikimania 2018, in Cape Town. The area chosen is not fact mining, but the issues involved in downloading and using the open access literature. In other words, it concentrates on the inputs to ScienceSource. This topic is in line with the Wikimania theme ("gaps in knowledge"); and fairly clearly should interest anyone trying to study science, outside major academic institutions.
In order to make this technical project more accessible for non-technical community, our team will run a mini-conference in the second half of October (a local event, in Cambridge UK, for the Wikidata sixth birthday). Some resources would make possible a more ambitious event. A combination of documentation, in-person events and use of online channels, is part of our community engagement plan.
Comment 13
- I am not sure that continuation of this fact mining activity (which is quite expensive) is beneficial. Millions of facts will be mined but then left unused, because competent editors will not need them, while those editors who are less competent will be unable to use them. I am also not persuaded by the final report of ContentMine that that money was well spent.
Thanks for your comments. At ContentMine we have faith in the benefits this technology can bring to the wider community. Text mining is a standard process, and has been for a generation: it is considered to have a close relationship with machine learning. Annotations are now being widely used on the scientific literature (for example by Hypothes.is, Europe PMC) to add value to scientific papers.
But up to this point, these techniques have not been closely allied with the "semantic web" approach, for which these days Wikidata is the first port of call. The kind of search that ScienceSource is intending is to extract well-referenced facts in the subject-object form required for statements on Wikidata. The "mining" metaphor is perhaps a bit misleading. It is not heavy work to find the elementary components of fact mining. It begins simply with recording the results of a search for a given word or phrase.
The role of being "competent" is given to the human checker, who annotates the fact to validate it semantically: humans identify the meaning in the scientific paper. Validated facts are then exported automatically to Wikidata. We can identify a division of labour here. The infobox mechanism means that someone with localization or Lua skills can help in placing the fact into the Wikipedia in a given language, while the subject expert has given a verdict upstream.
To sum up this description: the process is a kind of massively parallel search that is semi-automated, in that human checking of the content of an article is required. This check takes the form of signing off that what is written in a scientific paper is correctly being written into machine-readable form.
The problem with the expression "millions of facts will be mined" is that the millions refer to finding instances of terms like "leukemia" or "valium" in scientific papers. Those cost only some CPU cycles and memory. The human checking is what should attract attention.
So let's redescribe the process more usefully. The amount of information provided by ScienceSource will be much greater than that which the average reader can hold in memory. It simply saves vast amounts of time to be able to search the literature for "all cancers", and then combine that search with some other dictionary, say of drugs.
This topic, of "co-occurrence", is not necessarily well understood. The type of fact mining proposed is one that could find papers where some type of cancer is mentioned within, say, 100 characters of some drug of a particular group. Some experts might be able to recall having seen such a paper, but in practice automated search is the powerful tool to do this. That is not all, of course. What is truly powerful is the semi-automated method, where a human then checks the information (that there are no problems with ambiguities, and that the semantics is clear, namely that the drug is actually a recognised treatment for that cancer). What really matters is better software helping the human to know where to look for such facts. Once the problem is posed in these terms, say with a GUI, it becomes more obvious that the issue is much more about usability than anything else. Resources are needed to get scientific papers into suitable form, and to present the search results more usefully. That's the nubbin of the actual problem.
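A minimal sketch, illustrative only, of the co-occurrence search just described; the two short word lists are toy stand-ins for the SPARQL-generated ContentMine dictionaries.
<syntaxhighlight lang="python">
import re

# Toy stand-ins for two ContentMine dictionaries.
CANCERS = ["leukemia", "lymphoma", "melanoma"]
DRUGS = ["imatinib", "rituximab", "vemurafenib"]

def co_occurrences(text, window=100):
    """Find a cancer term and a drug term within `window` characters.

    Each hit is only a candidate fact: a human checker must confirm
    the semantics, e.g. that the drug really treats that cancer.
    """
    hits = []
    for cancer in CANCERS:
        for m1 in re.finditer(re.escape(cancer), text, re.IGNORECASE):
            for drug in DRUGS:
                for m2 in re.finditer(re.escape(drug), text, re.IGNORECASE):
                    if abs(m1.start() - m2.start()) <= window:
                        hits.append((cancer, drug))
    return hits

paper = "Imatinib remains a first-line therapy for chronic myeloid leukemia."
print(co_occurrences(paper))  # [('leukemia', 'imatinib')]
</syntaxhighlight>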
Comment 14
- I would support funding only in the case of a significant focus on adding these extracted articles and facts as references to Wikidata statements, as well as community engagement around usage of this software. Specifically, I would suggest setting much higher measures of success for statements added or referenced on Wikidata on items other than scientific articles: with a target of just 500 (and I don't know how many of them are articles and how many are other items), it gives us a fantastic 200 USD / statement. I can't say what a good number is, but I would suggest linking funding to better targets on a) statements added or referenced (on items other than scientific articles) as a content improvement target, and b) users involved in adding these references (not in adding articles) as a community engagement target.
Thanks for your comments and analysis. We agreed to focus our targets on (a) statements added or referenced and (b) users involved in adding those references.
The metrics initially chosen for ScienceSource took into account that sheer volume was achieved in the previous project, but that ScienceSource is about other considerations. In particular the algorithm for checking sources against the guideline for "reliable medical sources" itself is one key deliverable. Once it is there, it can be used in other contexts, most obviously on Wikipedia to weed out poor referencing. The text of 30K significant papers in a standard HTML form is another deliverable. It should be of value to third parties, and as a starting point for other projects. Care would be taken of these matters as the project went along. Cost per fact ignores those matters.
The given number of 500 human annotations could be seen as a bare minimum to carry out proof-of-concept in this setting: that could be taken as fair comment on the way the metric was set. The annotations that should be counted are those that actually led to well-referenced facts that passed to Wikidata, through the filtering by the algorithmic form of the medical reliable sources guideline.
We expect the algorithm initially to fail most facts, but it has to be understood that the algorithm is better if it is tougher (more restrictive on referencing, uses a better and more stringent definition of "secondary literature"). It represents, in software form, the clinicians' judgement on how patients can be treated, so should be conservative.
Round 1 2018 decision
Congratulations! Your proposal has been selected for a Project Grant.
The committee has recommended this proposal and WMF has approved funding for the full amount of your request, US$99,640.
Comments regarding this decision:
The committee is pleased to support ScienceSource, with inclusion of the following conditions, per prior discussion with the grantees:
- Reference data:
- ScienceSource will add reference data with high quality sources, and verify data and add references where absent.
- ScienceSource will focus on the most important medical facts in recent systematic reviews.
- Communications Plan:
- Team will submit Communications Plan to Marti Johnson and Lydia Pintscher for review at least three weeks before execution, so feedback can be incorporated before work begins.
- UX design
- At beginning of planning stage for UX design, team will contact Marti Johnson to schedule consultation with Wikimedia Foundation UX design staff.
New grantees are invited to participate in a Storytelling Workshop on June 5 and a publicly streamed Project Showcase on June 14. You can learn more and sign up to participate here: Telling your story.
Next steps:
- You will be contacted to sign a grant agreement and set up a monthly check-in schedule.
- Review the information for grantees.
- Use the new buttons on your original proposal to create your project pages.
- Start work on your project!
Upcoming changes to Wikimedia Foundation Grants
Over the last year, the Wikimedia Foundation has been undergoing a community consultation process to launch a new grants strategy. Our proposed programs are posted on Meta here: Grants Strategy Relaunch 2020-2021. If you have suggestions about how we can improve our programs in the future, you can find information about how to give feedback here: Get involved. We are also currently seeking candidates to serve on regional grants committees and we'd appreciate it if you could help us spread the word to strong candidates--you can find out more here. We will launch our new programs in July 2021. If you are interested in submitting future proposals for funding, stay tuned to learn more about our future programs.