
Talk:Wikicite/grant/Adding support of DBLP and OpenCitations to Wikidata


@Csisc: Thank you for making this proposal. I realise that it is still marked as a "draft" and you are still writing the details, but I wanted to make a comment early:
This is a project to mass-add content to Wikidata; therefore, for the WikiCite grant committee to fund it, it would be important for us to see that you have community support for this activity. It is impossible to "prove" community support, of course, but it would be very important for you to notify the relevant Wikidata community discussion forums about your plans and invite them to endorse or comment on this proposal. Thank you, LWyatt (WMF) (talk) 12:18, 25 September 2020 (UTC)

LWyatt (WMF): Thank you for this advice. I will certainly do that. --Csisc (talk) 12:34, 25 September 2020 (UTC)

Bot approval consensus at Wikidata

  • The last big bot approval for importing papers was made under the assumption that the Wikidata infrastructure could handle larger imports. At this point in time Wikidata is at its capacity, so it is very unclear to me that a bot approval for this task would succeed.
While I would like Wikidata to be able to handle as much data as we want to put into it and to have all papers in Wikidata, I'm skeptical that we have the technical capacity for that at the moment. Given that there's a good chance that this won't lead to an approved bot, I think bot flags should be sought before the project gets funded. ChristianKl 19:43, 25 September 2020 (UTC)
@ChristianKl: There is a huge lack of scholarly citations in Wikidata (i.e. P2860 relations). Wikidata has succeeded in handling bibliographic information for scholarly publications, and it can support citation information as well. Wikidata items have DOIs, and OpenCitations and DBLP records also include DOIs, so there is no obstacle to integrating scholarly information from DBLP and OpenCitations into Wikidata. --Csisc (talk) 15:48, 26 September 2020 (UTC)
I think it is misleading to count as a success an import project that was approved on the assumption it would not cause problems but then did cause problems. The bot you propose might get bot flags, but it also might not. ChristianKl 21:55, 27 September 2020 (UTC)
@ChristianKl: I thank you for your answer. There are several reasons why the two bots should be approved:
  • Many similar bots mass uploading bibliographic information from CC0 databases have already been approved.
  • We are not just mass uploading the data. We are programming the two bots for a given purpose: enriching Wikidata with bibliographic information it does not yet cover.
Yours Sincerely, --Csisc (talk) 09:46, 30 September 2020 (UTC)
My main point is that previous bots for mass uploading data have been approved under assumptions that currently don't hold. Thus it makes sense to seek bot approvals before seeking funding as the community might or might not approve the bot. ChristianKl 22:29, 2 October 2020 (UTC)

Questions: WMF buying you a computer?

I don't get it. You're going to use the money to buy a computer with an internet connection? Why don't you just use Toolforge or the Wikimedia Cloud? Why do you need a "High-Performance Computer with GPU and CPU"? Multichill (talk) 19:46, 25 September 2020 (UTC)

  • Why do you need a GPU? Have you compared the pricing to just using cloud computing? 3500 USD would get you a beefy machine for the one month you are planning on running.
  • Why does an internet connection for one month cost $500? Is this some special connection?
  • Will the source code be released under an open/free license?
  • How many new items will you be making? Will those just be for authors? Or also publications?
  • How will you actually do the joins between the various databases? What error rate do you anticipate?

BrokenSegue 15:06, 26 September 2020 (UTC)

@Multichill and BrokenSegue: I thank you for these questions. What we will do is build an infrastructure to mass upload the OpenCitations and DBLP databases to Wikidata over five months. This infrastructure will not stop working after the end of the project: bibliographic information is ever growing, there are new publications every day, and the infrastructure will certainly be used for years. Renting a cloud service is not a good option, as within two years the rental fees would exceed the cost of buying a high-performance computer. The data is also very large and could consequently overwhelm Wikimedia Cloud Services; as you already know, Wikimedia Cloud was blocked several times this summer due to an excessive number of requests. So the internet connection is for two years, not only for one month, and the high-performance computer will also be used for years. Concerning your question of why we need a high-performance computer for this work: we need to develop a deep learning architecture to infer new knowledge about the items enriched from DBLP and OpenCitations (e.g. main subjects, affiliations...), and we need the GPU and CPU for that. --Csisc (talk) 15:36, 26 September 2020 (UTC)
@BrokenSegue: Concerning your other interesting questions, we will release the source code of the bots under the CC BY 4.0 License (https://creativecommons.org/licenses/by/4.0/legalcode) in the GitHub repository of our research group (https://github.com/Data-Engineering-and-Semantics) so that it can be reused in other WikiCite projects. The bots will certainly add millions of scholarly citations and thousands of computer scientists to Wikidata, and they will also add research publications to Wikidata if they are missing. The bots will use DOIs to match records between DBLP, OpenCitations and Wikidata. --Csisc (talk) 15:43, 26 September 2020 (UTC)
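(Illustrative note: a minimal Python sketch of the DOI-based matching described above, looking up a Wikidata item by DOI (P356) and pulling the references of the same DOI from the OpenCitations COCI REST API. This is not the grantee's bot code; the COCI endpoint URL and the "cited" response field are assumptions based on the public API documentation and may need adjustment.)

    # Sketch of DOI-based matching between OpenCitations and Wikidata.
    import requests

    WDQS = "https://query.wikidata.org/sparql"
    COCI_REFS = "https://opencitations.net/index/coci/api/v1/references/{doi}"  # assumed COCI endpoint

    def qid_for_doi(doi):
        """Return the QID of the Wikidata item whose DOI (P356) matches, or None."""
        # DOIs are conventionally stored in upper case in Wikidata.
        query = 'SELECT ?item WHERE { ?item wdt:P356 "%s" } LIMIT 1' % doi.upper()
        r = requests.get(WDQS, params={"query": query, "format": "json"},
                         headers={"User-Agent": "wikicite-matching-sketch/0.1"})
        rows = r.json()["results"]["bindings"]
        return rows[0]["item"]["value"].rsplit("/", 1)[-1] if rows else None

    def candidate_citations(citing_doi):
        """Yield (citing QID, cited QID) pairs that could become P2860 claims."""
        citing_qid = qid_for_doi(citing_doi)
        if citing_qid is None:
            return
        for ref in requests.get(COCI_REFS.format(doi=citing_doi)).json():
            cited_qid = qid_for_doi(ref["cited"])  # "cited" field assumed from the COCI docs
            if cited_qid is not None:
                yield (citing_qid, cited_qid)

    # Example (placeholder DOI): print citation candidates for one publication
    # for pair in candidate_citations("10.1000/example-doi"):
    #     print(pair)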
@Csisc: Ah, ok thanks for the answer. CC BY is an odd license choice for source code; I believe CC suggests you use other licenses for software. When you say you will infer "Research Topics, Affiliations", which Wikidata properties do you plan to infer, and for what kinds of entities (humans/papers/etc)? Also, what will you be using as your training corpus, and what level of confidence will you deem sufficient to import? BrokenSegue 15:54, 26 September 2020 (UTC)
@BrokenSegue: I thank you for your answer. I can use the MIT License or a GNU license for the source code. If the MIT License seems more adequate than CC BY 4.0, I will use it. The Wikidata properties that will be inferred are P921 for research publications, P101 and P1416 for scientists, and P101, P1416 and P355 for institutions. The training corpus will be the scholarly relations already available in Wikidata. We will also use techniques that do not require a training corpus, such as Latent Dirichlet Allocation and semantic similarity. --Csisc (talk) 16:05, 26 September 2020 (UTC)
@Csisc: Yeah, I think MIT/GPL would be more appropriate. Those properties seem reasonable. It's true that LDA is unsupervised but you will need some supervision to map the topics to wikidata items and some intuition to know what thresholds to use to filter these results. Also, a lot of the labels already present in wikidata were made using simple string matching techniques which might make it hard to use. My main remaining concern about methodology is the error rate we are willing to tolerate in these inferences. But I imagine that's non-trivial to figure out ahead of time. BrokenSegue 16:21, 26 September 2020 (UTC)
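(Illustrative note: a minimal sketch of the unsupervised route discussed above, running LDA over a handful of paper titles and only proposing a main subject (P921) when the dominant topic clears a probability threshold. The titles, the threshold value and the topic-to-item mapping are all placeholders, not the grantee's pipeline.)

    # Minimal LDA sketch with a confidence threshold; all data below is made up.
    from gensim import corpora, models

    titles = [
        "deep learning for named entity recognition",
        "a survey of semantic similarity measures",
        "graph neural networks for citation analysis",
    ]
    texts = [t.lower().split() for t in titles]
    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, random_state=0)

    THRESHOLD = 0.7                                     # minimum topic probability before proposing a statement
    TOPIC_TO_ITEM = {0: "Q-TOPIC-A", 1: "Q-TOPIC-B"}    # placeholder mapping; would need manual curation

    for title, bow in zip(titles, bow_corpus):
        topic_id, prob = max(lda.get_document_topics(bow), key=lambda x: x[1])
        if prob >= THRESHOLD:
            print("%r -> candidate P921 value %s (p=%.2f)" % (title, TOPIC_TO_ITEM[topic_id], prob))

As BrokenSegue notes above, the threshold and the topic-to-item mapping are exactly the places where supervision and manual curation would still be needed.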

Questions about data quality and scope of proposal

A few questions about the data you propose to import. The Wikidata Query Service can make SPARQL federated queries to retrieve data from the OpenCitations endpoint. Can you explain why you have decided to mass import data from one of the external sources we can already access via federation? Given the capacity limitations that ChristianKl has highlighted above, this import could potentially contribute to problems and yet it does not appear to give us anything we cannot already get. As you intend to significantly improve the coverage for computer science publications and researchers, please can you describe the current level of coverage?

Regarding DBLP, I notice the data has owl:sameAs relationships to Wikidata. From your experience working with the DBLP dataset, do you know how prevalent these relationships are? Further, I am curious to know whether you evaluated other sources of bibliographic data before selecting DBLP and OpenCitations for import. Specifically, can you be certain that enough new knowledge will be generated to justify the spend on high-powered computing? In my opinion, it is very likely that the main subject, affiliation and even citations are already available to extract from existing sources and could be imported. For example, the Microsoft Academic Knowledge Graph contains a huge amount of data.

To better understand this project, I would like to see specific answers to the questions from BrokenSegue above concerning the data that will be created. In addition, it would be useful to know whether there will be any validation of the data created. From my experience of creating bibliographic data and working with the corpus in Wikidata, it is not uncommon to encounter data quality issues, especially in large batch imports, and these issues often remain undetected for significant periods after creation.

Finally, in my opinion, 100,000 edits is not a successful outcome for a five-month project with a $4,000 budget. For context, to import every paper published in Lecture Notes in Computer Science (Q924044) would require around 300,000 edits at the very least. Simon Cobb (Sic19 ; talk page) 18:19, 27 September 2020 (UTC)

Simon Cobb: I thank you for your comments. Concerning your questions, I am honoured to provide this information:
  • Use of query federation: Federated queries cause citation queries to exceed the timeout limit of the endpoint, particularly when we deal with the scholarly citations of multiple publications. We have already tried this approach, but we did not obtain useful results. We can revisit it if the timeout limit of the Wikidata Query Service is significantly extended. That is why we decided to mass import OpenCitations and DBLP into Wikidata.
  • OpenCitations coverage: We tried OpenCitations on several of our own works and found that, for a single publication, OpenCitations can return twice as many citations as Wikidata in the medical domain and six times as many in computer science.
  • DBLP: The alignment of DBLP to Wikidata has appeared recently. It covers nearly all the authors indexed in Wikidata, but it still needs considerable refinement for publications and venues. Just for information, I demonstrated Wikidata during the BIR Meeting 2020 three months ago, and the creators of DBLP were also there. I think that they picked up this idea there.
  • Evaluation of other resources: We certainly evaluated other resources before choosing OpenCitations and DBLP. The choice was based on several significant factors: CC0 licensing, availability of a RESTful API, work experience with the resources...
  • DBLP features: DBLP has better author disambiguation than any other resource, which is why it will help improve the coverage of computer scientists in Wikidata. We also found that several million CS publications are indexed in DBLP but not in Wikidata.
  • Data validation: Of course, we will not mass import the data without preprocessing it. We will check that an item corresponds to the DBLP or OpenCitations record before adding a statement. Simple constraints are useful for this: for example, if the DOI and the title of the external record match those of the Wikidata item, then add the information (a short sketch of such a check follows this reply). We will also use SPARQL queries to align the resources.
  • Concerning the 4000 USD: Please read the proposal; I have clearly explained that the 4000 USD is not for deploying the project during the five months. The five months are for developing and testing the bots, while the fees will let the two bots work for years: scholarly research will not stop after five months, and we will need to add new scholarly citations as they appear.
  • The threshold: We cannot set a high threshold, as we do not know how the bot flag requests and other issues will be resolved during our work. I think that this threshold is acceptable for testing the two bots.
Yours Sincerely, --Csisc (talk) 12:11, 28 September 2020 (UTC)
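(Illustrative note, referred to in the "Data validation" point above: a minimal sketch of the DOI-plus-title constraint. The record layout and field names are assumptions, not a real schema, and this is not the actual bot code.)

    # Only add a statement when both the DOI and the (normalised) title agree.
    import re
    import unicodedata

    def normalise(title):
        """Lower-case, strip accents and punctuation so titles compare robustly."""
        ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode()
        return re.sub(r"[^a-z0-9 ]", "", ascii_title.lower()).strip()

    def safe_to_enrich(wikidata_item, external_record):
        """Return True when the external record clearly describes the Wikidata item."""
        same_doi = wikidata_item["doi"].upper() == external_record["doi"].upper()
        same_title = normalise(wikidata_item["title"]) == normalise(external_record["title"])
        return same_doi and same_title

    # Illustrative records (made-up values)
    item = {"qid": "Q-EXAMPLE", "doi": "10.1000/XYZ123", "title": "An Example Paper"}
    record = {"doi": "10.1000/xyz123", "title": "An example paper."}
    print(safe_to_enrich(item, record))   # True: DOI and title match, so the statement may be added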

Use of Deep Learning Algorithms

On the proposal page you briefly state: "The project will make use of deep learning algorithms to generate new knowledge (Research Topics, Affiliations) from the extracted ones and consequently to further enrich Wikidata with bibliographic information." This motivates the purchase of a 3500 USD computer, accounting for over 85% of the grant money. Therefore I have a few questions regarding the planned use of machine learning:

  1. You will generate research topics for papers and affiliations for people based on the research papers, correct?
  2. What are the relevant papers and tools for this task you plan to build upon?
  3. Deep learning requires training data; how is this data acquired?
  4. How are the deep learning models evaluated?
    1. Which precision and recall do you expect? (based on the existing research)
    2. How many wrong topics and affiliations do you expect the model to generate (and to be added to Wikidata)?
    3. Many machine learning models have been found to be biased in various ways. How will you mitigate bias?

--Pyfisch (talk) 21:03, 27 September 2020 (UTC)

Pyfisch: I thank you for your comments. I will answer your significant questions one by one:
  1. You will generate research topics for papers and affiliations for people based on the research papers, correct? Yes, definitely.
  2. What are the relevant papers and tools for this task you plan to build upon? Our team has worked on various interesting semantic similarity and word embedding techniques. For example, https://www.sciencedirect.com/science/article/abs/pii/S0952197614001833 defines an ontology-based semantic similarity measure. https://link.springer.com/chapter/10.1007/978-3-319-19644-2_43 defines a sentence-level semantic similarity measure. A survey about this can be found at https://www.sciencedirect.com/science/article/pii/S0952197619301745. Concerning deep learning algorithms, they will be mainly used for automatic classification. Further information can be easily found at https://link.springer.com/article/10.1007%2Fs11192-020-03474-w.
  3. Deep learning requires training data; how is this data acquired? There is a large amount of scholarly information in Wikidata. We will use it as a training set (a sketch of how such a set could be extracted follows this reply).
  4. How are the deep learning models evaluated? We have developed a semantic-aware evaluation method for automatic classification. It is under review, so I cannot explain more.
    1. Which precision and recall do you expect? (based on the existing research) You can see https://link.springer.com/article/10.1007%2Fs11192-020-03474-w for further detailed information.
    2. How many wrong topics and affiliations do you expect the model to generate (and to be added to Wikidata)? We are still experimenting with various methods, so we cannot yet draw conclusions about the accuracy of the model.
    3. Many machine learning models have been found to be biased in various ways. How will you mitigate bias? We will absolutely do that using logical constraints: the algorithms will involve several conditions to eliminate bias.
Yours Sincerely, --Csisc (talk) 12:28, 28 September 2020 (UTC)
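(Illustrative note on item 3 above: one way a labelled training set could be pulled from Wikidata, as an assumption rather than the grantee's method. It retrieves (title, main subject) pairs for scholarly articles via SPARQL using the SPARQLWrapper package.)

    # Pull (title, main subject) pairs from Wikidata as a labelled training set.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="wikicite-trainingset-sketch/0.1")
    sparql.setQuery("""
    SELECT ?title ?subjectLabel WHERE {
      ?article wdt:P31 wd:Q13442814 ;      # instance of: scholarly article
               wdt:P1476 ?title ;          # title
               wdt:P921 ?subject .         # main subject (the label to learn)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 1000
    """)
    sparql.setReturnFormat(JSON)

    training_set = [
        (row["title"]["value"], row["subjectLabel"]["value"])
        for row in sparql.query().convert()["results"]["bindings"]
    ]
    print(len(training_set), "labelled examples; first three:", training_set[:3])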
Thanks for your thorough reply! I am not convinced that the information generated by automatic classification will be high-quality enough to include in Wikidata. While Wikidata contains a large amount of scholarly information, I am not aware that it adheres to a particular standard of quality and completeness. (In fact, the motivation section describes Wikidata as a "distorted mirror".) This makes Wikidata, in my opinion, unsuitable as a gold standard for your machine learning models. "Approximate matching-based unsupervised document indexing approach: application to biomedical domain" states precision between 0.6 and 0.8 and recall between 0.2 and 0.5 for the "proposed approach" (depending on the number of concepts). To me these figures indicate that this indexing approach is useful for search engines and document discovery right now, but the data should probably not be added to a semi-permanent database. Using logical constraints to prevent bias is not a method I am familiar with, but the claim that you "eliminate bias" seems too good to be true considering that NLP at large still struggles to minimize bias. --Pyfisch (talk) 09:02, 3 October 2020 (UTC)
Pyfisch: I thank you for your answer. I did not say that we will directly use the automatic classification algorithm having a precision of 0.8 and a recall of 0.5; it is just an excellent starting point for our project. We will work on the algorithm before applying it to our work. An example of how this algorithm can be improved is the application of the sentence-level semantic similarity measure that our research team has developed: https://link.springer.com/chapter/10.1007/978-3-319-19644-2_43. This measure can be used to compare the title of a publication with the titles of other research publications about the same topic. We can also let topics be validated by autoconfirmed users, before inclusion in Wikidata, using a Wikidata Game. --Csisc (talk) 13:18, 4 October 2020 (UTC)
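(Illustrative note: the title-comparison check described above, but with plain TF-IDF cosine similarity standing in for the team's own sentence-level semantic similarity measure, which is not public here. All titles and the threshold are made up.)

    # Compare a candidate title against titles already labelled with a topic.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    labelled_titles = [
        "A survey of semantic similarity measures for ontologies",
        "Ontology-based semantic similarity in biomedical texts",
    ]
    candidate = "Sentence-level semantic similarity for scholarly articles"

    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(labelled_titles + [candidate])
    scores = cosine_similarity(vectors[len(labelled_titles)], vectors[:len(labelled_titles)])[0]

    THRESHOLD = 0.3   # arbitrary; would have to be tuned against held-out Wikidata data
    if scores.max() >= THRESHOLD:
        print("candidate looks related to the topic (max similarity %.2f)" % scores.max())
    else:
        print("not confident enough to propose the topic (max similarity %.2f)" % scores.max())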

Is this grant part of a research project?

@Csisc: Thank you for your answers to my questions above. This is an interesting proposal that would obviously improve the corpus of scholarly papers in Wikidata, but I am still unsure about some aspects of it.

Please can you clarify whether this proposal is part of a research project at the University of Sfax that your team is working on? You have explained that the computer purchased with this grant will be hosted in the Faculty of Sciences of Sfax. However, the budget does not include staff costs, and it is important to know whether you intend to edit Wikidata as part of your employment as researchers. Simon Cobb (Sic19 ; talk page) 22:09, 30 September 2020 (UTC)

@Sic19: The Data Engineering and Semantics Research Team has been working for years to provide free and open computer applications that are easily accessible. Most of the members do this for science, because they believe that providing free and open tools for software development can help build a low-cost information system infrastructure for developing countries, particularly Tunisia. They are also open science advocates working to open research data in Tunisia and abroad, so they have rarely applied for money to work on open science. All the money we have obtained from grants is spent on materials, travel and training so that we can work on research projects. The team is currently interested in enhancing Scholia from a tool that generates profiles of researchers and countries into a tool that can return living scientometric studies. Succeeding in creating scientometric studies that can be updated in near real time would be a breakthrough in computer science and also in library and information science. However, we found that Wikidata's bibliographic information still lacks important data, such as topical coverage and citations, and cannot yet serve such a promising purpose. That is why we thought of enriching Wikidata with data from multiple sources before working on Scholia. However, being based in Tunisia, we do not have the means to do this work: the computers available at the university are too old and poorly maintained, and we do not have a server. That is why we applied for a WikiCite grant. --Csisc (talk) 23:58, 30 September 2020 (UTC)
@Csisc: Thank you again for your answer. Firstly, let me assure you that in asking this question I am not suggesting that your grant application is anything other than a genuine attempt to make a positive contribution to the WikiCite corpus and open citations in general. We have met in person, and spoken on other occasions, so I do not doubt at all that you are honest and intend to use the computer as you have described. By asking questions, I am trying to help you strengthen your proposal because, initially, it was difficult to understand what the project was trying to achieve.
The specific reason I am asking about the relationship between this proposal and the research you are conducting is that there could, potentially, be a conflict of interest. Without transparency, a conflict of interest can be perceived even if there is not an actual conflict. In this instance, there could be a potential conflict of interest if your employment or research activities influence the editing of Wikidata during the project. Conversely, there may be a risk of bias, for example, if the citation data imported during the project were included in analyses as part of a research project. And a project to increase citations in a field in which the project team is active could also be perceived as a conflict of interest.
Or, there might not be a conflict of interest. But it is only possible to know if sufficient detail about the project is provided in the grant application. The motivation on the grant page is only to improve the coverage in Wikidata, which is great, but the motivation described above is great too and very much worth including in the proposal. You're absolutely correct about the gaps in Wikidata bibliographic information; it is also a problem in many disciplines in the social sciences and humanities, for example. Unfortunately, it isn't feasible to import everything at the moment. I am not sure exactly what sort of real-time scientometric studies you hope to create, but Dimensions on Google BigQuery was released this week and the COVID dataset is free to access. Simon Cobb (Sic19 ; talk page) 21:27, 1 October 2020 (UTC)
@Sic19: I thank you for your answer. I am certain that you provided your comments in good faith. There is no conflict of interest between this grant and our ambitions, as we did not apply for any grant for the development of living scientometric studies; it is a dream that we would like to achieve. The purpose of adding citations and other metadata is to increase the coverage of the bibliographic information in Wikidata. Broad coverage of bibliographic information in Wikidata will allow us to test our development of living scientometric studies. A living scientometric study is simply a scientometric study constituted of SPARQL queries: it represents its introduction, results and discussion in the form of SPARQL visualizations, which allows the study to be updated in near real time. The idea is derived from Living Systematic Reviews (https://community.cochrane.org/review-production/production-resources/living-systematic-reviews). Yours Sincerely, --Csisc (talk) 23:28, 1 October 2020 (UTC)
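(Illustrative note: a hedged sketch of one building block of such a "living" study: a single SPARQL query against Wikidata counting scholarly articles per year for one main subject, re-runnable at any time so the figure stays current. The topic QID is an assumption, believed to be computer science; substitute any main-subject item.)

    # Count scholarly articles per publication year for one main subject.
    from SPARQLWrapper import SPARQLWrapper, JSON

    TOPIC_QID = "Q21198"   # assumed to be "computer science"; replace with any main-subject QID

    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="living-scientometrics-sketch/0.1")
    sparql.setQuery("""
    SELECT ?year (COUNT(?article) AS ?papers) WHERE {
      ?article wdt:P31 wd:Q13442814 ;          # scholarly article
               wdt:P921 wd:%s ;                # main subject
               wdt:P577 ?date .                # publication date
      BIND(YEAR(?date) AS ?year)
    }
    GROUP BY ?year
    ORDER BY ?year
    """ % TOPIC_QID)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["year"]["value"], row["papers"]["value"])

Re-running this query (for example from a Scholia-style page) is what would keep such a study "living": the richer the citation and main-subject coverage in Wikidata, the more meaningful the result.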