Talk:Massively-Multiplayer Online Bibliography
Summary
The description of this project helps me understand the problem, but I fail to see how this problem can be solved. I am not sure what it is that I do not understand about this, and I am unable to articulate a question. Is this a proposal to add categories or tags to every kind of media that exists? Blue Rasberry (talk) 15:33, 19 August 2013 (UTC)
- Not quite: it is for volunteers to browse and read (random or self-selected) essays and articles on the open Web and to classify them using either controlled vocabularies (such as the LCSH, see the main page) or Wikidata (Wikipedia) items. I have now added a clearer statement of the proposed solution as well as an example. Please take a look and tell me if it's clearer. Thanks! Asaf Bartov talk 08:18, 20 August 2013 (UTC)
Scope
Why only "aboutness"? --Nemo 17:10, 19 August 2013 (UTC)
- MMOB is envisioned as an umbrella for several projects, not just the Aboutness Project. Some projects will be infrastructural (e.g. the Table of Contents of Everything, which I am yet to articulate), some more high-level, like the Aboutness Project. The common thread is crowdsourced human intelligence in the service of AI-complete bibliographic tasks, on the open Web and using open standards and free software.
- That said, as we develop this idea, it may well make sense to collect more metadata while we're at it. I focused on "aboutness" and on individual articles and essays because it strikes me as a sweet spot in terms of clarity of the task for the volunteer ("What is this text about?" [given searchable navigable data sets to pick from] is relatively easier than expecting volunteers to figure out the niceties of edition details, for example, or adaptation-of/adapted-from relationships.) and relatively-uncovered need (essays are "invisible" in almost all catalogues, as explained on the main page).
- Does this make sense? Asaf Bartov talk 09:00, 20 August 2013 (UTC)
I guess you already answered my question on scope when you said the initiative focuses on "individual articles and essays". For the sake of clarity, I nonetheless post the question: Will this project focus on textual resources? On printed resources? Or does it aim at generating "aboutness" statements for any cultural work (e.g. photos, digitized paintings, photos of sculptures, digitized manuscripts etc.)? Acka47 (talk) 08:02, 21 August 2013 (UTC)
- First of all, thanks for all the helpful edits on the project page! To your question: certainly, the basic idea can apply to any kind of work or resource. However, since this is a bottom-up initiative, aiming at low-hanging fruit and well-understood issues, I'd tend to stick to the digital text space for a start, and once we have something going and learn from it (as one learns from all babies...), we can be bolder and expand or generalize the scope (and whatever software/sites we end up producing) to encompass these other types of works. Asaf Bartov (WMF Grants) talk 08:32, 21 August 2013 (UTC)
I see "about" as one facet, but given the vast number of resources in existence, I'd like to see something that tells people more about the works and the relationships between the works. If I'm starting to explore a new topic, what should I read first? What should I read first if I am a 12-year-old vs. if I am a scholar in a related topic area? Which of the resources listed had the greatest influence on later works? Which were best-sellers? Which won awards? Which are used frequently in college course syllabi? We have lots of bibliographic data with at least some topical indication, but nothing that helps you navigate based on your own needs. LaMona (talk) 15:07, 17 October 2013 (UTC)
JISC Open Bibliography 2 project
I believe we should be aware of this project, and that we should take care not to needlessly duplicate any effort. While the JISC/OKFN project seems to focus on metadata about scholarly articles, citations etc., the MMOB/Aboutness idea seems to be broader and to focus mainly on books; is that right? --Lambo (talk) 01:03, 20 August 2013 (UTC)
- Thanks for this reference! Is that project active at all? I'm happy to collaborate with absolutely anyone interested in open bibliography!
- Poking around, though, it seems a little abandoned :(. The Bibsoup looks empty ("not found" on several I tried), the bibserver.org domain seems to have expired, and most of the "work packages" seem to have ground to a halt a little more than a year ago, including the Bibserver codebase. However, it may well be something to build on, once we get down to implementation! So I absolutely would like to collaborate with JISC, OKFN, and the openbiblio folks. I have also independently invited Karen Coyle, whom I've had the pleasure of collaborating with briefly on the W3C Library Linked Data Incubator Group a couple of years ago, to take a look at MMOB.
- Finally, there does seem to be a clear difference in focus -- what I'm seeing in the JISC project is focused on consumer bibliographic information, i.e. personal reference lists and bibliographies, whereas my focus is on building a general public resource, more like a library catalogue than a personal data set. Asaf Bartov talk 09:00, 20 August 2013 (UTC)
Internet Archive
The Archive has copies of digitized books, including most public domain books from the Google project. Most books have a library MARC record associated with them, which includes classification numbers and subject headings. These can be extracted from the data dump or by using the API. There should be about 1.2 million public domain books, but there are also many others that are not PD.
If there is interest in bibliographic data alone, then the Archive has about 8 million Library of Congress records. Also, the DataHub has about 90 bibliographic datasets available, including the British National Bibliography, and data from a number of different national libraries. LaMona (talk) 16:13, 20 August 2013 (UTC)
- Certainly, the Archive is an excellent source for book-level texts. We may well make use of it as a primary source for the Table of Contents of Everything project. For the Aboutness Project, however, I'd like to focus on the (much smaller) pool of work-level URI addressable texts, and the Archive is less useful for that. Asaf Bartov (WMF Grants) talk 04:27, 25 August 2013 (UTC)
Mendeley
Not sure if this is useful, but Mendeley has a webservice from which you can obtain human-curated tags for around 80M documents. The data are openly licensed under CC-BY. The data are just available as JSON, not in any official metadata standard format, but could be a useful scaffold for seeding efforts. -- The preceding unsigned comment was added by Williamgunn (talk) August 22, 2013 (UTC)
- Thanks, William! That's certainly appealing. Am I correct in understanding one must "register an app" to get data from this service? Asaf Bartov (WMF Grants) talk 20:23, 22 August 2013 (UTC)
- His reply, via Twitter: "We just ask for a name and email so we can meter API use, similar to Pubmed." Klortho (talk) 22:04, 24 August 2013 (UTC)
- Yes, we give out API keys, not to control who can and can't access, but to do rate-limiting so that everyone who wants to access the info is on a level playing field. At the moment, we don't have the resources to give a 15 TB dump file to everyone who wants one, so if that makes us ineligible, so be it, but just sharing in case it helps. --Williamgunn (talk) 16:28, 26 August 2013 (UTC)
Wikipedia library and OCLC
There seems to be a fair amount of overlap in scope between what you are doing and some of the things described on the w:Wikipedia:The_Wikipedia_Library/OCLC page, but I don't see any overlap in participants. Are you guys talking? Klortho (talk) 19:22, 24 August 2013 (UTC)
- Actually, I don't see much overlap. As far as I can tell, that page is mostly about access to existing resources. Here, I'm talking about generating metadata for existing resources. What overlap do you see? Asaf Bartov (WMF Grants) talk 04:30, 25 August 2013 (UTC)
- Perhaps you're right. I guess the overlap is really with OCLC itself, rather than with that group of Wikipedians. I don't know much about it, but maybe w:WorldCat is a resource you could integrate. Klortho (talk) 14:10, 25 August 2013 (UTC)
Comments from Edsu
(pasted from e-mail on libraries@lists.wikimedia.org)
It's an interesting idea, thanks for throwing it out there. Just to play devil's advocate a little bit, aren't most of the citations and external links in Wikipedia articles assertions of "aboutness"? Edsu
- Some are, to be sure! And it would be an interesting path to explore to try to figure out how much and how useful it would be to "seed" or "suggest" (to human volunteers) classifications for essays in this semi-automatic way! Asaf Bartov talk
How is what you are proposing different? Edsu
- I am proposing a human volunteer reading an essay (begin with the text), then (conveniently! a UI/usability challenge!) selecting topics from controlled vocabularies and Wikidata item titles and asserting the essay is about those things. Asaf Bartov talk
For example, from the English Wikipedia Article for Friendship you could derive the following RDF assertion: Edsu
<https://en.wikisource.org/wiki/Essays:_First_Series/Friendship> dcterms:subject <http://www.wikidata.org/entity/Q491> .
- Yes, but equally, from the same "Further reading" section, your algorithm would assert Aristotle's Nicomachean Ethics is about Friendship, which would be misleading -- friendship is certainly a theme in that book, taking up perhaps 15% of the discussion, but it would be misleading to assert the entire book is about friendship, unless you also assert all the other topics it includes (virtue, moderation, the examined life, etc.).
- That very issue is another reason I'm focusing on individual essays and articles for this particular project (note that the MMOB vision is very broad and foresees multiple projects, some infrastructural and some higher level, all long term and ongoing). Complete works (such as the Ethics) are already catalogued and classified reasonably usefully by traditional catalogues. The goal here is to extend this to the vast space of essays, which are on the one hand nearly invisible to topical searches (as distinct from full-text searches), while on the other hand usually confined to one (or rarely two) clear topics, making the human classifier's work simpler. Asaf Bartov talk
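A minimal sketch of the kind of assertion discussed in this thread, written out as N-Triples lines, one per (work, subject) pair. The two URIs are the ones quoted in the example above and `http://purl.org/dc/terms/subject` is the expanded form of `dcterms:subject`; the function name `aboutness_triples` is hypothetical and nothing here is an agreed MMOB format.

```python
# Sketch only: serialize "aboutness" assertions as N-Triples lines.
# DCTERMS_SUBJECT is the full URI behind the dcterms:subject prefix form.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def aboutness_triples(work_uri, subject_uris):
    """Return one N-Triples line per subject asserted for the work."""
    return [
        f"<{work_uri}> <{DCTERMS_SUBJECT}> <{subject}> ."
        for subject in subject_uris
    ]


# URIs taken from the example in the discussion above.
essay = "https://en.wikisource.org/wiki/Essays:_First_Series/Friendship"
friendship = "http://www.wikidata.org/entity/Q491"

for line in aboutness_triples(essay, [friendship]):
    print(line)
```

Because the function takes a list of subjects, a work with several themes would get several assertions rather than a single one that might overstate its topic.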
I guess answering my own question a bit, perhaps it could be easier for people to make these assertions as they are reading material on the web...and that perhaps not all of them belong in the citation or external links sections of Wikipedia articles? Some articles could get a bit long and unwieldy. I remember a social bookmarking site called Faviki (www.faviki.com) that uses Wikipedia as a controlled vocabulary for tagging content while bookmarking it. Is that similar to what you are thinking about? Edsu
- Hmm, yes! Thanks for this reference! Yes, Faviki is very much along the lines I'm thinking about. The obvious difference I see, having only read Faviki's about page so far, is that it classifies arbitrary Web pages (rather than a well-defined set of works), i.e. is broader in its target scope, and relies only on DBpedia concepts, which is narrower than the combined authority-files-and-Wikidata approach I have in mind. It's also not immediately clear where the data resides, how re-usable it is, etc., but perhaps further inquiry will reveal this. But again, this is very much the direction I was thinking of. It would be interesting to see if the Faviki maintainer would be interested in joining this conversation. Thanks again for engaging! Asaf Bartov talk
Comments from Kosboot and LA2
(pasted from e-mail on libraries@lists.wikimedia.org)
It's interesting to me that databases like JSTOR don't use subject headings except with regard to the discipline of the journal where the article first appeared. Kosboot
- Wikisource does have access to volunteers, but the individual articles in Popular Science Monthly and other journals or magazines aren't being systematically cataloged and indexed (or categorized) as they could. This is because our supply of volunteers is not infinite, even if the project is open to anybody.
- Similarly, Wikipedia is quite large, but only in very few languages. In most languages it is quite small, because of the limited number of volunteers. LA2
- Agreed, volunteers are not infinite. However, volunteer motivation is a flexible quantity, and does depend on the quality of the tools provided to them, and the clarity of the value proposition of the work they are invited to do. I'd argue the main reason Wikisource material doesn't get cataloged, categorized, or tagged as much as it could be, is that no compelling tool/system has been put in front of Wikisource volunteers (and prospective volunteers too!) so far. Asaf Bartov talk 05:34, 21 September 2013 (UTC)
Questions from Aubrey
(pasted from e-mail on libraries@lists.wikimedia.org)
I think that Asaf's idea is very interesting, but of course my ultimate and neverending goal is to have Wikisource being a part/partner of it :-)
I have very unclear ideas about this, but:
- couldn't the project completely rely on Wikidata? You can have an item for (almost) every record: http://www.wikidata.org/wiki/Help:Sources -- Micru (in copy) can explain more about this.
- couldn't we take all the Open Library data? are they CC0?
- how do you see the relationship of this with Wikipedia and Wikisource?
One of the things I think about most is the fact that in Wikisource we actually use some ad hoc templates for cited authors and cited works. Example: [1]
Every blue link is a wikilink to another Wikisource work/author page.
Moreover, at the bottom of the page you can see categories that list every citation of every author/work in Wikisource. I mapped this kind of relationship from a "mentions" property from schema.org to a wikidata property (the whole mapping we used as a draft is here: [2])
I think that these templates could convey (in a way I don't know) a "mentions" property in Wikidata: ex. Book Q98 mentions Author Q42, or something like this.
Do we want a "cited thing/concept/item" template? That could link directly to Wikipedia, for example.
In my ideal digital library, these kinds of annotations would be made on a separate layer, and not in the wikitext (as we are doing now).
Of course, I can and will discuss this with the Pund.it folks [3] at the biblio-hackathon we will host at the National Library of Florence in October.
Finally, I would recommend discussing all these things in our beloved Books task force: [4] :-) Aubrey
- Thanks for these thoughts, and the general enthusiasm! :)
- To your questions:
- 1. Could we completely rely on Wikidata? I hope so! But it's not at all certain. Remember, it is absolutely key to my vision that we fanatically focus on the work level rather than the book/volume level. If/when Wikidata would accept a first-level entity for (potentially) every non-fiction article ever written, we'd be able to use Wikidata as a primary store. I suspect this is not close to reality at this point, and perhaps never.
- 2. Sure, we could grab Open Library data, but again, they are primarily at the book level. I think we should start with two sources: collections that are already work-level (parts of Wikisource, and perhaps early outputs from the Table of Contents of Everything project (see main page)).
- 3. (nonfiction articles/essays in) Wikisource can be the first corpus of texts we try to classify, so I'd say Wikisource is front and center! :) Wikipedia article titles (probably through Wikidata labels) can be one of the values of "aboutness" -- i.e. we can let volunteers say a piece is about William Shakespeare, and thereby link the aboutness assertion to VIAF etc.
- Re adding "cites thing/concept" alongside aboutness -- certainly, we could do that as well. If there's energy around this, we can make it part of whatever we end up implementing, alongside aboutness.
- Is the Pund.it service still alive? It's completely unreachable at the moment, for me. :( Asaf Bartov talk 06:21, 21 September 2013 (UTC)
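To make the distinction raised in this thread concrete, here is a minimal sketch of a work-level record that keeps "about" and "mentions" as separate sets. The class name `WorkRecord` and the helper `mentioned_only` are hypothetical; Q98 and Q42 are the placeholder IDs from the comment above and Q491 is borrowed from the earlier RDF example, so none of them should be read as real catalogue data or as an actual Wikidata property.

```python
from dataclasses import dataclass, field


@dataclass
class WorkRecord:
    """One work-level text with two separate kinds of reference."""
    work_id: str
    about: set = field(default_factory=set)     # what the text is about
    mentions: set = field(default_factory=set)  # what it merely cites or names


def mentioned_only(record):
    """Items the work mentions without being about them, sorted for stable output."""
    return sorted(record.mentions - record.about)


# Placeholder identifiers, used only for illustration.
book = WorkRecord("Q98")
book.mentions.update({"Q42", "Q491"})
book.about.add("Q491")

print(mentioned_only(book))
```

Keeping the two sets apart means a citation template could feed `mentions` automatically, while `about` stays reserved for what a volunteer asserts after reading the text.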
Feedback from Micru
Thanks for sharing your thoughts, Asaf. I have discussed several times with Andrea, Gerard, and others the need for a portal presenting all the bibliographic information from Wikipedia and Wikisource in a user-friendly format; I am glad that you put it into words. The key for this to happen is of course Wikidata: once the structure is defined (done) and the information from infoboxes, citation templates and Wikisource is imported (that might take some months), it should be quite easy to have a portal like the one you suggest. For items representing people we already have something like that: [5]
As for the aboutness, the needed property is already in discussion, however as Bob has mentioned, that is only part of the solution. For the searches to yield more results, the Wikidata implementation of Wiktionary should be in place, then it would be easy to connect synonyms and related words without having to resort to controlled vocabularies.
Even then, I would like to ask you, is it really that useful? I consider that a finer granularity might be more interesting for researchers (thepund.it seems like a good candidate), and if it is about reading recommendations, then recommender engines work quite well, but that is a different story.
About importing all the metadata, I am not sure that would fall within the scope of Wikidata. The mission is to support WM projects; that there is metadata at all is just a byproduct, not the primary aim. Micru
- Thanks for the feedback.
- Please note you're talking about something different from what I describe in the page itself -- you describe presenting existing bibliographic information better, whereas I am talking about a volunteer project to create additional bibliographic information, that largely does not exist today, neither on Wikidata/Wikipedia nor elsewhere.
- I agree that synonyms powered by Wikidata would be a significant improvement for searches.
- Thanks for the pointers to the relevant properties on Wikidata. They will definitely be useful down the road.
- As for usefulness -- I suppose it is just as useful as subject classification for book level publications; perhaps you don't think those are valuable, for monographs? The key here is the work-level classification, which would make discoverable for the first time numerous treasures currently not covered by book-level classification.
- Haven't been able to visit thepund.it yet -- I hope it'll come back online soon so I can take a look. I did almost manage to meet Michele from that team when I visited Pisa about three years ago, but it was logistically impossible, in the end. :( Asaf Bartov talk 06:45, 21 September 2013 (UTC)
Tractatus Logico-Bibliographicus
1. It is logically possible to describe everything as a text.
1.1. Such a description is just a question of building a suitable logic.
1.1.1. Or, if you wish, picking it from the multi-infinite array of existent, less existent, and nonexistent logics.
2. Hence, everything can be a reference on something else.
2.1. This could also be true in a more traditional sense, lest us define Aristotelian teleological relations in a referational way.
3. Now, we can differ certain kinds of reference, like a mention and an aboutness.
3.1. That difference might, perhaps, be likened to the difference between a variable or function being mentioned in a program, vs it being recalled.
3.1.1. One might also see it as similar to the difference between a metaphor and a likeliness - by itself, yet other kinds of reference - except it is not unavoidable to be agreed upon.
3.1.2. Nevertheless, it might be prudent to note that each such relation being asymmetrically passive-aggressive, it should be catalogued at its both ends (or more, depending on the valencies of relations).
3.2. It might be worthy to note that a text can be about something it does not mention.
3.2.1. Something can be unmentioned yet aboutingly present in different ways, like directly and indirectly: "This is an essay about The One Who Must Not Be Called Upon" vs a description of everything around the center which has been diligently omitted.
3.2.2. This phenomenon is, unfortunately, not what many texts are about.
3.3. Then again, it is quite common for a text to mention many entities - or, as we should call them by now, texts - it is not about.
3.3.1. Therefore, it should be evident by now it is quite possible to have entities that mention a lot of things but are not about any of those.
3.3.2. Furthermore, it is also possible to be about nothing, just like it is possible to mention nothing but be about everything (or less).
3.3.3. Now, the question is, how should a universal library catalogue describe such relations?
4. On these premises, we could say that there is a stone in Timbuktu that is about Timbuktu.
4.1. Let it be recorded that we did not say whether the stone mentions Timbuktu.
4.2. Nevertheless, a universal catalogue should be able to catalogue all such texts (as entities) and their relations.
4.2.1. After all, it is no surprise to any librarian that there are books about books.
4.2.2. Hence, it is also no surprise that there are (somewhat probably rarer) books and other texts about themselves.
4.2.3. And if we sit down on a stone to ponder the subject, we might find ourselves quite unsurprised even by a stone being about itself, or autoreferential, autoreferentiality being a rather common, if little-discussed, trait among stones.
4.3. After which notion it would be quite logical to admit that our subcatalogue about all things Timbuktu contains, among the books, stones, birds and other signs, Timbuktu itself.
4.3.1. And in a tiny subcategory of that subcatalogue, there should definitely be enough room for a Timbuktu-related Oxford comma that is (was, will be) notably non-existent in our previous sentence.
4.3.2. Let that be called, for the purposes of logical compactness, a Timbuktu comma (see: Oxford).
5. Henceforth, let us display the little logical carpet of Timbuktuan aboutiveness.
5.1. Timbuktu can be about Timbuktu.
5.2. Timbuktu can be about something else.
5.3. Something else can be about Timbuktu.
5.4. Timbuktu can be about nothing.
5.5. Nothing can be about Timbuktu.
5.6. Nothing else can be about Timbuktu, as Timbuktu is all about itself there can be.
6. Yet, as we know, every text transforms into another, constantly changing and elusive text flow in the mind of the reader.
6.1. No text is final unless it reaches the mind of the reader and starts changing - that's how it becomes alive.
6.2. From this, we can make several conclusions.
6.2.1. There can be no text without a reader, as reader is all the text (in its textness) is about.
6.2.2. There can be no author without a text, because the text is all the author (in its authorness) is about.
6.2.3. Therefore, there can be no author without a reader.
6.2.3.1. Should a reader cease to exist, the whole world of a text would disappear along with it.
6.2.4. It must be true that a text could maintain itself by reading itself but there would be no way to discern the new text from the one read, nor left undiscerned.
6.2.4.1. Such cases would have to be catalogued via Schrödinger's Dual System.
6.3. All that places some unforeseen stress on the cataloguing system, as most of our everyday catalogue logics are not suitable for describing the actual universal references.
7. Luckily, there is an inborn category system in the texts - which, let us remind, means entities in the widest possible sense - themselves.
7.1. Every text is as it is, bringing along its own form of existence.
7.2. That particular form similes it to and differs it from other texts.
7.3. Thence existence, to the extent that any text exists - be it in or out of a mind -, places and locates the text in a complex system of categories according to its form.
7.4. That form, without which there is no text, can be defined as a library code in our catalogue logic.
7.5. Henceforth, the logic we have chosen to define, lets every text be placed to its logical place.
8. We have achieved an ontologically necessary catalogue logic that is not only able to cover all possible, necessary, and - necessarily - impossible texts and references, but also already does it.
8.1. All relations have been defined, all texts have been connected, all universes have been brought to the library.
8.2. All data is metadata.
8.3. Everything is in Timbuktu. --Oop (talk) 21:56, 5 April 2017 (UTC)
- (applause) Ijon (talk) 19:11, 7 April 2017 (UTC)
From Project Runeberg
It's nice to discover these ideas (by @Ijon:) from 2013 – six years old now – and to see my own Project Runeberg mentioned. Early on in Project Runeberg, categorizing the content was a possible future direction. But only very little work has been done in this direction, in the shape of a few dozen categories, applied only on the work level. One category is for dictionaries (132 books) and another is for books about Russia (137 books + 23 authors). The obvious suggestion from librarians to use a library classification system was initially rejected, as there were then too few books for it to make much sense.
What has happened since 2013 in Project Runeberg is that a lot more has been scanned and made available online, including several journals, and many of them have complete tables of contents. My own favorite is Ord och Bild, a literary/cultural journal of which we present 58 volumes from 1892 to 1949, a total of 10,100 articles on 41,500 pages. This journal alone could benefit from having its articles categorized.
One problem that I see with this MMOB proposal is that it is too vague and visionary ("wouldn't it be nice if..."). For a successful project, I think it is important to get started and to be useful already from the first day of work. So instead of a huge database and a plan that lasts a decade, I think we need a simple file format for the categorization. Should we use Dewey decimal classification? If so, the decimal numbers can be entered by keyboard at first, and automated into some semi-automatic point-and-click system later. --LA2 (talk) 02:02, 3 February 2020 (UTC)
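One way the "simple file format" idea could look, as a minimal sketch: a plain text file with one `work-id<TAB>number` pair per line, typed by hand and checked by a short parser. The layout, the `parse_categories` name, the `work-001` style identifiers and the sample numbers are all assumptions made for illustration; the pattern only checks the general shape of a decimal classification number (three digits, optional decimal part), not whether the number is valid in any edition of Dewey.

```python
import re

# Shape check only: three digits, optionally followed by a decimal part.
NUMBER_SHAPE = re.compile(r"^\d{3}(\.\d+)?$")


def parse_categories(text):
    """Parse 'work-id<TAB>number' lines into {work_id: [numbers]}.

    Blank lines and lines starting with '#' are skipped.
    """
    result = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        work_id, sep, number = line.partition("\t")
        number = number.strip()
        if not sep or not NUMBER_SHAPE.match(number):
            raise ValueError(f"line {lineno}: expected 'work-id<TAB>NNN[.N...]'")
        result.setdefault(work_id.strip(), []).append(number)
    return result


# Hypothetical identifiers and illustrative numbers only.
sample = "# one work per line\nwork-001\t413\nwork-002\t947.08\nwork-002\t413\n"
print(parse_categories(sample))
```

A file like this can be edited in any text editor from the first day, and a later point-and-click tool could write the same lines.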
Open Library has table of contents
Currently the only source I know of for table of contents data that is already processed is the Open Library, e.g. Sapiens. By processed, I mean it is already separated from other kinds of content (you don't have to extract it from a page scan) and has semi-standardized formatting/markup. For the markup, see the editing page. Daask (talk) 14:02, 11 June 2024 (UTC)