Research talk:Scholarly article citations in Wikipedia/Work log/2015-02-09
Monday, February 9, 2015
I did some work this morning with v0.0.5 of https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia.
Extract a random sample of DOI citations
$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | shuf -n 1000 > sample_doi.1k.tsv
Using crossref to check DOIs
$ cat sample_doi.1k.tsv | awk -F"\t" '{print "http://api.crossref.org/works/"$6"/agency"}' | xargs -I {} bash -c "wget --quiet -O- '{}' | sed -r 's/(.*)/\1\n/'" > doi_agencies.1k.json
Convert DOIs to sorted sets and diff
$ cat sample_doi.1k.tsv | tail -n+2 | cut -f6 | sort | uniq | tr '[:upper:]' '[:lower:]' > sample_doi.1k.set.tsv
$ cat doi_agencies.1k.json | mwstream json2tsv message.DOI | sort | uniq | tr '[:upper:]' '[:lower:]' > doi_agencies.1k.set.tsv
$ diff sample_doi.1k.set.tsv doi_agencies.1k.set.tsv | grep "^<" | sed -r "s/^<\s(.+)/\1/" > missing_doi.1k.tsv
$ wc missing_doi.1k.tsv
103
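The same set difference can also be computed with `comm`, which avoids filtering `diff` output. This is a minimal sketch, not the command used for this log: `missing_from` is a hypothetical helper name, and it assumes one DOI per line in each input file. It lowercases before sorting so that both files are deduplicated and ordered the same way.

```shell
# Print DOIs that appear in file $1 but not in file $2 (case-insensitive).
# Hypothetical helper; assumes one DOI per line in both files.
missing_from() {
  mf_a=$(mktemp)
  mf_b=$(mktemp)
  # Lowercase first, then sort and deduplicate under a fixed locale,
  # because comm requires both inputs in the same sort order.
  tr '[:upper:]' '[:lower:]' < "$1" | LC_ALL=C sort -u > "$mf_a"
  tr '[:upper:]' '[:lower:]' < "$2" | LC_ALL=C sort -u > "$mf_b"
  # -2 drops lines unique to the second file, -3 drops lines common to both.
  LC_ALL=C comm -23 "$mf_a" "$mf_b"
  rm -f "$mf_a" "$mf_b"
}
```

Usage would look like `missing_from sample_doi.1k.set.tsv doi_agencies.1k.set.tsv > missing_doi.1k.tsv`.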
Spot-checked missing DOIs
$ shuf -n10 missing_doi.1k.tsv
- http://dx.doi.org/10.1007/pl00005669 -- resolves
- http://dx.doi.org/10.1016/j.intell.2006.03.005 -- resolves
- http://dx.doi.org/10.1080/0963749032000045837 -- resolves
- http://dx.doi.org/10.1525/auk.2009.03409.2 -- not found
  - Added in http://enwp.org/?oldid=630201543&diff=prev
  - Extracted as expected from "<ref name=Auk>{{cite doi|10.1525/auk.2009.03409.2}}</ref>"
- http://dx.doi.org/10.1666/0022-3360(2005)079[0981:arodly]2.0.co;2 -- resolves
- http://dx.doi.org/10.1162/jinh.2008.38.3.499 -- resolves
- http://dx.doi.org/10.1109/iadcc.2014.6779425 -- resolves
- http://dx.doi.org/10.4202/app.2011.0120 -- resolves
- http://dx.doi.org/10.1007/s10482-011-9605-y -- resolves
- http://dx.doi.org/10.1016/0030-4220(76)90098-0 -- resolves
$ shuf -n10 missing_doi.1k.tsv
- http://dx.doi.org/10.1002/(sici)1521-3773(19990215)38:4<428::aid-anie428>3.0.co;2-3 -- resolves
- http://dx.doi.org/10.1084/jem.20012024 -- resolves
- http://dx.doi.org/10.1017/s0266462309090035 -- resolves
- http://dx.doi.org/10.1093/hmg/6.2.317 -- resolves
- http://dx.doi.org/10.1101/gad.989402 -- resolves
- http://dx.doi.org/10.1021/ed007p2875 -- resolves
- http://dx.doi.org/10.2307/27570652 -- not found
  - Added in http://enwp.org/?oldid=557094374&diff=prev
  - Extracted as expected from "| url = http://www.jstor.org/stable/10.2307/27570652"
- http://dx.doi.org/10.1093/oxfordjournals.tropej.a057419 -- resolves
- http://dx.doi.org/10.1093/oi/authority.20110803100400841 -- not found
  - Added in http://enwp.org/?oldid=596786148&diff=prev
  - Extracted as expected from "|url=http://www.oxfordreference.com/view/10.1093/oi/authority.20110803100400841"
  - Might not be a DOI, but it does look like one.
- http://dx.doi.org/10.1086/302282 -- resolves
Well... that looks good. Almost all of the IDs that don't resolve with CrossRef resolve just fine with dx.doi.org (17 of the 20 spot-checked). The ones that don't resolve were still extracted as expected from the wikitext.
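A spot check like this could be scripted along the following lines. This is only a sketch under stated assumptions: it assumes `curl` is available and that the resolver at https://doi.org (the log used http://dx.doi.org) answers with a 3xx redirect for a registered DOI and 404 otherwise. `doi_http_code` and `classify_status` are hypothetical helper names. DOIs containing characters such as `<` or `[` would need percent-encoding first, which this sketch does not do.

```shell
# Ask the DOI resolver for the HTTP status of a DOI without following redirects.
# Assumption: https://doi.org/<doi> is the resolver endpoint.
doi_http_code() {
  curl --silent --head --output /dev/null --write-out '%{http_code}' "https://doi.org/$1"
}

# Map an HTTP status code to the labels used in the spot check.
classify_status() {
  case "$1" in
    3??) echo "resolves" ;;        # redirect to the publisher's landing page
    404) echo "not found" ;;       # resolver does not know this DOI
    *)   echo "unknown ($1)" ;;    # anything else needs a manual look
  esac
}
```

For example, `classify_status "$(doi_http_code 10.1086/302282)"` would print one of the three labels for that DOI.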
Counts
- DOI/Page pairs
  - $ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | wc
  - 742565 5269445 63756121
- PubMed ID/Page pairs
  - $ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep -v "doi" | grep -E "pmcid|pmid" | wc
  - 437484 3011320 30215760
- Unique DOIs
  - $ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | cut -f6 | sort | uniq | wc
  - 524357 524357 13332518
- Unique pages with DOIs
  - $ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | cut -f1 | sort | uniq | wc
  - 172644 172644 1438573
- Unique pages with PubMed IDs
  - $ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep -v "doi" | grep -E "pmcid|pmid" | cut -f1 | sort | uniq | wc
  - 68648 68648 575015
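These counts filter with `grep doi`, which matches the string anywhere on the line. A column-based variant is sketched below. It assumes the citation type is stored in column 5 of the TSV (the commands in this log only confirm that column 1 identifies the page and column 6 holds the identifier), and `count_unique` is a hypothetical helper name.

```shell
# Count distinct values of one column among rows of a given citation type.
#   $1: TSV file   $2: citation type (e.g. doi, pmid)   $3: column number to count
# Assumption: the citation type is in column 5.
count_unique() {
  awk -F '\t' -v type="$2" -v col="$3" '
    $5 == type && !($col in seen) { seen[$col]; n++ }  # first time this value appears
    END { print n + 0 }                                # print 0 when nothing matched
  ' "$1"
}
```

Under that assumption, `count_unique doi_and_pubmed_citations.enwiki-20150112.tsv doi 6` would count unique DOIs and `count_unique doi_and_pubmed_citations.enwiki-20150112.tsv doi 1` would count unique pages with DOIs.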
--22:52, 9 February 2015 (UTC)