Research talk:Characterizing Wikipedia Citation Usage
What data we need
Below I'm listing a set of data points discussed across multiple research efforts to better understand Wikipedia citation usage. The list will be shared with those involved for their comments and for iterating on adding or dropping items. I expect us to converge on the list in the next few days, given that many components of it have already been discussed in the past.
An event is triggered when the user interacts with citations on a page. We want to be able to register:
timestamp: the timestamp of the event (events are triggered when a user lands on a page, when the user hovers over or clicks a citation, when a user clicks on an external link, ...).
externalClick: click on external URLS along with the corresponding timestamp for those clicks.
revisionID: the revision ID of the page where the event occurred.
pageId: the ID of the page where the external link was clicked.
pageNamespace: the namespace of the page where the event occurred (pageview, click on citation).
sectionID: ID of the section containing the external link that was clicked/hovered-over.
elinkText: The text of the external link; space-normalized string within the anchor element
elinkProtocol: The protocol of the clicked external link -- if specified
elinkDomain: The full domain name of the clicked external link
elinkPath: The path of the clicked external link -- if specified
elinkQuery: The query string of the clicked external link -- if specified
freelyAccessible: True if the clicked external link has a green 'Freely Accessible' icon next to it
elinkOccurrence: The number of times this external link's href occurs on this page
internalClick: Clicks on page-internal links (e.g., “[1]”) that take the user to the reference section at the bottom, and events fired when the user hovers over a reference marker (e.g., “[1]”) in the article body
upClicks: Clicks that take the user from the reference at the bottom back to the anchor (e.g., “[1]”) in the main text (e.g., on “^”)
wikiProjects: A string array of all the WikiProjects this page belongs to
pageQualities: A string array of WikiProject quality assessments for this page. Strings will be formatted as "WikiProject:quality": For example: ['Medicine:B-Class','Veterinary_medicine:GA-Class']
totalElinks: The total number of external links on this page. Excludes links marked as class 'external' where the FQDN matches the FQDN of the current page
elinkPosition: The ordinal position of this external link within the list of all external links on this page
citationNumber: If this external link falls within a cited reference list, the ordinal position of the citation within the reference list
citationInTextRefs: If this external link is a cited reference, the number of times it's cited on the page. Count of (span[class='mw-cite-backlink']/a)
citationPrecedingLabel: If this external link is a cited reference and the preceding link is an identifier label (DOI, PMID, PMC), report the preceding link label.
timeBeforeClick: The number of seconds spent on this page before this external link click occurred
- Is this easy to measure at the data collection time or is it better to do post-processing for this? I /think/ the latter.
elinksClicked: The number of external links clicked by this user agent while viewing this page during this session
- Drop this if we can have sessionID and userID as it can be counted afterwards.
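As a rough illustration (not part of the proposal), here is what one event record carrying the fields above might look like. The URL and all values are made up, and the elink* fields can be derived from the clicked href with a standard URL parser:

```python
from urllib.parse import urlparse

# Hypothetical clicked external link; the URL and every value below are
# made up purely for illustration.
href = "https://archive.org/details/somebook?q=test"
parts = urlparse(href)

# Sketch of a single event record using the field names listed above.
event = {
    "timestamp": "2018-06-14T12:34:56Z",
    "revisionID": 123456789,              # made-up revision of the page
    "pageId": 42,                         # made-up page ID
    "pageNamespace": 0,                   # 0 = main/article namespace
    "elinkProtocol": parts.scheme,        # "https"
    "elinkDomain": parts.netloc,          # "archive.org"
    "elinkPath": parts.path,              # "/details/somebook"
    "elinkQuery": parts.query,            # "q=test"
    "freelyAccessible": False,
    "elinkOccurrence": 1,
    "totalElinks": 20,                    # made-up count for the page
    "pageQualities": ["Medicine:B-Class"],
}
print(event["elinkDomain"])  # archive.org
```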
Data Collection
We are in the early stages of understanding the data -- this means the data collection plans are a work in progress, and we will iterate on them. As of June 14, 2018, here is what we know about the data collection steps:
- We will collect data for a few days, sampling 1 to 15% of traffic, depending on the sparsity of entries (we do not know the frequency of citation usage, so we may have to change this plan based on the initial validation steps)
- After this period, we will check the data quality. Once that’s verified, we intend to do data collection at 100% sampling rate for a period of one week.
- The schema for the data we will collect is here: Schema:CitationUsage. The schema itself does not store the IP address.
- While the schema does not include the IP, the clientIP is collected automatically by the EventLogging capsule, and that information gets purged (dropped) every 90 days; see Data retention and purging.
- Initially, we intend to purge all data at 90-day time intervals until we get a better sense of what kind of signal we can get from this kind of data.
- We won't collect data from logged-in users.
- Could you please translate that garbage into English, for the benefit of native English speakers like me? Just one example:
- "We intend to purge all data at 90-day time intervals until we get a better sense of what kind of signal we can get from this kind of data."
- Gobbledegook. Narky Blert (talk) 21:58, 21 June 2018 (UTC)
- IMAO, this question and project violate Wikipedia:WP:AGF. Narky Blert (talk) 22:25, 21 June 2018 (UTC)
- Nonsense, the WMF has this data, and if we trust the WMF to delete it, we can trust their research projects to delete it. I realize that in the wake of the Cambridge Analytica scandal, any collection of data may be seen as controversial, but that is not a reason to stop a project aimed at producing usable and actionable information for Wikipedians. As I understand it, the end-goal here is to help Wikipedia, and if anyone is in violation of "assume good faith", it is you, Narky Blert. There is no need to be rude and call this terse proposal "Gobbledegook" or "garbage".
- If the use of IP addresses is what concerns you, it might be appropriate to request that the research refrain from using IPs at all, but that will limit its utility, because it will, for example, not be possible to track whether hospital or university IP ranges more often click through to sources, and if so, which sources.
- To be on the safe side, it may be appropriate to run this through, or link to, review by an ethics board. Is there any link to a Stanford ethics review? And where is the original research proposal? The project currently reads as "in progress", not proposed. It strikes me as fanciful to suggest this is the only information there is on the project. I assume it is not the same project as: Research:Citation_Click_Data CFCF 💌 📧 05:19, 22 June 2018 (UTC)
- @Narky Blert: Your feedback is welcome as long as you keep your comments constructive. If you have a way to more constructively reformulate your question, I'd be happy to respond. --LZia (WMF) (talk) 20:26, 29 June 2018 (UTC)
- @CFCF: Thank you for your comments. I'm responding as Miriam (WMF) has likely signed off for the day and there is the weekend ahead of us. The proposal for this research is really the content that you see in this page. We put the status as in-progress since we worked on a Google document to arrive at a proposal that all researchers agree on and then moved that to meta. There is no other document I'm aware of (beyond notes from meetings that have happened around this research, and emails) that we can share here. The proposals for Formal Collaborations tend to be at this length and level of detail in many cases. Re your point about a Stanford ethics review: This is a research direction introduced by the Wikimedia Foundation as part of our effort to create a body of actionable understanding about knowledge integrity on Wikimedia projects (one of our upcoming annual plan programs is in this direction, too). For projects where WMF is the initiator of the research direction and we work with external formal collaborators, we do not rely on external entities to approve the process. Instead, we review such proposals within the Research team and ask for input from other teams as needed (in this case, Analytics and Security). Does this address your question? --LZia (WMF) (talk) 20:26, 29 June 2018 (UTC)
- As a native English speaker, I consider "We intend to purge all data at 90-day time intervals until we get a better sense of what kind of signal we can get from this kind of data." to be total BS. WTF is that supposed to mean? English, it is not. Narky Blert (talk) 22:11, 29 June 2018 (UTC)
- "We are going to delete the data every 90 days, until we can see what sorts of useful information we can find out from this kind of data". I am quite sure that the terminology was not chosen to be deliberately confusing. (@The authors of the page: "Signal", in the statistics sense, is apparently not a commonly-known term outside of certain circles. Good to know for future reference.) --Yair rand (talk) (author of Reference Tooltips) 18:04, 2 July 2018 (UTC)
Archive.org
Hi - Would it be possible, in the second round, to differentiate between https://web.archive.org and https://archive.org? The former is for web archiving, while the latter is a library of scanned books, movies, audio, etc. It's a common source of confusion that archive.org = web archives, when in fact they have repositories of source content that rival Google Books and Commons in size. It would help to see these differentiated since we have so many links to both. The book collection is actually larger than Google Books in terms of full-text PD books; it is the largest collection in existence, most of it scanned by archive.org themselves. -- GreenC (talk) 20:45, 3 September 2018 (UTC)
- Thanks for your valuable suggestion, we will make sure we avoid this confusion in the second round! Miriam (WMF) (talk) 11:05, 4 September 2018 (UTC)
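A tiny sketch of the requested differentiation, assuming the classification is done on the link's hostname (the function name is hypothetical):

```python
from urllib.parse import urlparse

def classify_archive_link(url):
    """Distinguish Wayback Machine links (web.archive.org) from links to
    the Internet Archive's library of scanned books and media (archive.org).
    Illustrative sketch only."""
    host = urlparse(url).hostname or ""
    if host == "web.archive.org":
        return "wayback"    # web archiving
    if host == "archive.org" or host.endswith(".archive.org"):
        return "library"    # scanned books, movies, audio, ...
    return "other"

print(classify_archive_link("https://web.archive.org/web/2018/http://example.com"))  # wayback
print(classify_archive_link("https://archive.org/details/somebook"))                 # library
```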
Percent of pageviews on articles with zero references
Hi @Miriam (WMF):, thanks for this thorough and important research! I'm trying to use your results to verify some operational metrics, and I realized that I can't reconstruct an important detail. In the first round of analysis, I see that roughly 50% of 2M pages viewed have zero references, and in the second round, 25% of 5.4M pages viewed have zero references. This suggests that pageviews follow something like a power law distribution, and pages with more references are viewed more often. What I'm trying to estimate is what proportion of pageviews are for a page with zero references. Is this data still available somewhere that I could access it, for example files on the Wikimedia analytics cluster? Regards, Adamw (talk) 09:27, 21 November 2019 (UTC)
- Hi @Adamw: thank you for your interest in our research! I'll check with @Tizianopiccardi: and get back to you. We should have the list of pages with zero references on the analytics cluster!
- Please don't worry about this request any longer, I was able to find an answer to my questions. I'm interested in running some of your study's metrics across all wikis, I'll follow up here when I have something to show. Thanks again, Adamw (talk) 14:23, 2 January 2020 (UTC)
Question about number of references per page
I see that the first-round analysis used the regex /<ref[^>]*[^\/]>|<ref[ ]*>/ to count the "<ref>" tags in article wikitext, and got curious about what this matches. Running on the Anarchism article, this regex finds 4 references, when there are actually 158. This is due to heavy use of the "{{sfnm}}" and "{{sfn}}" templates. Is there an updated version of this algorithm, which runs against rendered HTML for example? There's a hint in the second-round analysis that the HTML was scanned rather than the wikitext for that round.
Another question is about what we should be measuring. The regex above will find all ref tags with content, which (ignoring templates) equals the total number of unique references defined in the article. Maybe we also need to know the total number of footnote markers displayed, if the idea is to approximately control for reference density? I think this would almost always be higher because it includes ref reuses.
Here is a proof-of-concept for what I mean by scanning the HTML for the actually rendered footnotes. Adamw (talk) 12:34, 3 January 2020 (UTC)
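To make the mismatch concrete, here is the quoted regex run on a small made-up wikitext snippet; it counts <ref> tags with content (and skips self-closing reuses) but cannot see {{sfn}}-style template citations:

```python
import re

# The first-round regex quoted above: matches <ref ...> opening tags with
# content, but skips self-closing <ref ... /> reuses.
REF_RE = re.compile(r'<ref[^>]*[^/]>|<ref[ ]*>')

# Made-up wikitext: two full <ref> definitions, one self-closing reuse,
# and one {{sfn}} template that the regex cannot see.
wikitext = (
    'Claim one.<ref>Smith 2000, p. 1</ref> '
    'Claim two.<ref name="smith">Smith 2000, p. 2</ref> '
    'Reused.<ref name="smith"/> '
    'Templated.{{sfn|Smith|2000|p=3}}'
)

print(len(REF_RE.findall(wikitext)))  # 2 -- the {{sfn}} citation is missed
```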
- @Miriam (WMF): (I forgot to ping on the post above.) Adamw (talk) 08:28, 31 January 2020 (UTC)
- @Adamw: Tiziano is looking at your question right now and will get back to you, sorry for the delay!
- @Adamw: Hi Adam, yes we realized that wikitext is not reliable for getting all the references. We collected the HTML, parsed it (BeautifulSoup in our case), and selected all the li tags that are children of ol.references (where "references" is the class of the "ol" tag). For the references in the text, we used the selector sup.reference > a --Tizianopiccardi (talk) 13:41, 12 February 2020 (UTC)
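A minimal sketch of the approach Tiziano describes, using BeautifulSoup on a made-up fragment shaped like MediaWiki's rendered reference markup (the exact li selector is my guess from his description; sup.reference > a is quoted verbatim):

```python
from bs4 import BeautifulSoup

# Illustrative HTML fragment shaped like MediaWiki's rendered output:
# footnote markers are <sup class="reference"><a>...</a></sup> in the body,
# and the reference list is <ol class="references"> with one <li> per item.
html = """
<p>Claim one.<sup class="reference"><a href="#cite_note-1">[1]</a></sup>
Claim two.<sup class="reference"><a href="#cite_note-1">[1]</a></sup></p>
<ol class="references">
  <li id="cite_note-1">Smith 2000.</li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

references = soup.select("ol.references > li")  # unique references defined
in_text = soup.select("sup.reference > a")      # footnote markers in the text

print(len(references), len(in_text))  # 1 2
```

Note that the two counts differ exactly when references are reused, which is the distinction raised in the question above.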
Open data
Is any of the data here going to be shared? I'd be interested in the number of clicks per domain, mostly.
It would also be interesting to repeat the data collection after some significant events: for instance, we've just added doi-access=free to over 100k citations, which in theory should increase the click rate if users understand what the locks mean (but so far there's no data to support the hypothesis that they do). Nemo 15:59, 22 April 2020 (UTC)
- Thanks for your feedback! We cannot release the traffic data for privacy reasons, but we have recently published a paper with aggregated results. Would you be interested in the number of clicks for all domains, or just the top X? We have plots with raw counts for the top domains, but happy to look into whether we can release more detailed data. Thanks! Miriam (WMF) (talk) 10:58, 28 April 2020 (UTC)
- Thanks, I've read the paper after writing here as Leila kindly reminded me about it.
- I would be interested in the number of clicks for certain not so popular domains, so having the top 10k or something would be pretty interesting. The long tail must be very long, but if the top domain has 400k clicks the lesser ones must have very few and under a certain threshold it starts getting potentially problematic for privacy too.
- As for the raw data, I was mostly interested in understanding better how the correlations for table 1 were obtained, specifically how the text for a reference and the connected sentence were extracted and how the results would change if different methods were used (for instance considering certain HTML attributes). Maybe this is visible in the previously published Python notebooks, but I didn't find it in my first cursory check; I'll try to look better later. Nemo 16:34, 29 April 2020 (UTC)
Project ended?
Seems published: https://doi.org/10.1145/3366423.3380300
@Miriam (WMF): Is this project ended? It is not closed out here on Meta-Wiki project page. Bluerasberry (talk) 10:02, 17 October 2022 (UTC)