Talk:CopyPatrol/Archives/2023

From Meta, a Wikimedia project coordination wiki

Feedback requested from CopyPatrol users

Hello CopyPatrol users! Turnitin, the company behind iThenticate which powers CopyPatrol, has asked us to collect feedback from our users. This is to help build our partnership with them and ensure the long-term stability of CopyPatrol, so your input is very much appreciated! The questions may seem a bit broad, but if you are able to elaborate on any of them please do. For all intents and purposes, "iThenticate" in this context can be viewed as CopyPatrol, since the reports you see surfaced there come from iThenticate. Some of you use the "iThenticate report" link as well; if you do, please describe your workflow in Q3. The questions are as follows:

  1. How does iThenticate help you in your work of keeping Wikipedia plagiarism-free?
  2. How would you describe the main benefit of using iThenticate? (e.g. report accuracy? Time saving?)
  3. What do you do when you identify text similarities in the article you are reviewing? Could you please describe the process of working with the detected text matches?
  4. How does iThenticate help you prevent copyright violations?

Thank you for taking the time to answer, and also your time and energy spent helping keep Wikipedia clean of copyright violations! I am pinging a few of our English-speaking power users, but anyone should feel free to respond: @Diannaa, @DanCherek, @L3X1, @Sphilbrick, @Ymblanter. Warm regards, MusikAnimal (WMF) (talk) 20:25, 24 August 2022 (UTC)

Responses from DanCherek
Hi MusikAnimal (WMF), thanks for reaching out (and for your assistance with keeping CopyPatrol up and running)! Here are my thoughts:
  1. How does iThenticate help you in your work of keeping Wikipedia plagiarism-free? Having the ability to automatically scan large edits for copyright violations is incredibly helpful. There is a relatively small group of editors working on copyright cleanup, and with the hundreds of thousands of edits that are made every day, there's no way that they could all be manually scanned for copyright issues, let alone accurately identifying the sources of copied text. iThenticate makes this task much more feasible by identifying potential violations and flagging them for human review, distilling this enormous (and important) task into a much more manageable one. By having this system where edits are automatically reviewed, we're able to detect and deal with copyright issues even on articles that aren't actively being watched by other editors, and quickly handle them as they come up (making removal easier, compared to cases in which a copyright violation isn't discovered until years after the fact).
  2. How would you describe the main benefit of using iThenticate? (e.g. report accuracy? Time saving?) There are a few features that I think are particularly valuable. One is that it is really good at matching text to sources that may not otherwise be readily accessible or findable. Paywalled sources (such as journal articles), or historical versions of websites that have since been modified or taken down, are frequent sources of copying that we see on Wikipedia, but you wouldn't necessarily be able to find these matches from a Google search. So the ability to have a better sense of where some copied text comes from is really helpful. Another feature that I like is the way that the overlapping text is highlighted in the iThenticate interface, including the use of different colors for different sources. It makes it easy to tell, from a glance, which specific phrases in an edit were clearly copied from somewhere, and which parts of the edit might warrant further investigation.
  3. What do you do when you identify text similarities in the article you are reviewing? Could you please describe the process of working with the detected text matches? For each article that appears in CopyPatrol, I will open two new tabs initially: the diff of the edit, and the iThenticate report. I take a look at the iThenticate report to try to get a sense of what we're dealing with. For example, if the potential sources are all Wikipedia or Wikipedia mirrors, it may be a case of copying within Wikipedia, in which case I would look at the actual edit and see if that's the case, and whether proper attribution was given. If it looks like text was copied from a copyright source that is not compatibly licensed, I will look at the potential matches, try to identify the actual source copied from, and then remove the text from the article. Because we're dealing with recent edits (typically made in the past day or two), we don't have as many issues with reverse copying as we would if we were investigating edits from, say, a decade ago. I will also look at other recent edits to the article, particularly if the editor in question has made a series of edits, to see if there are other copyright issues that weren't flagged by CopyPatrol, and I will also look at edits that the user has made to other articles. Often when there is a fundamental misunderstanding of the copyright policy, the copyright issues are not confined to a single article. I mentioned above that I found the highlighting of overlapping text useful. If an edit clearly looks like it's been copied from somewhere, but the iThenticate-identified sources are all offline or not promising, I use the highlighted text to determine which phrases to enter into Google as I look a little harder for the original source.
  4. How does iThenticate help you prevent copyright violations? Besides the slight overlap with Q1, I think this question really gets at one of the most important things about CopyPatrol, which is that it helps us identify copyright issues hopefully early on in a Wikipedia user's editing history and to educate them about the copyright policy before they make too many edits. Wikipedia's contributor copyright investigation (CCI) project is severely backlogged with cases in which copyright issues were discovered after someone had already made tens or hundreds of thousands of edits. That still happens, but hopefully being able to communicate with the editor earlier on can create a better situation for everyone involved. It also lets us create a paper trail in case the copyright issues persist and further action is needed, and it can help inform when a new CCI may be needed.
Hope this helps. Let me know if I should elaborate on anything else. DanCherek (talk) 23:38, 24 August 2022 (UTC)
Responses from Diannaa
  1. Wikipedia has grown to the point where we receive thousands of edits every hour, which makes it impossible to monitor recent changes without automated tools. The assistance of iThenticate is invaluable to us because it provides a vital and reliable service that can check for copyright problems without the need for our volunteers to be involved in maintenance of the service.
  2. There are some huge benefits. Our previous detection system, CorenSearchBot, checked only new page creations. The iThenticate service checks all additions over a certain size, and thus provides a lot more coverage. The reports are added to our queue almost immediately and we clear all the open reports within 24 to 36 hours, which means people who add copyright material are notified quickly as to what they did wrong and what our expectations are. Quickly notifying people of problems means that we have fewer new editors who think it's okay to add copyright content, and means cleanup is less onerous in the long run. Unlike social media and other sites where people can contribute content, we take copyright very seriously, as failure to do so would have a negative impact on our efforts to be taken seriously as a valid scholarly resource. And the Turnitin system can see behind many paywalls so that we can assess and remove content that we could otherwise not even detect. I am pretty sure CorenSearchBot was not sophisticated enough to do that. CorenSearchBot was retired in June 2016, when we got our CopyPatrol interface perfected.
  3. Assessing reports: First I look at what type of article has been flagged (biography, places, science, or current events for example) as each has specific types of common issues. Next, I look at the url that iThenticate has flagged, and if it's a journal article, I will immediately click on the iThenticate link, because the Turnitin system can see behind many paywalls to view content that would otherwise be inaccessible without a subscription. I check to see if the source webpage is compatibly licensed, and if it's not, I remove the copyright content from the Wikipedia article. Sometimes it becomes obvious that the entire article needs checking, or the editor's entire edit history needs checking. So one iThenticate report can expand into a larger cleanup effort! Then I perform revision deletion if appropriate and notify the editor with either a template or a hand-written note.
  4. The way iThenticate helps prevent problems is through the opportunity to educate users as to our expectations. An editor (whether a newcomer or a veteran) is a lot less likely to add copyright material to Wikipedia if they know there's an automated detection service in place. Diannaa (talk) 14:15, 26 August 2022 (UTC)
Responses from Sennecaster
  1. I am usually busy in other parts of copyright cleanup, but when I do go on CopyPatrol, and from what I see at the "second line" of defense at en:Wikipedia:Copyright problems, it really decreases the amount of manual reviewing and source hunting that we have to do. I find that iThenticate helps filter down what we should look at too, and it can crawl behind paywalls that we as volunteers sometimes cannot. Earwig's copyvio detector also has an iThenticate option, but it rarely works there unfortunately--when it does, I sometimes find issues that were not previously exposed with the other options.
  2. iThenticate is pretty accurate! Sometimes it bugs out and won't let me preview the comparison, but it gives me a source, instead of making me hunt down one. I find that dealing with CopyPatrol reports is extremely fast, even on some of the more tricky ones, so while it may take me a few days to go through one set of Copyright problems listings, I can completely handle a similar amount of CopyPatrol reports (outside of admin tool stuff like revision deletion) in a quicker fashion.
  3. I check the diff and article history first to see if the edit was already reverted by a Recent Changes patroller or if more content was added that also needs to be checked. I then look at the source url, and if I can't discern whether or not it could be a special case, I open the source. In most cases I remove the violation, or attribute it if it was copied within Wikipedia, and then request revision deletion if necessary. I then warn the person. I don't find myself using the iThenticate report that much, since the sources that CopyPatrol flags do not appear in the report when I check.
  4. I think that iThenticate itself doesn't prevent copyright violations, but rather gives us the means to help new people who (understandably so) do not understand copyright on Wikipedia understand before it becomes a problem for everyone. We can find them and give them the guidance needed, kind of like recent changes patrollers can help identify new people who may not know of a certain policy but are here in good faith. It also helps us, at times, find users and pages who have serious issues, and need a referral to our other processes, like Copyright problems or Contributor Copyright Investigations. It gives us ways to prevent long-term copyright violations, but I'm not so sure that it does anything to prevent the total amount of reports we handle or will see at our other processes. Sennecaster (talk) 22:43, 26 August 2022 (UTC)
Responses from Moneytrees

The above responses have basically covered anything of substance I would say, so I will provide briefer answers:

  1. The iThenticate reports are able to detect close paraphrasing better than other community tools and have access to a wider array of sources that are otherwise difficult to verify copying from. The reports are probably the most invaluable tool there is when it comes to patrolling for plagiarizing edits.
  2. The main benefits I find are the access to difficult to access sources and accuracy. Given the availability and price of some sources that are copied from, copyright violations can stay in articles for several years before being removed. The iThenticate reports help prevent this by showing comparisons to these sources.
  3. DanCherek has summarized the process the majority of reviewers go through; I have nothing to really add.
  4. It has helped us become much better at catching copyright violations early in editors' careers, helping prevent future ones, and has helped keep track of editors who have repeatedly violated copyright. Moneytrees (talk) 23:01, 3 September 2022 (UTC)

I don't have much to add to the above - except that wider coverage is still needed. False negatives due to unavailable sources are still too frequent. MER-C 18:04, 13 September 2022 (UTC)

Response from L3X1

Without the tools made by iThenticate it would be functionally impossible for me to do anything about plagiarism. Having a program that detects possible violations and formats them in a queue that I can easily interact with, and that delivers the information I need right at my fingertips, is irreplaceable. enL3X1 ¡‹delayed reaction›¡ 22:16, 18 September 2022 (UTC)

@MER-C, Moneytrees, Sennecaster, DanCherek, Diannaa, and L3X1: A belated but sincere THANK YOU for your well-articulated and thorough replies! :) I realized I failed to mention that this was for a case study. The hope is to publish a blog post (authored by Turnitin) on Wikimedia Diff. We saw the draft today and they are using direct quotes from some of you and linking to your user page. I wanted to make sure you were okay with this? I assume so since your words are already in the public eye here. I'm not sure when the post will go live but I will certainly let you know. Thanks for helping us build up our partnership with Turnitin! Best, MusikAnimal (WMF) (talk) 02:09, 2 December 2022 (UTC)
No issues for me. Thanks, DanCherek (talk) 02:19, 2 December 2022 (UTC)
I am okay with this and would be pleased to see the resulting blog post. Diannaa (talk) 03:21, 2 December 2022 (UTC)
Fine with me :) Sennecaster (talk) 12:55, 2 December 2022 (UTC)
That's cool, I am fine if my words are used. Moneytrees (talk) 20:16, 4 December 2022 (UTC)
Yes, I am fine with being quoted and/or linked to. Thanks for reaching out. enL3X1 ¡‹delayed reaction›¡ 20:44, 5 December 2022 (UTC)
@Moneytrees, DanCherek, Diannaa, and Sennecaster: The blog post was published and I guess I wasn't notified, but anyway here it is should you want to read it. The rest of you I didn't ping were not mentioned. Thanks again to all of you for your feedback and participation in this PR push! Thanks to you, we should soon hopefully have enough credits secured for CopyPatrol to last for many more years. Warm regards, MusikAnimal (WMF) (talk) 02:13, 18 January 2023 (UTC)

Diannaa stepping back, new features

@MusikAnimal (WMF) (Also ping @DanCherek, MER-C, and Diannaa-- please add anything here that you think might also be useful) If you haven't seen, Diannaa is going to be doing less work at copypatrol moving forward. This is as good a time as any to address some long running issues around how work is structured. It's not efficient, healthy, or fair for two or three people to be doing the lion's share of the work. We need to make patrolling and dealing with copyright violations like recent change patrolling, in that the majority of editors have a baseline knowledge of what to do. We need to update the processes around dealing with copyright violations to account for this, and I have two ideas in particular:

  • We should have a feature that allows you to search through all the times a specific editor has been flagged at copypatrol.
  • We should have a feature that allows you to look through all the reviews someone has done.

If for whatever reason these cannot be used by the general editing group at copypatrol, would it be possible to add an "admin" role at copypatrol that could do this? If this isn't the correct venue to request these features, what would be? Thank you, Moneytrees (talk) 21:48, 28 January 2023 (UTC)

Hi Moneytrees, I never got a ping notification for this discussion, and noticed only by chance. You might like to notify the others of its existence via some other method in case they never got pinged either. Thanks. Diannaa (talk) 15:11, 2 February 2023 (UTC)
Hey @Moneytrees! I didn't get this ping either, but I did get your message on my talk page (at the time I was on holiday). I'm sad to hear the mighty @Diannaa will be taking a break! She indeed has done the heavy lifting for some years now.
Myself and CommTech are happy to look into streamlining CopyPatrol however you think it will help. In my opinion though, the main issue is lack of enough interested patrollers. I think running some sort of campaign to get more folks involved on enwiki is probably going to yield the best results.
Now, looking at your two specific suggestions:
  • a feature that allows you to search through all the times a specific editor has been flagged at copypatrol
    Partially doable, and would be very slow. We don't store any data about the editor in the CopyPatrol database, only the revision ID. While we can do a query to find all revisions by a given editor that exist in CopyPatrol, this wouldn't work if the revisions no longer exist. I.e. see the reviewed cases and you'll notice that now-deleted pages don't have editor info (example). We could change CopyPatrol to start storing user data, but this would be expensive for the benefit it provides, I'm afraid.
  • a feature that allows you to look through all the reviews someone has done
    This is doable and quite easily at that! If you can file a Phabricator task with the CopyPatrol tag, we'll get it triaged in the next meeting. Or I can write a task when I find the time.
Best, MusikAnimal (WMF) (talk) 21:55, 28 February 2023 (UTC)
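The limitation described in the reply above (only revision IDs are stored, so editor lookups fail once a revision is gone) can be illustrated with a small sketch. The table and column names here are hypothetical and do not reflect the real CopyPatrol or MediaWiki schemas; an in-memory SQLite database stands in for both.

```python
import sqlite3

# Hypothetical, simplified schema: the CopyPatrol side stores only revision IDs,
# while editor names live in the wiki's own revision table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE copypatrol_cases (case_id INTEGER PRIMARY KEY, rev_id INTEGER);
    CREATE TABLE wiki_revisions (rev_id INTEGER PRIMARY KEY, editor TEXT);
    INSERT INTO copypatrol_cases VALUES (1, 1001), (2, 1002), (3, 1003);
    -- Revision 1003 belonged to a page that has since been deleted,
    -- so it no longer has a row in the revision table.
    INSERT INTO wiki_revisions VALUES (1001, 'ExampleUser'), (1002, 'OtherUser');
""")


def cases_for_editor(editor):
    """Find cases by joining stored revision IDs against the revision table."""
    rows = conn.execute(
        """SELECT c.case_id FROM copypatrol_cases c
           JOIN wiki_revisions r ON r.rev_id = c.rev_id
           WHERE r.editor = ? ORDER BY c.case_id""",
        (editor,),
    ).fetchall()
    return [row[0] for row in rows]


def cases_without_editor_info():
    """Cases whose revision no longer exists, so no editor can be resolved."""
    rows = conn.execute(
        """SELECT c.case_id FROM copypatrol_cases c
           LEFT JOIN wiki_revisions r ON r.rev_id = c.rev_id
           WHERE r.rev_id IS NULL ORDER BY c.case_id"""
    ).fetchall()
    return [row[0] for row in rows]
```

In this toy data, case 3 can never be attributed to an editor by a join alone, which is the "partially doable" part of the answer.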
@MusikAnimal (WMF) I'm planning on writing a sort of "guide to copypatrol" and some increased community activity for when I have the time. I've created a task on Phab related to the searching reviews feature, let me know if I did it wrong. Moneytrees (talk) 04:00, 5 March 2023 (UTC)
Hi, @MusikAnimal (WMF)! I saw the task open up at Phabricator and wanted to take a stab at it since I had free time. I forgot to read this thread and didn't notice that you had plans to get it done. The PR can be found here; please feel free to close if it's an overreach. Chlod (say hi!) 06:35, 5 March 2023 (UTC)
Not an overreach at all! We had not started working on this. Thank you very much for creating a PR :) I'll get to reviewing it soon. MusikAnimal (WMF) (talk) 18:52, 6 March 2023 (UTC)

Please exclude known Wikipedia mirror-sites from the result

When I see a page in new page feed, it tells me copyvio is detected on a certain page (details here). However, when I looked into these details, I discovered that this page is actually a copy from Wikipedia (aka mirror site). Then there comes the logic below: "copying" texts from a website that mirrors Wikipedia (exact match, not even "close paraphrasing") is a "copyvio". What a droste effect! -- U.T. 02:19, 23 March 2023 (UTC)

Notifications

Hello @MusikAnimal (WMF), thanks for this helpful tool. I suggest adding "Notifications" with the number of "open cases" on the relevant wiki. This will help editors to solve these cases. I put the link to the tool on the "Admin-Tasks" template on Arabic Wikipedia. If a bot adds the number of "open cases" for Arabic Wikipedia to that template, in the form "(number)", like (5) for example, this will prompt admins and editors to open the tool and solve these cases. The bot can update that number from time to time (four times a day or more). Thank you. Dr-Taher (talk) 22:22, 3 April 2023 (UTC)

Hello @MusikAnimal (WMF). Now, I check this page every day to see if there are new cases to review. On many days there are no cases to review, which wastes our time, and then I may forget about this tool. But if I receive a notification of new cases, it will be much better. Dr-Taher (talk) 14:11, 5 April 2023 (UTC)
@Dr-Taher That's a fantastic idea! I have filed a task at phab:T334264. I hope we can get this implemented soon and when we do, I will let you know :) Best, MusikAnimal (WMF) (talk) 23:43, 6 April 2023 (UTC)
Hello @MusikAnimal (WMF), more than 30 days have passed, and no action has been taken! Dr-Taher (talk) 05:57, 9 May 2023 (UTC)
Our colleague (@لوقا:) used his Bot to find the number of open cases, and we will use it to notify Admins and Active Editors to solve these cases. If this can help in solving the ticket, you can contact him for help. Dr-Taher (talk) 06:27, 9 May 2023 (UTC)
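As a rough illustration of the "(number)" idea in this thread, the following sketch updates a count marker in a piece of template text. It is not the bot mentioned above; the function name and the template text are hypothetical, and how the open-case count is obtained is deliberately left out because the thread does not describe it.

```python
import re

# Matches a parenthesised number such as "(5)".
COUNT_MARKER = re.compile(r"\(\d+\)")


def update_open_case_count(template_text, open_cases):
    """Replace the first "(number)" marker in the text, or append one if absent."""
    marker = f"({open_cases})"
    if COUNT_MARKER.search(template_text):
        return COUNT_MARKER.sub(marker, template_text, count=1)
    return f"{template_text} {marker}"
```

A bot could run something like this a few times a day and save the result only when the text has actually changed.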

"there is an error connection to database"

I get this popup whenever I try to mark something as reviewed or no action needed. enL3X1 ¡‹delayed reaction›¡ 21:53, 6 April 2023 (UTC)

@L3X1 I believe this is now fixed. Sorry about that! This was due to some change that happened as part of phab:T333471. MusikAnimal (WMF) (talk) 23:11, 6 April 2023 (UTC)
Thanks! enL3X1 ¡‹delayed reaction›¡ 12:49, 7 April 2023 (UTC)

No new cases in WP:es

There are no new cases in Spanish Wikipedia since 2023-04-02. Thanks in advance. LMLM (talk) 08:02, 14 April 2023 (UTC)

@LMLM Apologies for the late reply. It did seem the eswiki job was "stuck", so I restarted it and it seems to be functioning properly again. There are no new reports as of the time of writing but I'm guessing you'll see some soon. Best, MusikAnimal (WMF) (talk) 05:24, 20 April 2023 (UTC)
@MusikAnimal (WMF) Thank you so much. Now it is working fine. Best regards, LMLM (talk) 08:29, 20 April 2023 (UTC)

I'm reading this now and I think it's a bit ambiguous as to whether tagging a page for revision deletion should be marked. I've always understood it to be the case that adding a revdel tag would be sufficient to mark the page as fixed, and I don't see much of a reason to draw a distinction between adding a revdel tag and adding a G12 tag for purposes of determining whether a case is "open" or not.

If we could change the text from If you fixed the problem or tagged the page for deletion as a copyright violation, mark it as "Page fixed" to If you fixed the problem, tagged the page for revision deletion, or tagged the page for deletion as a copyright violation, mark it as "Page fixed", I think this would clarify the language and make it more in line with common practice. Are there any objections?

CC: Tails Wx, Diannaa, Sphilbrick, and DanCherek, who may be interested. — Red-tailed hawk (nest) 17:44, 1 September 2023 (UTC)

Yeah, I think that's okay, since just stating that tagging a page for deletion as a copyright violation isn't going to clarify about deleting revisions. Tails Wx (talk) 18:04, 1 September 2023 (UTC)
I agree; the instructions should say that pages tagged for revision deletion count as well. Diannaa (talk) 19:24, 1 September 2023 (UTC)
Sure, and I would suggest also modifying the next part about adding pages to watchlist (temporarily) to include pages tagged for revdel as well, in case the tag is incorrectly removed before an administrator reviews it. DanCherek (talk) 23:34, 1 September 2023 (UTC)
Seems like we have consensus. I'll modify the instructions thusly. — Red-tailed hawk (nest) 01:29, 2 September 2023 (UTC)
Late to the table, but in agreement with all points. Sphilbrick (talk) 12:27, 4 September 2023 (UTC)

RevDel'd diffs get marked No Action Needed?

ran across 3 in a row (https://copypatrol.toolforge.org/en/?id=102542253 https://copypatrol.toolforge.org/en/?id=102542234 https://copypatrol.toolforge.org/en/?id=102542204) and marked them as no action needed. Is there a way to make them not show up in copy patrol? enL3X1 ¡‹delayed reaction›¡ 01:45, 28 September 2023 (UTC)

I am not involved in the development of this tool, but I've occasionally noticed something in the same vein: the tool identifies a problem which has been addressed by some editor other than those involved in reviewing the reports, so has not been identified as fixed. The thought has crossed my mind that it would be nice to know about this but I think it might be challenging to do so. Having said that, it's my experience that it's helpful to identify the goal, because sometimes what sounds like an intractable problem has a reasonable fix.
I'll start with my summary of why it's a problem, using Ruth Yeazell as an example:
  • at 00:41 28 Sep an edit was made to the article adding some copyrighted text
  • At some unknown time shortly after, the edit was examined by Copy Patrol and identified as a potential copyright violation.
  • Almost immediately thereafter the report was added to the database
  • at 00:42 Gobonobo reverted the edit. (I'm guessing this edit occurred before the report was added to the copy patrol logs but I'm not sure that it matters)
  • at 01:33 zzuuzz performed the revision deletion. (I'm speculating but it seems likely this action occurred after the report was added to the database)
If we want the database to reflect the fact that the material in question has subsequently been removed, this means that the tool has to constantly revisit the article and examine any edits subsequent to the identified edit. I presume that's physically possible, but by definition it's not an action that can take place at the time the original report is filed, unless review of potential offending edits occurs well after creation. It would also mean a different type of examination. I presume that currently the contents of an edit are examined and compared against a database of existing material, but examining subsequent edits might require looking at edit summaries or indicators that the material is revision deleted. That sounds possible, but it sounds like a very different action from the one undertaken to identify potential violations.
Note in this particular example there are two edits that could trigger the removal. There is the edit by Gobonobo which reverted the edit in question and then the later edit by zzuuzz to do the revision deletion. If we were to push for a change of this sort, should it be restricted to revision deletions, or should it also pick up ordinary edits removing material? Should the subsequent review identifying that the original offending material has now been removed simply remove the entry from the database, or should it trigger an update to the report identifying that it may have been addressed? Sphilbrick (talk) 15:10, 28 September 2023 (UTC)
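The open question in the comment above (should only revision deletions count, or ordinary reverts too?) can be sketched as a small decision function. This is only an illustration of the choice being discussed, not how CopyPatrol works; the class, field names, and status strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FlaggedEdit:
    reverted: bool          # a later edit undid the flagged edit
    revision_deleted: bool  # an administrator hid the flagged revision


def suggested_status(edit, auto_close_on_revert=False):
    """Suggest how a report might be labelled after later activity on the page."""
    if edit.revision_deleted:
        return "already addressed"
    if edit.reverted:
        # Whether a plain revert is enough is exactly the policy question raised above.
        return "already addressed" if auto_close_on_revert else "needs review (reverted)"
    return "needs review"
```

In the Ruth Yeazell example, the report would read "needs review (reverted)" after the 00:42 revert and "already addressed" after the 01:33 revision deletion, assuming the stricter default.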
I will say that the new version (now at https://copypatrol-test.wmcloud.org/ – OAuth login not working yet) will show any tags associated with an edit such as "reverted", so you will have that info upfront.
We can also easily check if an edit has been revdel'd and indicate it as so in the UI. If you'd rather the system automatically remove them, I can make it so, but it sounds like @Sphilbrick is questioning if that's always what we want? I would guess that any subsequent edits that are also copyvios will also show up in the feed, or at least they should. MusikAnimal (WMF) (talk) 17:28, 5 October 2023 (UTC)

No recent reports

The most recent entry is 18 October Sphilbrick (talk) 15:17, 2 November 2023 (UTC)

@MusikAnimal (WMF) and MusikAnimal: Any idea as to why this might have been? I'm seeing reports from today now. — Red-tailed hawk (nest) 15:53, 3 November 2023 (UTC)
There was an iThenticate outage on November 2. That would be why there were fewer reports around then. Beyond that, when viewing "All cases", I'm seeing a normal stream over the past several weeks. MusikAnimal (WMF) (talk) 21:50, 3 November 2023 (UTC)

New backend coming soon

Tracked in Phabricator:
Task T333724

Hello all! I'm here to inform you that the backend (bot) that powers CopyPatrol will soon be updated. I've been working with @JJMC89 on this for quite some time. We now have a demo ready, and are asking you all to see how it fares alongside the legacy feed powered by @EranBot.

You can check out the new feed on our staging instance at toolforge:plagiabot. Feel free to test out saving reviews there for the time being, as it is using a test database, but note the production CopyPatrol should still be tended to as well.

Our main concern is the volume of cases that appear in the new feed versus the old. We worry many of these are false positives, and we may be putting too much burden by cluttering the feed with illegitimate cases.

Other questions, which may affect the number of cases reported by the bot:

  • Should the bot skip reverted edits? We're planning on changing it so that it doesn't, and for CopyPatrol to clearly indicate which edits have already been reverted, and if you are a sysop, we'll provide a link to revision-delete the diff. Do you agree with this approach?
  • The new backend checks replaced text, and not just added text. We hope this surfaces more copyvios, but it may be leading to too many false positives. Let us know if you have any thoughts on this.
  • The current threshold for matching text against a source is 50%. We're wondering if that should be changed at all.
  • Compared to the old feed, the new one surfaces many more sources, including non-internet sources. Some such as this example have over 30 sources. Is this overkill? Maybe we should collapse the sources in the view to say, 10 maximum, or just omit showing them at all? This is with the understanding that sources towards the top will have a higher matching percentage.

Feel free to leave your thoughts on the associated task (phab:T333724), or here in this thread. Pinging a few of our most prolific users: @Diannaa @Moneytrees @Sphilbrick @L3X1 @DanCherek @Ymblanter @Framawiki

Thanks for your feedback! MusikAnimal (WMF) (talk) 21:40, 16 August 2023 (UTC)
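Two of the questions above, the 50% matching threshold and collapsing long source lists to about 10, can be sketched as one small filter. This is an illustration under assumptions: sources are represented as hypothetical (url, percent) pairs, and the function name and defaults are not taken from the actual bot or CopyPatrol code.

```python
def sources_to_display(sources, threshold=50.0, max_shown=10):
    """Keep sources at or above the threshold, best matches first, capped in number.

    `sources` is a list of (url, match_percent) tuples (a hypothetical shape).
    """
    matching = [s for s in sources if s[1] >= threshold]
    matching.sort(key=lambda s: s[1], reverse=True)
    return matching[:max_shown]
```

Raising `threshold` would shrink the feed, while lowering `max_shown` only trims what is displayed for each case; the two settings address different concerns from the list above.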

Hi @MusikAnimal (WMF). The new tool is listing a huge number of cases: 521 cases are listed for August 16, for example, where the original CopyPatrol only listed 108. That's an impossible number of cases for us to complete given the number of patrollers we have that work on this task daily. I can only do about 20 cases per hour tops, and often a lot less. Even with the old version of CopyPatrol, if a key person misses even one day, we have difficulties. So that has to be fixed.
Something I see in the old version that I am not yet seeing in the new version: When I click on the iThenticate link, the old version tells me the date the source was crawled. That can be a helpful clue to help determine if the material was copied from elsewhere on Wikipedia or if it's a true copyvio, so I would like to see it included.
We don't need to see a huge list of possible sources. This is especially true where the edit itself is tiny. Typically a lot of the potential sources are replicating the same material. Here is an example. All the editor did was move some prose from an image caption into the body of the article. If an editor has added a lot of copyvio from multiple sources, it's usually noticeable right away from the page history, and can be checked with Earwig's tool.
I love that you've added the ability to search within the loaded pages on the iThenticate report. That is impossible to do in the original version of CopyPatrol, at least on my setup. That's all for now. Diannaa (talk) 02:15, 17 August 2023 (UTC)
I am getting an error message when I attempt to mark a case as "Page fixed" or "No action needed". It says, 'Something went wrong. Please try again.' Diannaa (talk) 02:47, 17 August 2023 (UTC)
Ah, that's a glitch I must have recently introduced. I'll fix it soon, but for now you can ignore the reviewing process since it's identical to the old one, anyway. MusikAnimal (WMF) (talk) 00:59, 18 August 2023 (UTC)
This should be fixed now. MusikAnimal (WMF) (talk) 01:58, 18 August 2023 (UTC)
Hello. As there are unresolved EranBot listings from 2015 to 2016, I would like to request that all of these listings be restored so we can check whether they were already resolved. Currently, listings before June 20, 2016 are not at CopyPatrol per Phab. Thanks! MrLinkinPark333 (talk) 19:34, 17 August 2023 (UTC)
Hi @MrLinkinPark333! As per the phab task, those old reports are still accessible in the EranBot archives. There is no viable means to import them into CopyPatrol, I'm afraid. MusikAnimal (WMF) (talk) 00:36, 18 August 2023 (UTC)
Okay. Thank you for the update! MrLinkinPark333 (talk) 00:53, 18 August 2023 (UTC)
en:User:EranBot/Copyright/Batches lists all the pages where the postings were made, and the work that I did to clean them up before we initiated the CopyPatrol interface. If you wish to investigate those reports, you could do so from those postings. The iThenticate links no longer work though. But I don't think that's a good use of editor time; old cases are very difficult to solve, and we already have a huge amount of work between CopyPatrol, en:wp:CCI, and en:WP:CP, and very few people willing to do it. Postings from Batch 46 forward would not need to be checked, because they are duplicates of items that were also listed at CopyPatrol and we dealt with them as they happened on a daily basis. I switched over to working the CopyPatrol queue somewhere around June 17, 2016, and don't have time to do any of those old reports in addition to the hours I spend daily on the CopyPatrol queue. Diannaa (talk) 00:42, 18 August 2023 (UTC)
The iThenticate IDs still work; the URL was switched to a new one when CopyPatrol was introduced, so appending them to https://copypatrol.toolforge.org/ithenticate/<ID> works. A lot of the pages are already deleted or the additions are long gone, though all the blacklisted links have to be removed as well. It's still probably worth looking at though. Isochrone (talk) 09:41, 18 August 2023 (UTC)
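A minimal sketch of the URL construction described above, assuming the ID is simply appended to the base path quoted in the comment. The function name `ithenticate_report_url` is hypothetical and the ID in the usage note is a made-up placeholder.

```python
from urllib.parse import quote

# Base path quoted in the comment above.
BASE = "https://copypatrol.toolforge.org/ithenticate/"


def ithenticate_report_url(report_id) -> str:
    """Return the report URL for an old iThenticate ID (hypothetical helper)."""
    # Percent-encode the ID so stray characters cannot alter the path.
    return BASE + quote(str(report_id), safe="")
```

For example, `ithenticate_report_url(12345)` yields the base URL followed by `12345`.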
I am very interested in participating, although I am on a bus in Slovenia at the moment, with a packed schedule, so we will see. Sphilbrick (having issues with login so will post logged out) 188.198.37.7 08:59, 19 August 2023 (UTC)

Mirrors

A new suggestion: We spend an inordinate amount of time repairing unattributed copying within Wikipedia. If some of the more common Wikipedia mirrors could be identified and whitelisted, it would reduce the amount of time we spend on that, which is not as serious a violation as a true copyright violation (copying copyright material from external news sources, books, or elsewhere). There's already a whitelist at User:EranBot/Copyright/Blacklist but some of the ones I frequently see are not listed there: Bookpedia and Handwiki, for example. Diannaa (talk) 15:10, 19 August 2023 (UTC) Adding: It looks like "Wikia" is on Eran's list; but it's now called "Fandom". Should we whitelist that? Diannaa (talk) 15:38, 19 August 2023 (UTC)

Or perhaps pages with a high similarity to existing articles could be marked as such on the UI to quickly identify/filter CWW. As for removing mirrors, the list at en:WP:MIRRORS is quite extensive and machine-friendly.
N.B. are the iThenticate links meant to be broken? Isochrone (talk) 17:52, 19 August 2023 (UTC)
@MusikAnimal (WMF): we are getting an error message when attempting to view iThenticate reports in the new version. 'Oops! An Error Occurred. The server returned a "500 Internal Server Error"'. Diannaa (talk) 09:44, 21 August 2023 (UTC)
@Diannaa Fixed. Sorry about that! If it wasn't obvious, this new version of CopyPatrol is a complete rewrite, so some bugs were expected. We'll get everything fixed before we go "live", though :)
I'll also note that I just got a 500 error from iThenticate itself. I just refreshed and the report loaded fine, so if you run into this you can try the same. If it happens a lot, we'll report it to Turnitin. MusikAnimal (WMF) (talk) 16:29, 21 August 2023 (UTC)
I should have mentioned, the new ignore lists are centralized at User:CopyPatrolBot/UrlIgnoreList and User:CopyPatrolBot/UserIgnoreList. Please feel free to edit them as desired. Before we deploy the new CopyPatrol, we'll ensure all the entries are copied over from the old ignore lists, so don't worry about that. MusikAnimal (WMF) (talk) 16:34, 21 August 2023 (UTC)
I don't have any knowledge of Regex so I won't be able to add any urls myself unfortunately. Diannaa (talk) 16:48, 21 August 2023 (UTC)
Yeah, I was wondering if it would be possible to leverage the recently introduced w:en:Special:BlockedExternalDomains system. Just as with the Spamblacklist, the CopyPatrol URL ignore list almost never truly needs regular expressions, rather just plain URLs. Pinging @Ladsgroup for input. I'm happy to file a ticket for this as well as help code and review this effort, if we don't think it will be terribly hard. So basically we'd like to generalize the UI, something like Special:EditUrlList/Pagename.json. I imagine there are other use cases beyond Spamblacklist and CopyPatrol ignore list. MusikAnimal (WMF) (talk) 17:03, 21 August 2023 (UTC)
Sure thing. I don't think it's too hard to make that happen. Amir (talk) 03:58, 24 August 2023 (UTC)
Bug filed. Thanks, Amir! MusikAnimal (WMF) (talk) 23:51, 29 August 2023 (UTC)
Hi @MusikAnimal (WMF), is User:EranBot/Copyright/Blacklist still used? Because it looks like it is still the one maintained by patrollers. Framawiki (talk) 17:01, 11 December 2023 (UTC)
@Framawiki Yes, until the new version goes live, that's the page to use. A redirect will be left when it is changed. We're still waiting on the final approval from Turnitin before we switch everything over. MusikAnimal (WMF) (talk) 02:17, 19 December 2023 (UTC)
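As an illustration of the point above that the URL ignore list rarely needs regular expressions, here is a rough sketch of plain-domain matching. It is not CopyPatrol's actual implementation: the function name `is_ignored` and the `mirror.example` domain are hypothetical, and a real list would come from the on-wiki ignore list page.

```python
from urllib.parse import urlsplit


def is_ignored(url: str, ignored_domains: set[str]) -> bool:
    """Return True if the URL's host is a listed domain or one of its subdomains."""
    # hostname is None when the URL has no scheme/netloc; treat that as no match.
    host = (urlsplit(url).hostname or "").lower()
    # Exact match, or suffix match on a dot boundary so that
    # "notmirror.example" does not match "mirror.example".
    return any(host == d or host.endswith("." + d) for d in ignored_domains)
```

With a list of plain domains like this, a patroller would only need to add a line such as `mirror.example`, with no regex knowledge required.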

Edit summaries

I just started looking at the new tool. I don't yet have comments on the new tool per se, but since the code is being worked on thought I'd throw out an idea that I would find helpful, and I think it would be pretty easy to implement.

In a nutshell, I propose that the edit summary be posted as part of the information displayed about the identified edit.

I am fully aware that as soon as I click on the diff button, I can easily see the edit summary, so you might be puzzled why I would want it on the case listing page. My rationale is that I have found, through experience, that the edit summary is one of the most important things to look at because it will help define my process. For example, if the edit summary is "rvv", I'm not going to start with the iThenticate report to see if the text matches some other source. I'm going to look at the history to see if the edit summary is accurate and this is a false positive, because the edit reverts to an earlier version and the matching text arises because the earlier version is in some mirror.

In contrast, if the edit summary states "material copied from [some other article], see that article for attribution", my process will be a little different.

"So what", you might be thinking, because I'm always going to click on the diff where I can see the edit summary. The point is that I have different processes depending on the edit summary, and I think it would be more efficient if I could glance at edit summaries and work on similar issues as a group. So, for example, I could glance down the page and look for all of the edit summaries containing "rvv", "revert to earlier version" or something similar, handle all of those, then come back and look for all of the edit summaries indicating a copy from another article, handle all those, and then look for another group of similar articles. Maybe my age is showing, but I don't switch gears as easily as I used to, so I would find it more efficient if I could handle half a dozen reports consecutively where my process is the same, then switch to a different type.

If this only helps me it's not worth implementing, but if someone else finds this potentially useful, I think it's almost a trivial change: copy the edit summary and place it on the report somewhere. (My simple suggestion would be to just drop it below the iThenticate report button, but if there is an easier option, as long as it's always in the same place I'll be happy.)--Sphilbrick (talk) 12:06, 29 August 2023 (UTC)

@Sphilbrick Ask and you shall receive :) In addition to edit summaries, I've also added tags and the edit size. The tags are especially useful I hope, as they will tell you if it's a revert, or if it was reverted. In the latter case, I was thinking of providing a "revdel" link next to the "Diff" link for quick access to the revision delete form. Would that be useful? MusikAnimal (WMF) (talk) 00:15, 30 August 2023 (UTC)
This is great. (Wish you had been at my board meeting last night: a lot of asking, and not a lot of receiving :) Yes, easy access to the Revdel button would be nice. Edit: I just noticed you said "I have" rather than "I will"; very nice, thanks. Sphilbrick (talk) 10:40, 30 August 2023 (UTC)

Review comments

I'm not sure how difficult this is, but perhaps adding a review comment button (i.e. under the resolve options) would be useful, as opposed to more options as previously proposed? I know this is mainly a focus on backend changes and I can file a task on Phab if appropriate, but for some cases it may not be obvious to other "patrollers" what action was taken.
I can make a little mockup if that helps. Thanks for all the work you and the comtech team are doing. Isochrone (talk) 19:52, 30 August 2023 (UTC)

I believe what you're asking for is basically the same as phab:T279083, only more generalized. I was thinking we could allow adding any arbitrary comments, but also have a dropdown of commonly used ones. That list can be configurable by CopyPatrol users.
With the new system this is all much easier to implement, so I will look into it :) MusikAnimal (WMF) (talk) 21:17, 1 September 2023 (UTC)

Pre-filled revision deletion

You mentioned the possibility of a link to the revision deletion template.

This reminds me of something I've always wanted to ask for, but didn't think I could justify setting up a project for this small request. However, if you are actively working on a new version maybe now's the time.

I use a number of the options in Special:RevisionDelete when generally working on RD1 requests, but if I am carrying out a revision deletion in the context of copy patrol work, four of the five choices are identical in close to 100% of the cases. It would be nice and helpful if a customized RD1 template came up when doing copy patrol tasks.

I would preset the template with:

  • Delete revision text Set
  • Delete edit summary Do not change
  • Delete performer's username/IP address Do not change

Reason: <Pre-fill with the RD1 option>

To put it differently, the standard invocation has three visibility restrictions for which the default is "Do not change" for all three. Change the first default from "Do not change" to "Set". The reason field is a drop-down box allowing the editor to choose from seven options. There is no default, so prefill or make the default the RD1 option.

There is also a field for "other/additional reason". I don't know about other editors, but I typically use that field to add the URL of the copyrighted source material. I fully grant that changing the first field and selecting the reason is only a couple of clicks, but a couple of clicks repeated ten thousand times adds up. This customized template would mean I could just drop in the source URL, which I typically already have in my buffer, and complete the RD1 in half the time.--Sphilbrick (talk) 13:10, 31 August 2023 (UTC)

@Sphilbrick Done! I've also added an undo link (as you wouldn't usually rollback here), and a Delete link for new pages. The latter fills in the deletion summary with G12, and also supplies the top source URL. I can't do the same for Undo and also have the automated summary (undid revision by so-and-so), unfortunately. MusikAnimal (WMF) (talk) 21:13, 1 September 2023 (UTC)
Oh, I should have mentioned, however, that the deletion reason auto-selection only works for English Wikipedia, as we must hard-code the value. This doesn't scale well and is fragile (i.e. if someone changes the copyvio reason at en:MediaWiki:Deletereason-dropdown then our code must also be updated). Longer-term, I was thinking we could have an interface page where admins for said wiki can customize the links in CopyPatrol. This would allow CopyPatrol users to update the deletion reason as needed without developer intervention, and also allow each wiki to customize links that meet their workflows. MusikAnimal (WMF) (talk) 21:20, 1 September 2023 (UTC)
I tried the revdel option and loved it. I didn't even ask to prefill the source as I thought that was asking too much, but at least in this case it worked. I was able to invoke the revdel and complete it with a single click. KUDOS Sphilbrick (talk) 13:05, 2 September 2023 (UTC)
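To make the pre-filled link idea concrete, here is a hedged sketch of how such a URL could be assembled. This is not the CopyPatrol code. The query parameter names (`type`, `ids`, `wpHidePrimary`, `wpRevDeleteReasonList`, `wpReason`) are assumptions about what the Special:RevisionDelete form accepts and are not confirmed anywhere in this discussion. The helper name `revdel_url` and its arguments are hypothetical as well.

```python
from urllib.parse import urlencode


def revdel_url(wiki_base: str, rev_id: int, reason_option: str, source_url: str) -> str:
    """Build a link to the revision deletion form with fields pre-filled.

    All parameter names below are assumptions, not a documented contract.
    """
    params = {
        "title": "Special:RevisionDelete",
        "type": "revision",
        "ids": rev_id,
        # Assumed: pre-select "Set" for "Delete revision text".
        "wpHidePrimary": 1,
        # Assumed: value of the drop-down reason (the wiki's RD1 option).
        "wpRevDeleteReasonList": reason_option,
        # Assumed: the "other/additional reason" field, here the source URL.
        "wpReason": source_url,
    }
    return f"{wiki_base}/w/index.php?{urlencode(params)}"
```

Because the drop-down value must match the wiki's own reason text exactly, it is passed in by the caller rather than hard-coded, which is the fragility noted above for wikis other than English Wikipedia.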

Match percentage

The percentage shown next to each "Compare" line now shows two places after the decimal point instead of rounding to a full percentage point. Is this something that was requested? It probably doesn't hurt anything, and if there is value in the additional places, fine, but I can't think of a situation where I would need the numbers to the right of the decimal point.--Sphilbrick (talk) 11:41, 3 September 2023 (UTC)

Missing reports

I understand this is a complete rewrite, so one shouldn't expect the exact same set of cases in the rewrite and the legacy version. However, I am puzzled to see this page: Draft:Nordic Film & TV Fund show up in the legacy version but not in the rewrite. It may be gone by the time you see this, but it was a 93% match and essentially a copy-paste from the "about us" page for the organization. I notice it is in draft space, but I do see some entries in draft space in the new version, so I am puzzled why this one wasn't picked up. Sphilbrick (talk) 12:26, 4 September 2023 (UTC)

It is at toolforge:plagiabot/en?id=8ef096d3-d98d-4c31-9d25-dcbf294c2286. — JJMC89(T·C) 17:24, 4 September 2023 (UTC)
Thanks, wonder how I missed it. Good to see. Sphilbrick (talk) 22:02, 4 September 2023 (UTC)

Damage score

I noticed "damage score" for the first time today. Example:

Link

My very cursory review of the entries on the current page identified three examples.

Can you explain what this means?--Sphilbrick (talk) 22:28, 14 September 2023 (UTC)

@Sphilbrick It's the same as the ORES score in the old system. ORES has been replaced by a new service called LiftWing, so we can't call it "ORES" anymore. The models are still the same, though, in this case the "damaging" model. I didn't add a link yet as I assume the Machine Learning team will move the documentation now that it has a new name. MusikAnimal (WMF) (talk) 03:07, 16 September 2023 (UTC)
OK thanks. Sphilbrick (talk) 12:49, 16 September 2023 (UTC)

Do we even need damage scores?

On the topic of damage scores (previously called ORES), I'm wondering just how useful this information is for CopyPatrol users. I ask because it is by far the slowest part of the application, especially since the new LiftWing system that replaced ORES requires us to make a separate request for each revision, instead of doing a bulk query. Once we fetch a damage score, we cache it, but since the feed is constantly updated, it usually will take a while on the first load of a session. If we take out damage scores entirely, you should experience a significant performance improvement. Pinging a few top users for feedback: @Sphilbrick, Diannaa, DanCherek, L3X1, Ymblanter, and Moneytrees: Thanks, MusikAnimal (WMF) (talk) 17:32, 5 October 2023 (UTC)

I guess I should note that since the old ORES system is now gone, I had to disable it in the old UI entirely, so you all have been going at least a few weeks now without "damage" (aka ORES) scores and no one seems to have complained… perhaps I already have the answer I need. MusikAnimal (WMF) (talk) 17:34, 5 October 2023 (UTC)
I am not even sure I know what a damage score is. Unless it is the same as the percentage of text overlap, I am probably not using it at all. Ymblanter (talk) 17:44, 5 October 2023 (UTC)
I see, it is not the same. I am unlikely to use it. Ymblanter (talk) 17:45, 5 October 2023 (UTC)
I don't think I ever used it, user edit count is what grabs my attention first then I go straight to the diff enL3X1 ¡‹delayed reaction›¡ 01:44, 6 October 2023 (UTC)
Removing it would not affect my workflow at all. DanCherek (talk) 18:08, 5 October 2023 (UTC)
I don't use it. Diannaa (talk) 21:00, 5 October 2023 (UTC)
Great, thanks for the replies, all! MusikAnimal (WMF) (talk) 00:48, 10 October 2023 (UTC)
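For readers wondering why the first load is slow, the fetch-then-cache behaviour described at the top of this section can be sketched roughly as follows. This is an illustration only, not CopyPatrol's code: the class name `DamageScoreCache` and the injected `fetch_one` callable (standing in for one request to the scoring service per revision) are hypothetical.

```python
from typing import Callable, Dict, Iterable


class DamageScoreCache:
    """Cache per-revision scores so each revision is requested at most once."""

    def __init__(self, fetch_one: Callable[[int], float]) -> None:
        # fetch_one is a hypothetical stand-in for a single-revision request.
        self._fetch_one = fetch_one
        self._scores: Dict[int, float] = {}

    def get_many(self, rev_ids: Iterable[int]) -> Dict[int, float]:
        result: Dict[int, float] = {}
        for rev_id in rev_ids:
            if rev_id not in self._scores:
                # Cache miss: one separate request for this revision.
                self._scores[rev_id] = self._fetch_one(rev_id)
            result[rev_id] = self._scores[rev_id]
        return result
```

Since the feed keeps receiving new revisions, each fresh page of results contains many cache misses, and every miss costs a separate request. That is why removing the scores entirely would speed things up.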

Fix the problem by the same user!

A user should NOT be allowed to "mark his/her own articles as fixed". Otherwise this tool will NOT be trusted. Here is an example: (https://copypatrol.toolforge.org/ar/?filter=reviewed&filterUser=Kamalelsayedmohamed). His articles are 99% copied from another site, and then the user marked them as "Fixed" or "No action needed"! Dr-Taher (talk) 05:59, 30 November 2023 (UTC)

@Dr-Taher See task T334272. 1AmNobody24 (talk) 06:58, 30 November 2023 (UTC)
Thanks @1AmNobody24, but it has been more than 7 months and no action has been taken! Dr-Taher (talk) 10:05, 30 November 2023 (UTC)
I'll get this implemented in the new version, which we'll be rolling out before the end of the year. However if the intention is solely to prevent misuse, it's worth noting a bad actor can easily get around this by simply creating a new account and using that to review their other account's edits. Perhaps use of CopyPatrol should be limited to autoconfirmed accounts? MusikAnimal (WMF) (talk) 20:23, 30 November 2023 (UTC)
Could there be an option for local communities to associate it with different rights (for example, it could be limited on EnWiki to new page reviewers if the community wants it, since Autoconfirmed gaming is very easy)? — Red-tailed hawk (nest) 03:29, 2 December 2023 (UTC)
@Red-tailed hawk There's talk about that here, task T178700. Auto-confirmed globally and either that or Extended confirmed for EN Wiki (@MusikAnimal (WMF) your task ) Nobody (talk) 13:06, 4 December 2023 (UTC)

Can the tool access paywalled full texts?

Curious whether this tool would detect violations like this from 2015, which copied from this source (you'll need to log in)? If not, have you considered whether the tool can be linked up with The Wikipedia Library to access full texts? Smartse (talk) 10:59, 19 December 2023 (UTC)

@Smartse I tried it by copying that old version to Draft:Sandbox. CopyPatrol picked up the edit [1]. In the iThenticate-Report it shows that source as a 13% match. Nobody (talk) 13:16, 19 December 2023 (UTC)
@1AmNobody24: Thanks for that - I see that percentage at 9% for link.springer.com, but looking at https://www.ithenticate.com/ I see that they do indeed have the full texts for many paywalled articles. Good to see that we should catch edits like this today, but I wonder how many we missed! Smartse (talk) 12:29, 21 December 2023 (UTC)

Question about marking edits

When I encounter an edit that somebody else has already fixed (by removing content and adding copyvio-revdel tags, or by tagging for G12), should I mark the edit as "Page fixed" or as "No action needed"? I've been marking these sorts of things as "Page fixed", since it was a true copyvio and the page was fixed, but the use of you in If you fixed the problem, tagged the page for revision deletion, or tagged the page for deletion as a copyright violation, mark it as "Page fixed" is now giving me a bit of pause. — Red-tailed hawk (nest) 02:54, 21 December 2023 (UTC)

@Red-tailed hawk I also mark those as Page fixed. You think something like If the problem is fixed, the page tagged for revision deletion, or tagged for deletion as a copyright violation, mark it as "Page fixed" could be better? Nobody (talk) 06:27, 21 December 2023 (UTC)
I think the proposed text would work well, yes. — Red-tailed hawk (nest) 16:38, 22 December 2023 (UTC)

User whitelist

Is that list still working? Cause this came in. Nobody (talk) 09:52, 22 December 2023 (UTC)

Sometimes things slip through; I don't know why. Diannaa (talk) 15:35, 26 December 2023 (UTC)

Is copy patrol down?

only 4 cases going back quite some hours enL3X1 ¡‹delayed reaction›¡ 21:34, 25 December 2023 (UTC)

I'm not seeing any significant gaps, just a general slowdown. I guess people had something else to do on Christmas Day. Diannaa (talk) 15:34, 26 December 2023 (UTC)