Learning patterns/Collecting data on offline meetups
What problem does this solve?
Offline face-to-face meetings are an important part of the Wikimedia community. However, there are only a few systematic assessments of the effects of such meetings on online behaviour, as the meeting data can be difficult to obtain. While meetups are most often organised publicly (at least within the Wikipedia project), the relevant pages can be difficult to find and scrape. In this learning pattern, I want to highlight how the meeting data can best be found, collected, and cleaned, and discuss problems that can arise in the process.
What is the solution?
This learning pattern focuses on the project Wikipedia.
Step 1: Finding the data
Meetups are generally organised publicly, but how can they be found? Most meetups, if not international, are organised on the respective language version of Wikipedia, so collecting them requires adequate language skills. As the data is user-written rather than process-generated, it comes in an unstructured format, may contain errors and exceptions, and needs to be read and understood.
When working with Wikipedia meetup data, as with any other data, the first and most important step is to decide which data you really need. Do you want to assess informal meetups, more formally organised editathons, all GLAM-related events, all gatherings at community spaces, or something else? The answer will guide what to collect and where to search for it.
The starting point for any data collection on informal meetups is the meetup overview page of the language version under consideration, for example the English, German, or Spanish overview page; other language versions can be found through the interlanguage links. How these overview pages are set up depends on the language version. In the case of the German version, the overview page links to over one hundred pages for regions and cities where meetups between German-speaking Wikipedians are or have been organised and archived. These regional meetups are most often informal meetings whose main focus is socialising in public spaces.
Editathons are often listed on other pages, which need to be checked in the relevant language version. For example, the English Wikipedia has an incomplete list of editathons, while the German counterpart, de:WP:Edit-a-thon, is more extensive. Editathons are events where (potentially new) editors of Wikipedia meet to edit and improve a specific topic or type of content; they generally include basic editing training for new editors.
These overview pages already link to various other ones that might be relevant. In collecting meeting data, it is important to follow a snowball approach and to keep a record of links to relevant pages, as well as to remember which pages were already visited and which still need to be visited. For example, in the German case, one will also stumble on the open editing events, another sort of editing meetup.
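The snowball approach described above amounts to a breadth-first traversal that keeps a record of pages already seen. The following is a minimal sketch, not a finished crawler: `get_links` and `is_relevant` are hypothetical callables that you would supply yourself (for instance, a function that returns the wiki links of a page, and a filter on page titles), and the page titles in the toy example are made up.

```python
from collections import deque
from typing import Callable, Iterable


def snowball(
    start_pages: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    is_relevant: Callable[[str], bool],
    max_pages: int = 1000,
) -> list[str]:
    """Follow links breadth-first and return the pages in visiting order."""
    queue = deque(dict.fromkeys(start_pages))  # de-duplicated, order preserved
    seen = set(queue)                          # pages already queued or visited
    visited: list[str] = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in get_links(page):
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return visited


# Toy link structure standing in for real wiki pages (hypothetical titles).
toy_links = {
    "Meetup overview": ["Meetup/Zurich", "Help:Contents"],
    "Meetup/Zurich": ["Meetup overview", "Meetup/Zurich/Archive"],
}
pages = snowball(
    ["Meetup overview"],
    get_links=lambda page: toy_links.get(page, []),
    is_relevant=lambda title: title.startswith("Meetup"),
)
```

The returned list doubles as the record of visited pages; anything still in the queue when `max_pages` is reached would need to be visited in a later run.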
Again depending on the language version, there are also categories of events. In the German Wikipedia, there are categories for general events and for events covered by insurance (including activities such as staffing stalls that represent Wikipedia at fairs, workshops about photography, or events related to the GLAM initiative). Further, the WikiProjects and task forces can be checked, as they are central places for specific content. They are used to communicate, collect sources, and provide summaries and support regarding specific topics. They form a sort of virtual gathering place for Wikipedia editors interested in working on a specific cluster of topics, and in some instances the editors involved in these groups also organise meetups.
Additionally, it is useful to check the calendar (see, for example, the German calendar). The calendar offers a central collection of project-related events and dates, but it cannot be used as the sole source of information because not every organiser makes use of it.
Lastly, meetups can also be organised on Meta instead of on a language version of Wikipedia.
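Pages in an event category can be listed through the MediaWiki Action API's `categorymembers` list module. The sketch below only builds the request URL; the category title is a placeholder, not the name of an actual event category, and long categories would additionally require following the API's continuation values.

```python
from urllib.parse import urlencode


def category_members_url(api_url: str, category: str, limit: int = 500) -> str:
    """Build an API request URL that lists the pages in one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,   # full title including the namespace prefix
        "cmlimit": limit,
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


# "Kategorie:Example" is a placeholder; replace it with the category you found.
url = category_members_url("https://de.wikipedia.org/w/api.php", "Kategorie:Example")
```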
Step 2: Scraping the data
After knowing where to find data on meetups, the information needs to be scraped for quantitative analysis. Generally, meetup pages include the date, place/venue, and attendees of a meeting, which makes it possible to build a network with time stamps. In most cases, the data also includes apologies and minutes of the meetup. Which data should be collected depends on the research question at hand, but generally "more is better" to allow for extensive re-use of the data.
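To illustrate what a network with time stamps can look like, the sketch below turns attendance lists into co-attendance edges. The record layout and the usernames are assumptions made for this example; your own data structure may differ.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: one dict per meetup with date, place, and attendees.
meetups = [
    {"date": "2019-03-02", "place": "Zurich", "attendees": ["UserA", "UserB", "UserC"]},
    {"date": "2019-04-06", "place": "Zurich", "attendees": ["UserA", "UserB"]},
]


def co_attendance_edges(records):
    """Return (user_1, user_2, date) for every pair that attended the same meetup."""
    edges = []
    for record in records:
        # sorted(set(...)) removes duplicate sign-ups and fixes the pair order
        for user_1, user_2 in combinations(sorted(set(record["attendees"])), 2):
            edges.append((user_1, user_2, record["date"]))
    return edges


edges = co_attendance_edges(meetups)
# Number of shared meetups per pair, usable as an edge weight.
weights = Counter((user_1, user_2) for user_1, user_2, _ in edges)
```

Keeping the date on every edge means the same list can later be sliced by time period or aggregated into a weighted network.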
The way the meetings are organised and archived on the organisational pages varies. Generally, there are the following two approaches on how meetings are archived:
- A structured archive of all meetings with a consistent structure. In these cases, every meetup has been recorded, and data on at least the date and place/venue of each meeting are available. In terms of data collection, this is the ideal case, as it makes it possible to write an automated script.
- Meetings were not archived at all. The organisational pages were only used to organise the most recent meeting. Due to Wikipedia's technical structure, it is possible to retrieve information about past meetings using the version history. In these cases, it is necessary to scan (manually or automatically) through the complete version history to find past meetings before they were deleted in favour of the next meeting.
These are two ideal types. In reality, they occur in different sub- and hybrid forms. In the ideal version of case 1, all meetings are archived on a single page and recorded following a consistent structure. In a less ideal version of case 1, all meetings are recorded in archives, but there are separate archives for single years, and the structure and format of the archives vary between years. In some cases, organisational pages might not maintain an archive but at least provide an overview of all meetings and link to the respective pages in the version history. In other cases, only some meetings might be archived, for example only meetups of a specific kind or for specific years. In the most unfortunate case, some meetings are simply left out. Skipped meetings can only be noticed when the rhythm of meetings is broken (e.g. a group seems to meet monthly but there is a month without a meeting), and the version history then needs to be checked manually.
How can such inconsistent data be collected? It is not possible to write a single script that scrapes all the data, as each entity that organises meetups can have its own structure and way of archiving. A script therefore has to be adapted to each organiser, and writing one is not always worthwhile: if a region organised only ten meetups, it might be simpler to collect the data manually than to identify a machine-readable structure.
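A broken rhythm can be spotted automatically once the archived dates are parsed. The sketch below assumes a monthly rhythm, which is only one possible pattern, and lists every month between the first and last recorded meeting that has no meeting; these are the months for which the version history would be checked.

```python
from datetime import date


def missing_months(meeting_dates):
    """Return (year, month) pairs without a meeting between first and last meeting."""
    months_seen = {(d.year, d.month) for d in meeting_dates}
    if not months_seen:
        return []
    (year, month), last = min(months_seen), max(months_seen)
    gaps = []
    while (year, month) < last:
        if (year, month) not in months_seen:
            gaps.append((year, month))
        # step to the next calendar month
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return gaps


# Hypothetical meeting dates with one skipped month.
dates = [date(2015, 11, 7), date(2015, 12, 5), date(2016, 2, 6)]
gaps = missing_months(dates)
```

A gap is only a hint: the group may really have skipped a month, so each flagged month still needs a manual look.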
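For organisers whose past meetings only survive in the version history, the revisions of a page can be listed through the MediaWiki Action API (`action=query` with `prop=revisions`). The following is a sketch under assumptions rather than a complete scraper: the function names and the `User-Agent` string are placeholders, and the `fetch` parameter exists only so that the paging logic can be tried out without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(api_url, params):
    """Default fetcher: one GET request against the MediaWiki API."""
    request = Request(
        f"{api_url}?{urlencode(params)}",
        # Placeholder User-Agent; replace with your own project and contact.
        headers={"User-Agent": "meetup-research-sketch/0.1 (you@example.org)"},
    )
    with urlopen(request) as response:
        return json.load(response)


def iter_revisions(api_url, title, fetch=fetch_json):
    """Yield revision metadata of one page, oldest first, following continuation."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": "max",
        "rvdir": "newer",
        "format": "json",
        "formatversion": "2",
    }
    while True:
        data = fetch(api_url, params)
        for page in data.get("query", {}).get("pages", []):
            yield from page.get("revisions", [])
        if "continue" not in data:
            break
        # The API returns the parameters needed to request the next batch.
        params = {**params, **data["continue"]}
```

Each yielded revision carries a `revid` and `timestamp`, which can be used to request the text of the revisions that look like meeting announcements instead of downloading every version in full.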
Generally, the following features are important when writing a web-scraper:
- Working with HTML versus WikiText: Should the scraper work with HTML and download the page content, or is it easier to copy the WikiText into a text document and extract patterns from it? This is an important consideration. In my experience it is easier to work with formatted WikiText, but downloading page content as part of a script improves reproducibility and makes it easier to repeat the collection.
- How is the meeting information structured? What are the key words and patterns a scraper should look for? Generally, archived meetings do have some structure and include key words like "Attendees" or "minutes" and recurring patterns like dates and usernames.
- Where can the most reliable information be scraped from? For example, attendance lists should be collected from minutes that were published after the meeting took place, if available.
- Exceptions and problems need to be flagged in some way so they can be double-checked manually.
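The points above can be sketched for one simple, assumed page layout: every meetup is a level-2 WikiText heading containing an ISO date, followed by a list of user links. Real pages will deviate from this, which is why the sketch attaches flags to every section it cannot parse cleanly. The heading format, the namespace prefixes, and the usernames are assumptions for this example and not the structure of any particular meetup page.

```python
import re
from datetime import datetime

# Links such as [[User:Name|label]] or [[Benutzer:Name]]; talk pages do not match.
USER_LINK = re.compile(r"\[\[\s*(?:User|Benutzer(?:in)?)\s*:\s*([^\]|#/]+)", re.IGNORECASE)
# Level-2 headings only: "== text ==" but not "=== text ===".
HEADING = re.compile(r"^==(?!=)[ \t]*(.+?)[ \t]*(?<!=)==[ \t]*$", re.MULTILINE)


def parse_meetups(wikitext):
    """Split WikiText at level-2 headings and extract date and attendees per section."""
    parts = HEADING.split(wikitext)[1:]  # [heading, body, heading, body, ...]
    meetups = []
    for heading, body in zip(parts[0::2], parts[1::2]):
        flags = []
        try:
            meeting_date = datetime.strptime(heading, "%Y-%m-%d").date()
        except ValueError:
            meeting_date = None
            flags.append("unparsed date")
        names = {name.strip().replace("_", " ") for name in USER_LINK.findall(body)}
        if not names:
            flags.append("no attendees found")
        meetups.append({"heading": heading, "date": meeting_date,
                        "attendees": sorted(names), "flags": flags})
    return meetups


sample = """== 2019-03-02 ==
Attendees:
* [[User:UserA|A]]
* [[Benutzer:UserB]]
== Spring meeting ==
No list yet.
"""
result = parse_meetups(sample)
```

Sections with a non-empty `flags` list are the ones to double-check by hand.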
An example web-scraper (for the Zurich meetups) can be found in my GitHub repository. Feel free to adapt it and use it for your own needs.
Step 3: Cleaning the data
Before the meetup data can be analysed any further, it will most likely require substantial cleaning, as the data was user-written and not process-generated. This can mean fixing dates or correcting attendance lists (for example, if a user added themselves to a meetup many years later without actually having attended).
Depending on your research interest, it might also be necessary to connect people to usernames; if your analysis focuses on Wikipedia user accounts, this is not an issue. However, if you want to merge meetup attendance with metadata about online behaviour, and if you are interested in the people behind Wikipedia accounts, names should be matched to people and name changes need to be consolidated. Users on Wikipedia can request name changes, which are usually granted. Additionally, it is generally allowed to simply sign up as a new user, which allows for greater anonymity. After renaming, the old name will generally redirect to the new name, and users can also set up their own redirects. Also, since August 2008, all user accounts are set up as global user accounts (single-user login), which came at the cost of a significant renaming effort.
All redirects pointing to one user can be collected using the API, and the rename log can also be scraped. This makes it possible to create a list of all current users with their redirects and previous usernames, insofar as they requested an official rename or linked to other accounts using redirects. In some cases, users also explicitly mention a previous or alternative username when signing up for a meeting. Such instances need to be recorded. In cases where users created a new account to gain more anonymity, it is impossible to link them to their previous name.
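One way to find attendees who added themselves long after a meetup is to compare the date on which each name was added to the page with the meeting date. The sketch below assumes you already know that date for every name (for example from revision timestamps); the 30-day grace period is an arbitrary choice for illustration, not a recommended value.

```python
from datetime import date, timedelta


def split_late_signups(meeting_date, signups, grace=timedelta(days=30)):
    """Separate attendees whose entry was added long after the meeting took place.

    `signups` maps each username to the date on which the name was added.
    """
    kept, suspicious = [], []
    for user, added_on in sorted(signups.items()):
        if added_on > meeting_date + grace:
            suspicious.append(user)
        else:
            kept.append(user)
    return kept, suspicious


# Hypothetical sign-ups for a meetup on 12 May 2012.
kept, suspicious = split_late_signups(
    date(2012, 5, 12),
    {"UserA": date(2012, 5, 1), "UserB": date(2012, 5, 14), "UserC": date(2019, 8, 3)},
)
```

Names in `suspicious` are not deleted automatically; they are candidates for a manual check against minutes or other sources.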
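Once rename and redirect pairs have been collected (for instance through the API's `logevents` and `redirects` modules; the exact log type to query is something to verify for the wiki in question), chains of renames can be collapsed so that every known name points to the most recent one. A minimal sketch with made-up usernames:

```python
def canonical_names(renames):
    """Map every known username to its most recent name.

    `renames` is an iterable of (old_name, new_name) pairs.
    """
    successor = dict(renames)
    resolved = {}
    for name in set(successor) | set(successor.values()):
        current, seen = name, {name}
        # Follow the chain of renames; `seen` guards against circular redirects.
        while current in successor and successor[current] not in seen:
            current = successor[current]
            seen.add(current)
        resolved[name] = current
    return resolved


# Hypothetical rename and redirect pairs.
renames = [
    ("OldName", "MiddleName"),
    ("MiddleName", "NewName"),
    ("AltAccount", "NewName"),
]
mapping = canonical_names(renames)
```

This simple mapping assumes that an old username is never reused by a different person; if that can happen in your data, the rename date would need to be taken into account as well.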
Is the problem already solved?
Before you dive into data collection, check whether the data has already been collected by another busy bee.
Please add yourself to the following list if you have collected meetup data:
User | Wikimedia project | Time period covered | Meetups collected | Data access |
---|---|---|---|---|
ASociologist | German Wikipedia | 2001 - March 2020 | All meetups except regular events at community spaces | Not yet publicly available, please get in touch if you would like to use the data. |
ASociologist | Alemannic Wikipedia | 2001 - March 2020 | All meetups | Not yet publicly available, please get in touch if you would like to use the data. |
Add yourself | | | | |
Add yourself | | | | |
When to use
- Use this pattern when you want to better understand offline meetups of a Wikipedia community.
- I have used this approach in my project.
Endorsements
-- ASociologist (talk) 13:36, 20 May 2022 (UTC)