Grants:Project/DPLA/Extending the DPLA digital asset pipeline to improve quality and discoverability

From Meta, a Wikimedia project coordination wiki


status: selected
Extending the DPLA digital asset pipeline to improve quality and discoverability
summary: Funding software development and prototyping to improve the DPLA digital asset pipeline by implementing Structured Data on Commons, continuous data synchronization, improved analytics for partners, and experimentation with SDC-based image citation solutions.
target: Wikimedia Commons, Wikipedia (primarily English)
amount: $54,684
nonprofit: Yes
grantee: Dominic
contact: dominic(_AT_)dp.la
organization: Digital Public Library of America
created on: 17:38, 8 February 2021 (UTC)


Project idea

What is the problem you're trying to solve?

What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.

In 2021, the Digital Public Library of America is seeking to enact a Wikimedia program that will support GLAM-Wiki efforts in the United States. The cornerstone of that program is the digital asset pipeline we built in 2020, which allows any DPLA contributing institution to opt in to regular, automatic uploads of their media files to Wikimedia Commons (as long as they meet the data requirements) without any software development or bot work of their own. This project allows DPLA to offer a service to its members with measurable impact, while also providing value to Wikimedia in the form of high-quality cultural heritage content and edits from newly trained information professionals. The system was developed entirely by DPLA without any prior use of Wikimedia funding.

DPLA is a huge potential hub of content for Wikimedia projects. In the pilot year of this pipeline, DPLA uploaded over 1 million files to Wikimedia Commons, and has already reached the 2 million mark just a few months after that, making DPLA the largest single contributor to Commons. But this number still represents less than 5% of public domain items aggregated by DPLA today (and nearly any cultural institution not already contributing to DPLA's aggregation could join the network if they would like to participate in the pipeline to Commons). What has been built so far, therefore, is an effective firehose, but we also recognize that in order to sustain the project's growth and preserve the goodwill of the Wikimedia community, we need to increasingly focus on delivering quality and not just quantity. In particular, as the Wikimedia Commons community moves toward more widespread adoption of Structured Data on Commons, DPLA needs to be part of that effort, both as the largest single source of media and as a stakeholder whose own success depends on improved discoverability.

What is your solution to this problem?

For the problem you identified in the previous section, briefly describe how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.

In order to improve quality and extend functionality, DPLA has identified four discrete enhancements we propose to make to the pipeline. These projects would help us unlock further impact for our proposed Wikimedia program for GLAM-Wiki in the United States.

Implementation of Structured Data on Commons
Use of Structured Data on Commons statements on uploaded files was always part of the vision of our pipeline, but it was too challenging to accomplish in the pilot phase. There were both technical reasons (Pywikibot, which we were using for uploads, did not yet support SDC) and social reasons (lack of a clear consensus on data modeling) which made it the wrong time to begin adding millions of SDC statements. Since then, the landscape has changed, and there are more tools and community clarity around SDC. We believe it is reasonable now to implement structured data statements on the DPLA pipeline's uploads, and that this would enhance discoverability of the images as well as improve the usefulness of the data itself.
Data synchronization
Data synchronization means that rather than treating each upload to Wikimedia Commons as a one-off contribution, which is how most bulk uploads work, DPLA will continue to maintain that upload over time. The pipeline will perform additional non-upload edits to the pages on Wikimedia Commons, such as updating a page when the metadata from the data source changes and renaming a file when its title field changes. This ensures the quality of the data, since metadata is not static and can change over time for a variety of reasons, ranging from simple factual or grammatical fixes to addressing outdated or harmful language. We would also use this functionality to propagate the implementation of SDC statements back into the now millions of legacy uploads DPLA has already performed.
Analytics reporting
The statistics DPLA and its partners need in order to measure the impact of our efforts on Wikimedia come from a number of different places. Some tools, such as BaGLAMa, GLAMorous, and GLAMorgan, can be used to find data (such as number of uploads, files in use, or page views), but institutional staff need specialized knowledge to find and use them. DPLA already provides an Analytics Dashboard to partners, which displays statistics about their collection's metadata coverage and usage, and we plan to provide Wikimedia data in an easily accessible form by building it into the dashboard.
Image citation pilot
In order to help demonstrate the value of our SDC work, DPLA is committed to experimenting with new solutions our work makes possible. One idea we would test is how storing captions and item-level metadata as structured data on Commons files will allow Wikipedia editors, especially GLAM staff new to editing, to more easily add images. Please see our initial mock-up, which envisions descriptive metadata from GLAM collections brought into an article (using Lua) as citation data and captions. This is based on the original concept described in Wikipedia:Image citation, but extended to the new capabilities of SDC. As part of this project, we would conduct a basic investigation of this approach, with template coding and user testing with DPLA partners.

Project goals

What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.

  • DPLA will increase the usefulness of millions of Wikimedia Commons media files by converting template text to structured data and providing regular updates to Commons file metadata.
  • DPLA will produce reports and prototypes that demonstrate the value of its Commons uploads, and the promise of SDC generally.

Project impact

How will you know if you have met your goals?

For each of your goals, we’d like you to answer the following questions:

  1. During your project, what will you do to achieve this goal? (These are your outputs.)
  2. Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)

For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (i.e. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.

  1. Goal 1: DPLA will increase the usefulness of millions of Wikimedia Commons media files by converting template text to structured data and providing regular updates to Commons file metadata.
    • Output: We will develop new features for our existing pipeline that will, for the first time, use structured data statements, and will allow the bot to edit past uploads to add the statements and to update any metadata that has changed since upload time. Having already uploaded millions of files to Wikimedia Commons, our target for this goal is at least 2 million edits to add structured data and make metadata updates in the grant period, plus continuing this process for all future uploads.
    • Outcome: The promise of structured data is that it makes Wikimedia Commons images more discoverable and its data queryable. This will be of particular value to DPLA's project because, having uploaded over 2 million files to Wikimedia Commons, we are highly interested in doing everything we can to surface those images and make the uploads more useful for both regular Wikipedians and our own partners, whom we are training on editing Wikipedia. The more use the images get, the more impact Wikipedia's readers will see, in addition to DPLA's partners. The data synchronization plan is part of this effort, because it will be our approach for applying structured data to our past uploads, but it will also maintain the accuracy of DPLA's uploads over time.
  2. Goal 2: DPLA will produce reports and prototypes that demonstrate the value of its Commons uploads, and the promise of SDC generally.
    • Output: This goal is supported by the work DPLA will do to bring Wikimedia analytics data into its Analytics Dashboard, and the image citation prototype. The result will be a Wikimedia statistics page in production on DPLA's dashboard, which is available to partners. The image citation prototype will produce templates and mockups functional enough for demonstration purposes, as well as a write-up summarizing results of our testing with actual users (from a DPLA institutional partner).
    • Outcome: The reason for the components of our project plan relating to the DPLA Analytics Dashboard and the image citation prototype is to ensure impact goes beyond the data that lives on Wikimedia Commons. The analytics we report to our partners are not just about showing the results of our work, but are our main motivator for incentivizing continued and new participation, so this will indirectly lead to further content for Wikimedia projects.

      Our hope is that the image citation prototype would help provide insights into a use case for file captions and structured data on Commons that has not yet been fully explored. This particular use case is of interest to GLAMs, which (based on our conversations) see their digitized collections as informational content whose metadata is worthy of inclusion in an article's reference list—as most academic works do. The product of our work might help WMF scope or prioritize future SDC development work, or lead to new citation templates that could be adopted by the community.

Do you have any goals around participation or content?

Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable.

Our project will have a major effect on improving the content of Wikimedia Commons, per the third shared metric, and so we have included a numeric target under Goal 1. In addition, we expect indirect improvements to Wikipedia content through the software development improving discoverability for Commons images to be used in Wikipedia, and the citation prototype's potential for improving reliability in future iterations.

Project plan

Activities

Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow-up with people that are involved with your project?

Implementation of Structured Data on Commons
The implementation of Structured Data on Commons will be the largest part of the activities planned.
  1. There will be intellectual work here to map DPLA fields to Wikidata properties to be used in SDC statements.
    • We will attempt to convert as many fields as possible to structured data over time, but we would at least start with the most straightforward ones, such as "title", "author string", "copyright", "collection", etc.
  2. In addition to the field mapping, there will be some amount of entity mapping.
    • For example, DPLA has a controlled set of rights URIs, which would be added using P6426 or P275, but first we need to identify the Qids for each of the possible licenses' items.
  3. DPLA needs to develop a new system of transforming original metadata from DPLA's aggregation into the set of wiki text and SDC statements that is saved with Commons uploads.
  4. DPLA will also work with the Wikimedia Commons community to enact any needed changes to metadata templates (e.g. using Lua) in order to adopt data from SDC statements.
  5. Finally, the Pywikibot code for User:DPLA bot will need updates, as SDC statements are edited using different methods than updating wiki page content.
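The field and entity mapping steps above can be sketched roughly as follows. This is a minimal illustration, not DPLA's actual data model: the record field names (`title`, `creator`, `collection`, `rights`), the choice of properties for each field, and the `entity_map` lookup are all assumptions made for the example. The statement layout follows the general Wikibase JSON shape, and any real Qids would still need to be identified as described in step 2.

```python
"""Sketch: turn a DPLA-style record into Wikibase statement JSON for SDC."""

# Field -> (Wikidata property, value kind). The property IDs exist on
# Wikidata; pairing them with these hypothetical DPLA field names is an
# assumption of this sketch, not a settled data model.
FIELD_MAP = {
    "title": ("P1476", "monolingualtext"),  # title
    "creator": ("P2093", "string"),         # author name string
    "collection": ("P195", "item"),         # collection
    "rights": ("P6426", "item"),            # RightsStatements.org statement
}


def make_statement(prop, kind, value, language="en"):
    """Build one statement in the Wikibase JSON shape."""
    if kind == "monolingualtext":
        datavalue = {"type": "monolingualtext",
                     "value": {"text": value, "language": language}}
    elif kind == "string":
        datavalue = {"type": "string", "value": value}
    elif kind == "item":
        datavalue = {"type": "wikibase-entityid",
                     "value": {"entity-type": "item", "id": value}}
    else:
        raise ValueError(f"unsupported value kind: {kind}")
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {"snaktype": "value", "property": prop,
                     "datavalue": datavalue},
    }


def record_to_claims(record, entity_map):
    """Map known fields to statements; collect values that have no Qid yet.

    entity_map is a hypothetical lookup from a source value (for example a
    rights URI or collection name) to the Qid of its Wikidata item.
    """
    claims, unmapped = [], []
    for field, (prop, kind) in FIELD_MAP.items():
        value = record.get(field)
        if not value:
            continue
        if kind == "item":
            qid = entity_map.get(value)
            if qid is None:
                unmapped.append((field, value))  # needs entity mapping first
                continue
            value = qid
        claims.append(make_statement(prop, kind, value))
    return {"claims": claims}, unmapped
```

Returning the unmapped values separately keeps the entity-mapping backlog visible, so a record can still be uploaded with the statements that are already resolvable.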
Data synchronization
Data synchronization is a complex task: it involves detecting differences between the uploaded state of metadata on Wikimedia Commons and the latest metadata in DPLA's aggregation, then editing the wiki page to reflect the newest metadata (without losing any useful contributions from the community). Developers here would:
  1. Investigate the best ways to determine Wikimedia Commons page content at scale for comparisons (database dumps? live database access?).
  2. Regularly regenerate metadata for past uploads after updated ingests from DPLA partners, in order to detect changes.
  3. Write new Pywikibot processes for User:DPLA bot to save new metadata or move files as needed.
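One possible way to detect changes without overwriting community contributions is a three-way comparison between what the bot last wrote, what is currently on the wiki page, and the latest DPLA metadata. The sketch below assumes metadata can be reduced to flat field/value dictionaries and that the bot keeps a record of its last written state; both are assumptions for illustration, and the function and field names are hypothetical.

```python
def plan_sync(last_written, on_wiki, latest, title_field="title"):
    """Decide which fields to update, and which to leave for human review.

    last_written: metadata the bot saved at its previous edit (assumed stored)
    on_wiki:      metadata currently found on the Commons page
    latest:       metadata regenerated from the newest DPLA ingest
    """
    updates, conflicts = {}, {}
    for field in sorted(set(last_written) | set(latest)):
        base = last_written.get(field)
        new = latest.get(field)
        current = on_wiki.get(field)
        if new == base:
            continue  # the source did not change this field
        if current == new:
            continue  # the wiki already matches the new source value
        if current == base:
            updates[field] = new  # untouched since the bot's edit: safe
        else:
            # Someone edited the field on the wiki; do not overwrite it.
            conflicts[field] = {"wiki": current, "source": new}
    return {
        "updates": updates,
        "conflicts": conflicts,
        "rename": title_field in updates,  # a title change implies a file move
    }
```

Fields that end up under `conflicts` are left unchanged, so a correction made by a Commons editor is flagged for review and not silently reverted.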
Analytics reporting
This would help to fund development work on DPLA's Analytics Dashboard, which is used for reporting outcomes to our project partners. This includes:
  1. Scripting and database work to collect data on upload counts, file usage, and page views from various data sources (e.g. Wikimedia Commons API, AQS, and BaGLAMa)
  2. Associated UI work to incorporate data into the web view.
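As a rough illustration of the page view collection step, the sketch below builds a request URL for the per-article endpoint of the Wikimedia pageviews REST API and totals the `views` values from a parsed response. The helper names are hypothetical, no network request is made here, and how DPLA's dashboard would actually fetch and store the data is not specified in this proposal.

```python
from urllib.parse import quote

AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(project, article, start, end,
                  access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL; start and end are YYYYMMDD strings."""
    # Titles use underscores for spaces and must be percent-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return f"{AQS_BASE}/{project}/{access}/{agent}/{title}/{granularity}/{start}/{end}"


def total_views(response):
    """Sum the 'views' field over the 'items' list of a parsed JSON response."""
    return sum(item["views"] for item in response.get("items", []))
```

For example, `pageviews_url("en.wikipedia", "Digital Public Library of America", "20210101", "20210131")` gives the URL for one month of daily counts, and the decoded JSON body can then be passed to `total_views`.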
Image citation pilot
The work of the image citation pilot will include:
  1. Producing an experimental Lua-based template which, using the DPLA SDC data model, can automatically generate a caption and reference for an image using an image's structured data.
  2. Testing use of the template with partner librarians, as a possible method for introducing new users to editing.
  3. Making a write-up with learnings, and recommendations for next steps.
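The pilot template itself would be written in Lua, but the logic it needs can be sketched in Python for illustration. The example below assumes an image's structured data has already been flattened into a dictionary with hypothetical keys (`title`, `caption`, `creator`, `date`, `institution`, `source_url`); the citation layout is likewise an assumption, not a proposed citation style.

```python
def caption_and_reference(sdc):
    """Derive a caption and a <ref> citation from flattened structured data."""
    title = sdc.get("title")
    # Prefer an explicit caption, falling back to the title.
    caption = sdc.get("caption") or title or ""
    parts = [
        sdc.get("creator"),
        f"''{title}''" if title else None,  # italicized in wikitext
        sdc.get("date"),
        sdc.get("institution"),
        sdc.get("source_url"),
    ]
    citation = ". ".join(part for part in parts if part)
    reference = f"<ref>{citation}.</ref>" if citation else ""
    return caption, reference


def image_wikitext(filename, sdc):
    """Return thumbnail wikitext whose caption carries the generated reference."""
    caption, reference = caption_and_reference(sdc)
    return f"[[File:{filename}|thumb|{caption}{reference}]]"
```

An editor would then only need to supply the file name, with the caption and reference text coming from the metadata already stored on Commons.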

Budget

How will you use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!

DIRECT COSTS
PERSONNEL COSTS
Title/Role            Monthly Base    Months    Fringe Rate    Funds Requested
Software Developer    $8,292          3.00      24.00%         $24,876
Program Manager       $9,166          1.00      24.00%         $9,166
Intern                $3,000          2.50      0.00%          $7,500
SALARIES AND WAGES                                             $41,542
FRINGE BENEFITS                                                $8,170
TOTAL DIRECT COSTS                                             $49,712
INDIRECT COSTS
Total Direct Cost (TDC) Rate: 10.00%                           $4,972
TOTAL COSTS                                                    $54,684

Community engagement

How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve at various points during your project? Community input and participation helps make projects successful.

The content donations to Wikimedia supported by this software development are intended to provide value to the Wikimedia community. In our Community Organizing proposal, we are committed to awareness-raising activities in order to ensure usage of the images uploaded. In addition, for the specific technical projects laid out above, community input will be necessary. Both the SDC implementation and data synchronization projects will theoretically produce millions of new edits to Wikimedia Commons to alter file metadata, and we would commit to doing that in a way that is most useful to the community and in accordance with its wishes. We believe this will be possible, as we are already deeply involved in the community and have participated in conversations surrounding Structured Data on Commons.

As we have also noted in the Community Organizing proposal, this project also engages stakeholders external to Wikimedia projects—namely DPLA's partners in the US GLAM sector. We continue to believe that our Wikimedia program provides value to the cultural institutions, in addition to the Wikimedia projects, by generating impact for their work. We communicate outcomes with these partners on a regular basis, including in newsletters, blogs, webinars, and staff contacts, and with the participants we provide actual training and support for Wikipedia editing.

Get involved

Participants

Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.

Digital Public Library of America

About
The Digital Public Library of America amplifies the value of libraries and cultural organizations as Americans’ most trusted sources of shared knowledge. We do this by collaborating with partners to accelerate innovative tools and ideas that empower and equip libraries to make information more accessible.
We work with a national network of partners to:
  • Make millions of materials from libraries, archives, museums, and other cultural institutions across the country available to all in a one-stop discovery experience.
  • Provide a library-controlled marketplace and platform for libraries to purchase, organize, and deliver ebooks and other e-content to their patrons.
  • Convene library leaders and practitioners to explore and advance technologies that serve, inform, and empower their communities.
To find out more about our mission, strategy, and work, please see our Strategic Plan.
Size of organization
DPLA is well aware of the sense within the Wikimedia community that project grants should be directed as much as possible to individuals and organizations that are not already well-resourced enough to take on the projects on their own. For that reason, we want to state transparently that, despite the size and influence of our network, we believe DPLA is exactly the type of organization worth funding. For all its broad scope and vision, DPLA itself is a small organization of just 12 virtual staff with an annual budget of about $2 million. Its funding comes primarily from grants and membership dues.
Principal staff

For full staff bios of the individuals below, see DPLA staff page.

  • Dominic Byrd-McDevitt, Program Manager
    As one of the most longstanding leaders in the GLAM-Wiki movement, Dominic has nearly a decade of experience managing GLAM-Wiki programs in cultural institutions. He served as Wikipedian in Residence for the Smithsonian Institution, and also for years at the US National Archives (NARA). Highlights of his NARA tenure include pioneering approaches for bulk upload to Wikimedia Commons, successfully executing one of the largest bulk uploads prior to DPLA; developing the GLAM-Wiki Boot Camp model of capacity-building workshops for Wikimedians with Wikimedia DC; assisting in organizing the 2015 WikiConference North America, which was held at the US National Archives; and contributing over 1 million edits to Wikidata in the course of adding NARA archival metadata to items.
    Dominic would serve as the Wikimedia Program Manager for DPLA, having led the pilot phase from its inception. Dominic is DPLA's principal Wikimedian, and brings his wealth of Wikimedia community and editing experience to the role. He is primarily responsible for organizing outreach efforts to the DPLA network, advising on any technical development, providing the actual training and support to the partner institutions, liaising with the Wikimedia community, and directing the work of the intern.
  • Michael Della Bitta, Director of Technology; and DPLA Tech Team
    Michael oversees the technical aspects of the Wikimedia program, and also takes part in defining its strategy. The DPLA Tech Team, managed by Michael, with its staff of software developers, provides the technical support necessary for the DPLA Wikimedia program. This includes maintenance of the digital asset pipeline to Wikimedia Commons, and any bugfixing or feature development it requires, as well as the technical needs related to metadata ingestion from partners, analytics and reporting, etc. The DPLA Tech Team has fully developed the digital asset pipeline in house, and would undertake all software development in this project.

Advisors

  • Giovanna Fontenelle: As Program Officer, GLAM and Culture, Giovanna is part of the Wikimedia Foundation's team working on Structured Data on Commons, and will advise on the project.

Community notification

You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. Need notification tips?

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

  • Strong support - I strongly support this project because it leverages the Digital Public Library of America (DPLA), a known and highly developed network of state and regional hubs that have already been highly active in the aggregation of metadata and digital resources from the GLAM sector. The work that Dominic has already accomplished at DPLA with Commons is excellent and is scaling up monthly. For this project to be effective and sustainable over the long term, it has to address the variation and changeable state of metadata. I formerly worked as a librarian and a manager at 2 institutions in Pennsylvania that were founding members of the Pennsylvania DPLA hub PA Digital. So I know that DPLA conducts full data harvests on a periodic basis, during which time the metadata and description for the original items may have changed. Therefore data standardization and synchronization becomes super important, as addressed by this grant proposal, so that item description will be the same across platforms. Based upon my experience as a digital librarian and a Wikipedian, this project is an excellent prototype that will help potentially hundreds (eventually thousands) of GLAM institutions in North America. Dorevabelfiore (talk) 18:38, 24 March 2021 (UTC)
  • Support - An important step in inspiring GLAMs to collaborate with Wikimedia. Adam Harangozó (talk) 08:07, 5 May 2021 (UTC)