Research:Ethical and human-centered AI

From Meta, a Wikimedia project coordination wiki
Duration: 2018-August – 2019-May
Open access project, via Figshare: Jonathan T. Morgan, 2019. Ethical & Human Centered AI - Wikimedia Research 2030.
This page documents a completed research project.


Ethical & human centered AI whitepaper on Figshare

AI technologies have the potential to benefit the Wikimedia Movement, but they come with risks. The Wikimedia Foundation has begun to build AI products around these technologies. The emerging domain of ethical AI proposes new approaches for addressing the discrimination, disruption, and damage that AI can cause. The established discipline of human centered design provides guidance on how to maintain a focus on human needs and wellbeing throughout product development.

The purpose of this project is to help Wikimedia ensure ethical and human-centered outcomes in AI product development given our current and anticipated goals, needs, capacities, and workflows. The project makes two contributions: 1) it motivates a set of risk scenarios intended to define the problem space and promote reflective decision-making, and 2) it presents a set of process proposals for improving AI product development. Taken together, these scenarios and proposals can help Wikimedia address anticipated challenges and identify emerging opportunities to leverage AI technologies to further our mission.

This project page and the associated white paper represent an initial attempt to identify requirements for a minimum viable process for developing machine learning models, tools, datasets, and other AI products within the Wikimedia Movement in an ethical and human-centered way. This project is oriented towards Wikimedia’s 2017 Strategic Direction. It complements the strategic priorities described in the Research 2030 white papers (Knowledge gaps, Knowledge integrity, and Foundations) by Wikimedia Research and Augmentation by Wikimedia Audiences.

Background

AI Products

Machine learning systems consist of more than just the algorithms themselves. They also include other technological components that allow the algorithm to be trained and used for a particular purpose. Each of these components is designed and released, so each is an AI product in its own right.

AI products developed by the Wikimedia Foundation include (at least):

  1. Machine learning models: algorithms that use patterns in one set of data to make predictions about the characteristics of different data.
  2. Curated datasets: data collected or labeled to train machine learning models.
  3. Machine learning platforms: machine-learning-as-a-service applications that host models and provide programmatic access to those models.
  4. AI-driven applications: end-user facing apps, gadgets, and features powered by machine learning models.
  5. Data labeling applications: interfaces for humans to (re)classify or dispute model input and output data.

Ethical AI

The canonical definition of what constitutes ethical AI (or ethical behavior generally) is beyond the scope of this report. The general framework for ethical AI used in this report is based on the widely accepted principles of fairness, accountability, and transparency (FAT) viewed through the lens of the values of the Wikimedia Movement.

Given that framework, a basic definition of FAT might look something like this:

  • Fairness: the AI product does not actively or passively discriminate against groups of people in a harmful way.
  • Accountability: everyone involved in the development and use of the AI product understands, accepts and is able to exercise their rights and responsibilities.
  • Transparency: the intended users of an AI product can meaningfully understand the purpose of the product, how it works, and (where applicable) how specific decisions were made.

Human-centered AI

Human-centered design is a philosophy and a set of methods for ensuring that any designed thing (artifact, process, system) meets the needs of the people who will use, interact with, or be affected by it. One definition of a human-centered system that applies well to an AI context is:

  1. Designed to address human needs
  2. Based on an analysis of human tasks
  3. Built to account for human skills
  4. Evaluated in terms of human benefit

There are many other definitions of human-centered design, and closely related methodologies such as values-sensitive design and participatory design that prioritize, respectively, investigation and articulation of designer/stakeholder values and direct involvement of end-users in the design process. The definition presented above doesn't exclude either of these considerations; it's more a matter of focus.

Risk scenarios

Many AI ethics researchers have begun to develop scenarios as a way to communicate how even seemingly mundane or uncontroversial uses of machine learning can have negative consequences, and to spur discussion. Not all of the risk scenarios described by AI researchers are directly applicable in a Wikimedia context—for example, monetization of user data, or the dangers of autonomous cars. The following scenarios are inspired by those developed by Ethical OS[1] and Princeton University[2], but adapted to AI products currently developed by the Wikimedia Foundation (or at least within the realm of possibility for Wikimedia Foundation products).

The scenarios below are fictional. They are designed around realistic AI products and product use cases within the Wikimedia movement, but they are not statements of fact or findings from empirical research. They are intended to illustrate some of the bad outcomes that might result from seemingly sensible design decisions made by different people at different points in the product development process.

These scenarios are not intended to suggest that any particular person, product, or type of product is biased, harmful, malicious, or fundamentally flawed; rather, they are intended to illustrate some of the types of ethical and human-centered design issues that this type of product could present in a Wikimedia Movement context.

Scenario A: Reinforcing existing biases in article content

Wikimedia builds a section recommendation feature into the editing interface. This feature uses machine learning to suggest a list of potential section headings for very short articles—based on the sections that already exist in other articles that resemble them—creating hooks to encourage article expansion.

The section recommender learns that Wikipedia biographies about men are likely to have section titles like “Career” and “Awards and honors”, while biographies of women are more likely to have sections with titles like “Personal life” and “Family”. The feature is widely used: it increases the overall quality of short Wikipedia articles, but also increases the systemic bias in the way women and men are portrayed on Wikipedia.

Scenario B: Discouraging diversity in content and contributors

Wikimedia adds a draft quality score into the new article review dashboard. The quality prediction model weighs spelling errors and grammatical disfluencies highly when scoring articles, but doesn’t consider the number of citations or the notability of the topic. As a result, articles written by people for whom English is a second language tend to have lower overall scores.

Many content gaps in Wikipedia are around topics that native English speakers tend to be less knowledgeable about or interested in. However, reviewers find the scores work well enough and they can make acceptance and rejection decisions much faster based on the score alone. As a result, reviewer workload is decreased, but good quality new articles on important topics are rejected at an increased rate, and culturally diverse contributors frequently see their hard work deleted.

Scenario C: Lack of transparency and recourse

Wikimedia builds filters powered by machine learning into the recent changes, article history, and watchlist feeds on Wikipedia. One of these filters highlights edits that have a high probability of being performed with malicious intent.

A 5-year veteran editor with over 50,000 edits notices that many of their recent edits have been highlighted as likely bad faith, and that their edits are now being reverted at a much higher rate than before the filters were rolled out. After some sleuthing, they notice some patterns and believe they have figured out why: their edits are being erroneously tagged because of a ‘corner case’, a rare combination of factors related to the kinds of edits they make and the articles they work on that is confusing the model. However, they are unable to discover a way to report the issue, correct the faulty predictions, or confirm their suspicions about the cause. In the meantime, the heightened level of scrutiny and rejection they experience from their fellow editors leads to embarrassment, conflict, and a sense of alienation, and they consider leaving Wikipedia.

Scenario D: External re-use and harm

Wikimedia releases a dataset of Wikipedia talk page comments, labelled by crowdworkers for key words and phrases related to toxic speech. An external developer trains a machine learning model on this dataset, and uses the model to power an automated content moderation system for an online depression support forum for at-risk teens.

Teens experiencing mental health crises tend to use emotionally charged language and words commonly associated (in other contexts) with aggression and hate speech. While the automated system proves effective at deleting trolling posts quickly, it also flags and deletes many legitimate support requests that are appropriate to the forum and permitted under its rules. The messages that demonstrate the greatest need for support are the most likely to be blocked by the tool.

Scenario E: Community disruption and cultural imperialism

Wikimedia builds a tool that recommends articles to translate from one language to another. The tool uses machine translation to generate pretty-good translations of articles. The goal is to leverage the content and community of large Wikipedias to help smaller Wikipedias grow.

The translations are good enough that editors from English Wikipedia who know a little bit of Amharic feel confident using this tool to publish lots of articles to Amharic Wikipedia after a quick review and some light clean-up. They appreciate the opportunity to find valuable work that matches their interests and their expertise. Amharic Wikipedia is much smaller than English, has fewer editors overall, and fewer bilingual editors, and the local community is currently focusing on expanding their Wikipedia organically and curating the content they already have.

The Amharic community soon finds itself overwhelmed by an influx of new, imperfectly translated articles. Although the encyclopedia grows faster, local editors must now focus their energy on fixing errors and completing partial translations, rather than writing the articles that they are interested in writing, or that they believe are most important to their readers.

Scenario F: Culturally-mediated assumptions of usefulness

Wikimedia deploys a new ranking algorithm to power the top articles feed in the English Wikipedia Android App. The previous ranking algorithm was based on a simple pageview-based metric: it reflected what Wikipedia readers are interested in reading. The new ranking is based on a more sophisticated machine learning model that identifies trending articles based on patterns of editing activity associated with breaking news events: it reflects what Wikipedia editors are interested in editing.

Most English Wikipedia editors are North American or Western European. Many members of the app development team are English Wikipedia editors. For them, the new ranking seems to perform better: it surfaces trending articles that are more relevant to their interests.

However, a large proportion of mobile English readers come from India. These Indian readers value the pageview-based feed because it frequently surfaces articles that are culturally relevant to them. After the new algorithm is deployed, the feed contains fewer articles these readers find interesting. Over time, they unconsciously begin using the Wikipedia app less frequently in favor of information sources that reflect their interests and meet their needs better.

Process proposals

Checklists

“Checklists connect principle to practice. Everyone knows to scrub down before the operation. That's the principle. But if you have to check a box on a form after you've done it, you're not likely to forget. That's the practice. And checklists aren't one-shots. A checklist isn’t something you read once at some initiation ceremony; a checklist is something you work through with every procedure.”[3]
Overview
An ethical AI checklist consists of a list of important steps that must be taken, or questions that must be answered, at each stage of the product development. Checklists work best when the process of working through the checklist is performed consistently, transparently, and collaboratively among team members.
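As a rough illustration of how such a checklist could be tracked per development stage, here is a minimal Python sketch. The class name, field names, stage names, and example questions are all hypothetical and are not taken from any existing Wikimedia tool or published checklist. The sketch treats an item as complete only when someone has signed off and a written outcome is recorded, which is one possible guard against rubber-stamping.

```python
from dataclasses import dataclass, field


@dataclass
class ChecklistItem:
    stage: str          # hypothetical stage name, e.g. "data collection"
    question: str       # the step to take or the question to answer
    answered_by: list[str] = field(default_factory=list)  # team members who signed off
    notes: str = ""     # written outcome of the discussion

    def is_complete(self) -> bool:
        # An item counts only if someone signed off AND recorded what was decided.
        return bool(self.answered_by) and bool(self.notes.strip())


def open_items(checklist: list[ChecklistItem], stage: str) -> list[ChecklistItem]:
    """Return the items for a stage that have not been completed yet."""
    return [item for item in checklist if item.stage == stage and not item.is_complete()]


# Hypothetical example items for illustration only.
checklist = [
    ChecklistItem("data collection", "Who is under-represented in the training data?"),
    ChecklistItem("data collection", "Have we documented how labels were produced?",
                  answered_by=["analyst"], notes="Labeling guide linked from the dataset page."),
    ChecklistItem("deployment", "Is there a way for users to report bad predictions?"),
]

blocking = open_items(checklist, "data collection")
print([item.question for item in blocking])
```

Because the open items are computed rather than ticked off by hand, any team member can see at a glance which questions are still unanswered for the current stage.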
Pros and cons
Pros
  • Aids in identification of hidden assumptions, potential negative impacts
  • Can cover both concrete requirements ("do this") and softer requirements ("have a conversation about this before proceeding")
  • Facilitates broader participation in decision-making among team members
  • Makes it easier for any member of the product team to "flag" missed steps or considerations without fear of reprisal
  • Encourages articulation of audience, purpose, and context; success metrics and thresholds
  • Increases process consistency between and across teams
  • Tracks progress towards goals
Cons
  • Need to be flexible enough to work across products and team workflows, but standardized enough to ensure a baseline level of due diligence
  • Example AI checklists exist, but few have been vetted/tested in actual product development contexts
  • Binary outcome ("we talked about FOO") may encourage rubber-stamping
Further reading
  1. Of oaths and checklists[3]
  2. Care about AI ethics? What you can do, starting today[4]
  3. DEON: An Ethics Checklist for Data Scientists[5]
  4. Ethical OS Toolkit[6]

Impact assessments

"Algorithms and the data that drive them are designed and created by people -- There is always a human ultimately responsible for decisions made or informed by an algorithm. "The algorithm did it" is not an acceptable excuse if algorithmic systems make mistakes or have undesired consequences."[7]
Overview
An ethical AI impact statement is a product plan that is published before substantial development begins. Impact statements include a detailed product rationale, supporting research, risk assessment, success criteria, and maintenance and monitoring plans.
Pros and cons
Pros
  • Aids in identification of hidden assumptions, potential negative impacts
  • Encourages in-depth justification for design decisions
  • Encourages articulation of audience, purpose, and context; success metrics and thresholds
  • Documentation increases accountability for outcomes
Cons
  • Time-consuming to create, and unclear whether the expense is justified
  • Few real-world examples of "algorithmic impact statements" available to learn from
  • Substantial overlap with checklists (depending on what's in the checklist)
  • Not always clear who the target audience for the document is


Further reading
  1. Social Impact Assessment: Guidance for assessing and managing the social impacts of projects[8]
  2. Algorithmic Impact Assessments: a Practical Framework for Public Agency Accountability[9]
  3. Principles for Accountable Algorithms and a Social Impact Statement for Algorithms[7]
  4. Ethics & Algorithms Toolkit[10]

Prototyping and user testing

"Understanding how people actually interact—and want to interact—with machine learning systems is critical to designing systems that people can use effectively. Exploring interaction techniques through user studies can reveal gaps in a designer’s assumptions about their end-users and may lead to helpful insights about the types of input and output that interfaces for interactive machine learning should support."[11]
Overview
Prototyping is a process of making iterative, incremental refinements to a design based on explicit feedback or observations of use before full deployment. Prototypes are usually lower fidelity than the final product: e.g. sketches, mock-ups, or simplified versions.
Pros and cons
Pros
  • Encourages definition of audience, purpose, and context (APC) and success metrics ahead of time
  • Can be performed with low-fidelity interfaces, early stage models, or even before any software or ML engineering has begun. Can even be performed on documentation for datasets and APIs
  • Allows identification of unanticipated issues (such as issues of bias or harm) before committing extensive resources towards a particular design solution
  • Can help avoid costly failures that require teams to pivot or re-boot late in the design/dev process, or after deployment
Cons
  • Works best when performed by people with some degree of familiarity with UX design or research methods, a resource not available to all teams
  • Can slow down development in some cases, and can sometimes be challenging to implement in Agile/scrum or other XP paradigms
  • Identification of issues (whether bias or user experience) requires access to representative test users and an approximation of a typical context of use


Further reading
  1. Power to the People: The Role of Humans in Interactive Machine Learning[11]
  2. User perception of differences in recommender algorithms[12]
  3. The usability and utility of recommender systems can be quantitatively measured through user studies[13]
  4. Making recommendations better: an analytic model for human-recommender interaction[14]

Pilots and trials

"Many of our fundamentally held viewpoints continue to be ruled by outdated biases derived from the evaluation of a single user sitting in front of a single desktop computer. It is time for us to look at evaluations in more holistic ways. One way to do this is to engage with real users in 'Living Laboratories', in which researchers either adopt or create real useful systems that are used in real settings that are ecologically valid."[15]
Overview
A pilot is a fixed-term or limited scale deployment of a final product, where the decision to fully deploy is deferred until the outcome of the pilot is assessed. Unlike prototypes, pilots involve putting finished products in front of real users and tracking how the product performs in the wild over an extended period of time.
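One way to use pilot results when tracking performance over an extended period is to keep the pilot's measurements as a baseline and compare later measurements against it. The Python sketch below is only an illustration under stated assumptions: the function names, the choice of precision as the metric, the 0.10 tolerance, and the sample data are all hypothetical, and a real monitoring setup would need a metric and threshold chosen for the product in question.

```python
def precision(outcomes):
    """outcomes: list of (predicted_positive, actually_positive) booleans."""
    flagged = [actual for predicted, actual in outcomes if predicted]
    return sum(flagged) / len(flagged) if flagged else None


def has_drifted(baseline, current, tolerance=0.10):
    """True when current precision falls more than `tolerance` below the pilot baseline."""
    base_p, cur_p = precision(baseline), precision(current)
    if base_p is None or cur_p is None:
        return False  # nothing was flagged in one window, so there is nothing to compare
    return (base_p - cur_p) > tolerance


# Hypothetical hand-checked samples: 10 flagged and 10 unflagged items per window.
pilot_window = [(True, True)] * 9 + [(True, False)] * 1 + [(False, False)] * 10
later_window = [(True, True)] * 7 + [(True, False)] * 3 + [(False, False)] * 10

print(precision(pilot_window), precision(later_window))
print(has_drifted(pilot_window, later_window))
```

In this made-up sample, precision drops from 0.9 during the pilot to 0.7 later, which exceeds the assumed tolerance and would prompt a closer look at the model.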
Audience
The product team, end-users of the product (individuals or communities)
Pros and cons
Pros
  • Allows team to understand the ecological validity ("does it work as intended?") of the AI product before committing to release it into their product ecosystem and maintain it long term
  • Allows team to measure the ecological impact ("what are the adjacent and downstream effects?") of their AI product on the product ecosystem before committing
  • Allows long-term impact measurement of performance (longitudinal analysis) and comparative measurement (A/B testing) against success criteria with real users
  • Provides baselines for long-term performance monitoring (e.g. detecting model drift)
  • Increases accountability and supports user trust in the product team and the organization ("if it doesn't work, we will turn it off")
Cons
  • Extends the product development timeline
  • Not a substitute for iterative prototyping and testing
  • Unintended negative consequences impact people's lives


Further reading
  1. A Position Paper on 'Living Laboratories': Rethinking Ecological Designs and Experimentation in Human-Computer Interaction[15]
  2. Behaviorism is Not Enough: Better Recommendations through Listening to Users[16]
  3. Research:Autoconfirmed article creation trial

Interpretable models

Overview
Interpretable machine learning models are models that a) expose the logic behind a particular output or decision, and/or b) expose the general features, procedures, or probabilities implicated in their decision-making in a way that the intended audience can understand.
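To make the idea of exposing the logic behind a particular output more concrete, here is a minimal Python sketch of a linear (logistic) scoring model that returns each feature's contribution alongside its prediction. The feature names, weights, and bias are invented for illustration and do not describe any real Wikimedia model.

```python
import math

# Hypothetical feature weights for a toy draft-quality model (not a real Wikimedia model).
WEIGHTS = {"citations": 0.8, "spelling_errors": -0.3, "sections": 0.4}
BIAS = -1.0


def predict_with_explanation(features: dict[str, float]) -> tuple[float, dict[str, float]]:
    """Return a probability plus the contribution of each feature to the score."""
    contributions = {name: WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS}
    score = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-score))  # logistic function
    return probability, contributions


probability, contributions = predict_with_explanation(
    {"citations": 3, "spelling_errors": 4, "sections": 2}
)
# List features from most to least influential for this one decision.
for name, value in sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: {value:+.2f}")
print(f"probability of 'good quality': {probability:.2f}")
```

Because the score is a plain sum, the per-feature contributions fully account for the decision, which is the property that makes downstream explanations and audits straightforward for this class of model.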
Pros and cons
Pros
  • Facilitates downstream explanations for individual algorithmic decisions
  • Facilitates iterative development and comparative evaluation towards fairness and utility benchmarks, not just accuracy and performance
  • Facilitates external auditing, internal sanity checks, formal user testing, and end-user feedback
Cons
  • Accuracy and performance may be lower overall compared to more opaque models (e.g. deep learning) for some ML tasks
  • Making the model more interpretable may allow people to "game" the system in deceptive or damaging ways


Further reading
  1. How the machine ‘thinks’: Understanding opacity in machine learning algorithms[17]
  2. The Promise and Peril of Human Evaluation for Model Interpretability[18]
  3. Toward human-centered algorithm design[19]

End-user documentation

"Because the linguistic data we use will always include pre-existing biases and because it is not possible to build an NLP system in such a way that it is immune to emergent bias, we must seek additional strategies for mitigating the scientific and ethical shortcomings that follow from imperfect datasets. We propose here that foregrounding the characteristics of our datasets can help, by allowing reasoning about what the likely effects may be and by making it clearer which populations are and are not represented."[20]
Overview
End-user documentation consists of detailed descriptions of the intended audience and use cases for an AI product, with a focus on potential issues, limitations, and other special considerations for use.
Pros and cons
Pros
  • Can ensure that AI product users have the information they need to make informed decisions about how to use the product and/or interpret its functionality
  • Easily transportable with the data/code, wherever it goes
  • Easily adaptable to the needs of different users (e.g. third-party tool devs vs. data scientists) and different AI products (e.g. training datasets vs. AI platform APIs)
  • Many existing frameworks and best practices from software dev are likely applicable to AI product context; some new ones have been proposed specifically for AI bias contexts
Cons
  • Can be costly to create and maintain
  • Writing good documentation is hard
  • Not always clear how much documentation is necessary and sufficient for a given audience; the documentation itself may require user testing
  • People don't always read the docs


Further reading
  1. Data Statements for NLP: Toward Mitigating System Bias and Enabling Better Science[20]
  2. Increasing Trust in AI Services through Supplier's Declarations of Conformity[21]
  3. The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards[22]
  4. Datasheets for Datasets[23]
  5. The Types, Roles, and Practices of Documentation in Data Analytics Open Source Software Libraries[24]

UI explanations

"How can we provide meaningful control over the recommendation process to users, so that they can understand the decisions they make about their recommendations and customize the system to their particular needs?"[16]
Overview
UI explanations consist of contextual metadata about how a model works that is made available to end users at the point of use. UI explanations can be written in words, or presented as statistical probabilities or graphical visualizations.
Pros and cons
Pros
  • Can encourage trust among end users
  • Empowers end users to make informed decisions about how to use an AI product or how to interpret its decisions
  • Extensive research literature on the effectiveness of different textual, numeric, and visual approaches to explanation, at least in some domains (e.g. recommender systems)
  • Testable; it's possible to empirically verify whether an explanation works or not, sometimes even before you build your model or your interface
  • Encourages feedback, auditing, monitoring against drift, and potentially re-training of the model
Cons
  • Not always clear how much information to include: tension between overwhelming/distracting the user and depriving them of important insights
  • Potential tension between choosing the most correct and the most persuasive explanation
  • Depends on interpretable models (or computational methods for making opaque model output more interpretable)


Further reading
  1. Evaluating the effectiveness of explanations for recommender systems: Methodological issues and empirical studies on the impact of personalization[25]
  2. Explaining data-driven document classifications[26]
  3. User interface patterns in recommendation-empowered content intensive multimedia applications[27]

Auditing mechanisms

"Algorithm transparency is a pressing societal problem. Algorithms provide functions like social sorting, market segmentation, personalization, recommendations, and the management of traffic flows from bits to cars. Making these infrastructures computational has made them much more powerful, but also much more opaque to public scrutiny and understanding. The history of sorting and discrimination across a variety of contexts would lead one to believe that public scrutiny of this transformation is critical. How can such public interest scrutiny of algorithms be achieved?"[28]
Overview
Auditing mechanisms are features that allow individuals or groups outside of the product team to inspect and spot-check individual inputs and outputs of a machine learning model, or critically evaluate the design process behind an AI product.
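As one illustration of what a spot-check of model outputs might compute, the Python sketch below compares false positive rates across groups in a small hand-labeled sample. The group labels, the record layout, and the sample itself are hypothetical; a real audit would need a representative sample and a metric chosen to suit the product.

```python
from collections import defaultdict


def false_positive_rates(records):
    """records: iterable of (group, predicted_bad, actually_bad) tuples."""
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for group, predicted_bad, actually_bad in records:
        if not actually_bad:
            negatives[group] += 1
            if predicted_bad:
                false_pos[group] += 1
    # Only groups with at least one truly-good item appear in the result.
    return {group: false_pos[group] / negatives[group] for group in negatives}


# Hypothetical hand-labeled audit sample of edits flagged by a filter.
sample = [
    ("newcomer", True, False), ("newcomer", False, False),
    ("newcomer", True, True), ("newcomer", True, False),
    ("veteran", False, False), ("veteran", False, False),
    ("veteran", True, False), ("veteran", False, False),
]
rates = false_positive_rates(sample)
print(rates)
```

A large gap between groups in a check like this does not prove unfair bias on its own, but it points the auditor to where a closer look is warranted.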
Pros and cons
Pros
  • Increases transparency and underscores organizational commitment to ethical and human-centered AI
  • Facilitates identification of potentially problematic edge- and corner-cases by external experts and power users
  • Facilitates early detection of model drift
Cons
  • Most effective when paired with interpretable models and UI explanations
  • Can expose organization to public embarrassment based on individual examples of failure, whether or not those examples are representative of larger or problematic error patterns (e.g. unfair bias against a group)
  • May require dedication of substantial platform or personnel resources to support ad hoc use
  • Support requirements vary depending on the capabilities of the auditor and the nature of the audit: do they need a fully-featured web application that supports arbitrary input and provides UI explanations, a well-documented API that exposes model and decision-level metadata, or just a sample dataset and a public GitHub repository?


Further reading
  1. Algorithmic Accountability Reporting: On the Investigation of Black Boxes[29]
  2. Auditing Algorithms : Research Methods for Detecting Discrimination on Internet Platforms[28]

Feedback mechanisms

"Behavioral data without proper grounding in theory and in subjective evaluation might just result in local optimization or short term quick wins, rather than long term satisfaction. When can we know from the behavior of a user if the recommendations help to fulfill their needs and goals?"[16]
Overview
Feedback mechanisms are features that allow product users to correct, contest, refine, discuss, or dismiss the output of a machine learning model at the point of use.
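The following Python sketch shows one possible shape for the data such a mechanism might capture, and a simple summary a team could compute from it. The record fields, the three action names, and the example entries are assumptions made for illustration; they are not the design of JADE or any other existing system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    prediction_id: str   # hypothetical identifier for one model output
    action: str          # one of: "confirm", "dispute", "dismiss"
    comment: str = ""    # optional free-text rationale from the user


def dispute_rate(feedback: list[Feedback]) -> float:
    """Share of feedback entries that contest the model's output."""
    if not feedback:
        return 0.0
    counts = Counter(entry.action for entry in feedback)
    return counts["dispute"] / len(feedback)


# Hypothetical feedback log for illustration only.
log = [
    Feedback("edit-101", "confirm"),
    Feedback("edit-102", "dispute", "Reverting vandalism, not adding it."),
    Feedback("edit-103", "dispute", "Quoted text from the source."),
    Feedback("edit-104", "dismiss"),
]
print(dispute_rate(log))
```

Keeping the free-text comment next to the structured action is a deliberate choice in this sketch: the action supports aggregate monitoring, while the comment gives the team context for triage and possible re-training.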
Pros and cons
Pros
  • Can be used to re-train the machine learning model
  • Helps the team flag emerging issues of bias, harm, or other unintended consequences
  • Helps the team quickly identify technical and UX issues
  • Increases trust and user acceptance
  • Can yield insights into user expectations, workflows, and context of use
Cons
  • Usefulness of feedback depends heavily on the design of the feedback mechanism
  • Takes resources to monitor, triage, respond to, and make use of feedback (depending on the mechanism for feedback collection and the kind of feedback collected)
  • Privacy considerations around how feedback is captured and stored, and who has access


Further reading
  1. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments[30]
  2. JADE: The Judgement and Dialogue Engine

References

  1. "Ethical OS Toolkit". Ethical OS: A guide to anticipating future impacts of today's technologies (in en-US). Retrieved 2019-01-24. 
  2. "Princeton Dialogues on AI and Ethics". Princeton Dialogues on AI and Ethics (in en-US). 2018-04-19. Retrieved 2019-01-24. 
  3. Patil, DJ (2018-07-17). "Of oaths and checklists". O'Reilly Media. Retrieved 2018-12-17.
  4. Adler, Steven (2018-09-25). "Care about AI ethics? What you can do, starting today". Medium. Retrieved 2019-01-25. 
  5. "Deon: An Ethics Checklist for Data Scientists - DrivenData Labs". drivendata.co. Retrieved 2019-01-25. 
  6. "Ethical OS Toolkit" (in en-US). Retrieved 2019-01-25. 
  7. "Principles for Accountable Algorithms and a Social Impact Statement for Algorithms :: FAT ML". www.fatml.org. Retrieved 2019-01-25.
  8. Franks, Daniel; Aucamp, Ilse; Esteves, Ana Maria; Vanclay, Francis (2015-04-01). "Social Impact Assessment: Guidance for assessing and managing the social impacts of projects". 
  9. Reisman, D., Schultz, J., Crawford, K., & Whittaker, M. (2018). Algorithmic Impact Assessments: a Practical Framework for Public Agency Accountability. Retrieved from https://ainowinstitute.org/aiareport2018.pdf
  10. "Ethics & Algorithms Toolkit (beta)". ethicstoolkit.ai. Retrieved 2019-01-25. 
  11. Amershi, Saleema; Cakmak, Maya; Knox, William Bradley; Kulesza, Todd (2014-12-22). "Power to the People: The Role of Humans in Interactive Machine Learning". AI Magazine 35 (4): 105–120. ISSN 2371-9621. doi:10.1609/aimag.v35i4.2513.
  12. Ekstrand, Michael D.; Harper, F. Maxwell; Willemsen, Martijn C.; Konstan, Joseph A. (2014). "User Perception of Differences in Recommender Algorithms". Proceedings of the 8th ACM Conference on Recommender Systems. RecSys '14 (New York, NY, USA: ACM): 161–168. ISBN 9781450326681. doi:10.1145/2645710.2645737. 
  13. Ricci, Francesco; Rokach, Lior; Shapira, Bracha; et al., eds. (2011). "Recommender Systems Handbook". doi:10.1007/978-0-387-85820-3. 
  14. McNee, Sean M.; Riedl, John; Konstan, Joseph A. (2006-04-21). "Making recommendations better: an analytic model for human-recommender interaction". ACM. pp. 1103–1108. ISBN 1595932984. doi:10.1145/1125451.1125660. 
  15. Chi, Ed H. (2009). Jacko, Julie A., ed. "A Position Paper on ’Living Laboratories’: Rethinking Ecological Designs and Experimentation in Human-Computer Interaction". Human-Computer Interaction. New Trends. Lecture Notes in Computer Science (Springer Berlin Heidelberg): 597–605. ISBN 9783642025747. doi:10.1007/978-3-642-02574-7_67.
  16. Ekstrand, Michael D.; Willemsen, Martijn C. (2016). "Behaviorism is Not Enough: Better Recommendations Through Listening to Users". Proceedings of the 10th ACM Conference on Recommender Systems. RecSys '16 (New York, NY, USA: ACM): 221–224. ISBN 9781450340359. doi:10.1145/2959100.2959179.
  17. Burrell, Jenna (2016-01-05). "How the machine ‘thinks’: Understanding opacity in machine learning algorithms". Big Data & Society 3 (1): 205395171562251. ISSN 2053-9517. doi:10.1177/2053951715622512. 
  18. Herman, Bernease (2017-11-20). "The Promise and Peril of Human Evaluation for Model Interpretability". arXiv:1711.07414 [cs, stat]. 
  19. Baumer, Eric PS (2017-07-25). "Toward human-centered algorithm design". Big Data & Society 4 (2): 205395171771885. ISSN 2053-9517. doi:10.1177/2053951717718854. 
  20. Bender, Emily; Friedman, Batya (2018-09-24). "Data Statements for NLP: Toward Mitigating System Bias and Enabling Better Science". Transactions of the ACL.
  21. Hind, Michael; Mehta, Sameep; Mojsilovic, Aleksandra; Nair, Ravi; Ramamurthy, Karthikeyan Natesan; Olteanu, Alexandra; Varshney, Kush R. (2018-08-22). "Increasing Trust in AI Services through Supplier's Declarations of Conformity". arXiv:1808.07261 [cs]. 
  22. Holland, Sarah; Hosny, Ahmed; Newman, Sarah; Joseph, Joshua; Chmielinski, Kasia (2018-05-09). "The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards". arXiv:1805.03677 [cs]. 
  23. Gebru, Timnit; Morgenstern, Jamie; Vecchione, Briana; Vaughan, Jennifer Wortman; Wallach, Hanna; Daumé III, Hal; Crawford, Kate (2018-03-23). "Datasheets for Datasets". arXiv:1803.09010 [cs].
  24. Geiger, R. Stuart; Varoquaux, Nelle; Mazel-Cabasse, Charlotte; Holdgraf, Chris (2018-12-01). "The Types, Roles, and Practices of Documentation in Data Analytics Open Source Software Libraries". Computer Supported Cooperative Work (CSCW) 27 (3): 767–802. ISSN 1573-7551. doi:10.1007/s10606-018-9333-1. 
  25. Tintarev, Nava; Masthoff, Judith (2012-10-01). "Evaluating the effectiveness of explanations for recommender systems". User Modeling and User-Adapted Interaction 22 (4): 399–439. ISSN 1573-1391. doi:10.1007/s11257-011-9117-5. 
  26. "MIS Quarterly". misq.org. doi:10.25300/misq/2014/38.1.04. Retrieved 2019-01-25. 
  27. Cremonesi, Paolo; Elahi, Mehdi; Garzotto, Franca (2017-02-01). "User interface patterns in recommendation-empowered content intensive multimedia applications". Multimedia Tools and Applications 76 (4): 5275–5309. ISSN 1573-7721. doi:10.1007/s11042-016-3946-5. 
  28. Sandvig, C., Hamilton, K., Karahalios, K., & Langbort, C. (2014). Auditing Algorithms : Research Methods for Detecting Discrimination on Internet Platforms. Data and Discrimination: Converting Critical Concerns into Productive Inquiry, a preconference at the 64th Annual Meeting of the International Communication Association. Seattle, Washington, USA. Retrieved from http://www-personal.umich.edu/~csandvig/research/Auditing%20Algorithms%20--%20Sandvig%20--%20ICA%202014%20Data%20and%20Discrimination%20Preconference.pdf
  29. Diakopoulos, Nicholas (2014). "Algorithmic Accountability Reporting: On the Investigation of Black Boxes". doi:10.7916/D8ZK5TW2. 
  30. Elsayed, Tamer; Kutlu, Mucahid; Lease, Matthew; McDonnell, Tyler (2016-09-21). "Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments". Fourth AAAI Conference on Human Computation and Crowdsourcing.