User:MPopov (WMF)/Notes/Text categorization
Consider these two scenarios:
- You want to group articles together but you don't know what the groups are ahead of time
- You want to categorize some piece of text – for example:
  - which job category a job posting is for
  - who wrote a piece of text when the author is unknown
  - whether a user comment is harassment
  - whether an email is spam
  - whether a review is positive or negative
These are all examples of problems in natural language processing and text mining, a combination of statistical analysis, machine learning, and information retrieval. In all of these cases, what you're interested in is a predictive model which, when given an input (data), predicts some output (a category).
Terminology
A document is a single unit of analysis and is usually made up of tokens (typically individual words, but tokens can also be combinations of words called n-grams). A document can be any size:
- each comment on a Talk page
- a whole page (either an article or an article's Talk page)
- each chapter in a book
- each book in a library
A document may also be made up of smaller documents so that analysis can be performed hierarchically. For example, instead of analyzing toxicity of a Talk page as a single blob of text, you might analyze toxicity of individual comments, aggregate those to form your analysis of toxicity of individual topics/conversations, and aggregate those to form your analysis of toxicity of the whole page.
The process of breaking down a document into smaller units of analysis (usually tokens) is called tokenization, and it varies from language to language. Tokenization is useful for calculating term frequencies (counts of how many times each term appears) and for obtaining embeddings (numerical representations of terms, useful for calculating similarities).
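As a quick illustration, here is a minimal sketch of tokenization and term-frequency counting using scikit-learn's CountVectorizer (it assumes scikit-learn 1.0 or newer); the two documents are made up for the example, and embeddings are left out since they require a separate model (e.g. word2vec or fastText).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny, made-up documents
docs = [
    "The quick brown fox jumps over the lazy dog",
    "The dog sleeps while the fox runs",
]

# Tokenize into unigrams and bigrams (1- and 2-word n-grams) and count them
vectorizer = CountVectorizer(ngram_range=(1, 2))
term_counts = vectorizer.fit_transform(docs)

# Print each term alongside its total frequency across both documents
for term, count in zip(vectorizer.get_feature_names_out(),
                       term_counts.toarray().sum(axis=0)):
    print(term, count)
```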
Scenarios
Scenario 1: Topic Modeling
If you don't know ahead of time what groups your documents fall into, that's a job for topic modeling. The topics (groups) are a latent (hidden) variable, and you use statistical models to infer which documents belong to which of the unknown groups.
I recommend the following resources for learning more about and actually doing this:
- Text Mining with R (free book) chapter on Topic Modeling
  - Also this blog post from one of the book's authors: The game is afoot! Topic modeling of Sherlock Holmes stories
- Topic modeling in Python with gensim
  - Also Natural Language Toolkit for Python
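To make this concrete, here is a minimal sketch of fitting an LDA topic model with gensim. The five pre-tokenized documents and the choice of two topics are made up for the example; in practice you would also lowercase, remove stop words, and tune the number of topics.

```python
from gensim import corpora, models

# Made-up, pre-tokenized documents
documents = [
    ["dog", "barks", "park", "dog", "runs"],
    ["cat", "sleeps", "sofa", "cat", "purrs"],
    ["dog", "chases", "cat", "park"],
    ["stock", "market", "rises", "trading"],
    ["market", "falls", "stock", "prices"],
]

dictionary = corpora.Dictionary(documents)               # token <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words counts

# Fit an LDA model that assumes 2 latent topics
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=42)

# Top words per inferred topic
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)

# Inferred topic membership of the first document
print(lda.get_document_topics(corpus[0]))
```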
Scenario 2: Classification
If you already know which groups your documents should be categorized into and you have example documents for each of those groups, this becomes a supervised learning (classification) problem. The idea is to train a model to predict classes (groups). Binary classification, such as spam email detection (spam vs. ham), toxic comment detection (toxic vs. not), and sentiment analysis (positive vs. negative), extends naturally to more than two classes (e.g. which genre of music a song belongs to, which job category a job posting is for).
I recommend the following resources for learning more about and actually doing this:
- Text Mining with R (free book) chapter on Sentiment analysis with tidy data
- Supervised Machine Learning for Text Analysis in R (free book)
- Working with text data in scikit-learn
- Deep neural network-based solutions
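As an illustration, here is a minimal sketch of a binary text classifier with scikit-learn (the "Working with text data in scikit-learn" tutorial above covers this in depth). The four labeled reviews are made up for the example; a real model would need far more training data and a held-out evaluation set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up labeled training data for a binary sentiment task
texts = [
    "loved this article, very clear and helpful",
    "what a great explanation, thank you",
    "terrible writing, complete waste of time",
    "confusing, unhelpful, and full of errors",
]
labels = ["positive", "positive", "negative", "negative"]

# Vectorize the text (tokenize + weight terms) and fit a classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Predict the class of a new, unseen document
print(model.predict(["this was really helpful"]))        # e.g. ['positive']
print(model.predict_proba(["this was really helpful"]))  # class probabilities
```

The same pipeline handles more than two classes automatically; just include additional labels in the training data.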
ML as a service
You could stop there if your project is just a one-off categorization exercise, but you may also be interested in categorizing documents on a regular basis. Once you have a model you're satisfied with (acceptable accuracy, reasonable runtime for performing predictions, doesn't require too many resources), you can make it available as an API: you pass the data (e.g. documents) to an endpoint (local or hosted remotely), the application processes the received data and passes it to the model, the model outputs predictions, and the application responds to the web request with those predictions.
I recommend the following resources:
- Quickstart guide to plumber (R package)
- Quickstart guide to Flask (Python package)
- Introduction to FastAPI (Python package)
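For example, here is a minimal sketch of such an endpoint using FastAPI. The file name app.py, the model file model.joblib, and the /predict route are all made-up names; the model is assumed to be a previously trained scikit-learn pipeline serialized with joblib (e.g. the classifier from Scenario 2).

```python
# app.py
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical path to a previously trained, serialized pipeline
model = joblib.load("model.joblib")

class Documents(BaseModel):
    texts: list[str]

@app.post("/predict")
def predict(docs: Documents):
    # Run the received documents through the model and return predictions as JSON
    predictions = model.predict(docs.texts)
    return {"predictions": predictions.tolist()}
```

Run it with uvicorn app:app and POST JSON such as {"texts": ["some document"]} to /predict to get the predictions back.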
Making a predictive model available as an API is part of productionizing the model, but there are many other parts (scalability, latency, dealing with concept drift, dealing with bias, ease of redeployment) involved in ML in production. I recommend the following resources for learning about it:
Working with data
I recommend the following resources for learning how to work with data:
- R
  - Hands-On Programming with R (free book)
  - R for Data Science (free book)
  - Statistical Inference via Data Science: A ModernDive into R and the Tidyverse (free book)
- Python
  - Think Python 2nd Ed. (free book)
  - Think Stats 2nd Ed. (free book, an introduction to Probability and Statistics for Python programmers)
  - Machine Learning with Python Cookbook
  - Pandas for Everyone: Python Data Analysis