Opinion News in Top Stories
Earlier this year, I wrote a post about news stories that are shown in carousels in Google: Top Stories Are Chosen By Importance Scores
The patent I wrote about in that post told us that Google may attempt to show opinion pieces related to topics that were being identified as top stories, but it didn’t tell us much about those opinion pieces. A patent that appears to be related is about identifying opinions in documents, such as news articles, that could be ranked based on some importance measure.
A patent granted earlier this month tells us about how Google may use machine learning to identify opinions in documents on the Web.
In more detail, this patent is about systems and methods that employ one or more machine-learned models to classify portions of documents as opinion or not opinion, so that portions classified as opinion can be considered for inclusion in an informational display.
The description for the patent starts by telling us that “understanding of content (e.g., textual content) contained in a document by a computing system is a challenging problem.”
It points out that this is especially true in the professional news journalism space, where articles are typically written in high-quality language and syntax, and that computing systems understand only very little about the actual content of those news articles. So this new patent is telling us that it is focusing on news information.
It also tells us that determining how one article compares to related news articles written by other journalists is an even more challenging task. The top stories patent didn’t really tell us how it might choose one article over another one to display in carousels, so it is good to see more about this.
We are told that production systems that select and provide documents (e.g., news articles) to searchers do so based almost exclusively on shallow content signals (such as salient terms and entities, etc.) and/or metadata (e.g., how important a publisher is, when the content was published (e.g., relative to other articles), references (e.g., links) between articles, etc.).
The patent identifies several problems, telling us that such production systems typically do not rely on a nuanced understanding of the actual content of the articles themselves.
We are told that the solution involves many research areas related to the computerized understanding of document content, and that work in the area of subjectivity detection attempts to identify subjective text.
These kinds of subjectivity detection techniques will often use either a lexicon or a model trained using a lexicon, and unfortunately, the use of such a lexicon can be inherently limiting.
We are also told that subjectivity in itself is not particularly informative. As an example, “This is great!” is a subjective sentence, but by itself, it is not very informative.
Sentiment analysis will try to capture sentiment (i.e., positive, negative, or neutral) of text generally, or the sentiment about some particular aspect/topic/entity (e.g., positive or negative view on an international treaty) that content may be about.
But, we are told that sentiment analysis at the sentence level does not provide any understanding of what the text actually says.
And that sentiment analysis at the aspect/topic/entity level can be more insightful, but it has restrictions:
That aspect/topic/entity must exist in some knowledge base and it can be difficult to determine how two aspects/topics/entities relate to each other.
Also, work in the related area of stance detection is usually about finding for or against a specific topic (such as in a proposed legislative action).
However, the resulting systems only work for the topics they are trained on and can have limited applicability to new or developing topics.
This patent attempts to provide a solution in the face of all of these problems.
It starts with a machine-learned opinion classification model that is configured to classify portions of documents as either opinion or not opinion.
Using that model, several operations are performed.
A first step may involve obtaining data descriptive of a document that comprises one or more portions.
The next step is inputting at least one portion of the document into the machine-learned opinion classification model.
The final step is receiving, as an output of the model, a classification of that portion of the document as being opinion or not opinion.
This patent can be found at:
Machine learning to identify opinions in documents
Inventors: Boris Dadachev and Kishore Papineni
Assignee: Google LLC
US Patent: 10,832,001
Granted: November 10, 2020
Filed: April 26, 2018
Example aspects of the present disclosure are directed to systems and methods that employ a machine-learned opinion classification model to classify portions (e.g., sentences, phrases, paragraphs, etc.) of documents (e.g., news articles, web pages, etc.) as being opinions or not opinions. Further, in some implementations, portions classified as opinions can be considered for inclusion in an informational display. For example, document portions can be ranked according to the importance and selected for inclusion in an informational display based on their ranking. Additionally or alternatively, for systems which access and consider multiple documents, the portions of a document that are classified as opinion can be compared to similarly-classified portions of other documents to perform document clustering, to ensure diversity within a presentation, and/or other tasks.
How to Identify Opinions in Documents
This patent tells us about systems and methods employing a machine-learned opinion classification model used to classify portions such as sentences, phrases, paragraphs, etc. of news articles, web pages, and other documents, as being opinions or not opinions.
Those portions that have been classified as opinions may be included in an informational display.
Document portions may be ranked according to importance and then selected for inclusion based on their ranking.
The patent tells us that when considering multiple documents, the portions of documents classified as opinions may be compared to similarly-classified portions of other documents to perform document clustering and that this would help to ensure diversity within a presentation, and in other tasks.
Classifying Opinion and Importance
So this computing system would have two main components:
- A machine-learned opinion classification model, which obtains portions from a document and classifies them as opinionated or not opinionated
- A summarization algorithm, which would rank portions from a document by a portion importance approach (as well as possibly other example criteria such as the ability to stand alone without further context)
How Opinions would be Displayed in Search Results
These two components may be used to show a searcher a document portion that is both important and opinionated.
An example would be showing a display identifying some documents, with a summary or “snippet” for each, where each snippet is taken from a portion of the document classified as an opinion and/or ranked as having high importance.
That snippet could be used when providing search results in response to a query, as part of a “top stories” or “what to read next” feature for a news aggregation/presentation application, or in other scenarios, that could include a presentation of several different news articles that relate to the same overarching news “story.”
This patent would leverage machine learning to generate improved summaries or “snippets” of documents such as news articles for a user.
By providing snippets that better reflect actual opinionated content, instead of generic facts or quotes, the searcher can more quickly comprehend the true nature of the document and decide whether she is interested in reading the document in full.
A searcher can avoid loading and reading documents that she is not interested in reading.
By identifying and comparing portions of documents classified as actual opinionated content, informational displays can be offered with improved diversity, structure, and other features taking into account the actual content of the documents.
The searcher can avoid reading articles featuring redundant opinions.
And opinions, as seen in editorials, “op-eds,” commentaries, and the like, fulfill an essential role in the news journalism ecosystem.
They give editorial teams, outside experts, and ordinary citizens a voice and a chance to participate in the public debate on a given issue or event.
This can help the public see different sides of a story and break filter bubbles.
An opinion can include a viewpoint or conclusion that an author of a document explicitly writes into that document.
Sometimes, opinions or opinionated portions of a document can be less explicitly recognized as such.
For example, a rhetorical question can be a form of opinion depending on how it is phrased (for instance, when it is sarcastic).
As another example, a summary of facts can be an opinion or indicate an opinion depending on:
- Which portions of the overall facts are selected
- The order in which those facts are presented
- Interstitial wording
- Other factors
What will This Information Display Look Like, and How will We Get There?
The patent tells us that determining whether a portion of a document is opinion is a challenging task and requires a nuanced understanding of human communication.
The computer system aggregating and presenting news articles to a searcher may include or show a snippet for a particular article.
The snippet may mimic or mirror the headline of the article.
In other instances, the snippet may be from the output of a generic multi-document extractive summarization algorithm.
We are told that a generic summarization algorithm typically does not consider the subjectivity of a piece of text.
So, in trying to highlight and summarize the subjective portions of an opinion piece, a generic summarization algorithm would typically not be able to identify a summary that effectively conveys the actual opinion put forth by the article.
The patent tells us that stance detection would be extremely useful to enable a better understanding of stories (using clusters of articles from different publishers on the same news event).
But stance is hard to define and, as such, hard to quantify.
Because of these challenges, the present patent recognizes that news documents typically come in two main flavors, neutral reporting of news events and opinions cast on these events.
Being able to separate opinionated text from neutral text in news articles can be useful to filter out non-stance carrying text, which can assist in performing stance detection.
The present disclosure can be useful in performing stance detection or other related tasks by:
- Identifying opinionated portions in documents
- Relating opinionated portions inside the document and/or across other documents (e.g., that relate to the same story)
- Surfacing opinionated snippets or quotes to users of a news aggregation/presentation application and/or in the form of search results
- Identifying portions of a document that convey opinion (e.g., as contrasted with quotes and facts)
The patent tells us that this classification model will be used to filter out portions that are “un-interesting” for stance detection purposes, such as quotes and facts.
Classifying Portions of Documents for Opinions
The computing system described in this patent can input each portion into the opinion classification model and the model can produce a classification for the input portion.
The types of documents it will classify can include:
- News articles
- Web pages
- Transcripts of a conversation (e.g., interview)
- Transcripts of a speech
- Other documents
Portions of documents that might be classified would include:
- Sentences
- Phrases
- Pairs of consecutive sentences
- Paragraphs
These portions can be overlapping or non-overlapping.
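To make the idea of overlapping and non-overlapping portions concrete, here is a minimal sketch in Python. The function names and the naive sentence splitter are my own assumptions for illustration; the patent does not say how documents are split into portions.

```python
import re

def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def sentence_pairs(sentences, overlapping=True):
    """Return pairs of consecutive sentences as portions.

    overlapping=True  -> (s0, s1), (s1, s2), ...
    overlapping=False -> (s0, s1), (s2, s3), ... (a trailing unpaired sentence is dropped)
    """
    step = 1 if overlapping else 2
    return [
        " ".join(sentences[i:i + 2])
        for i in range(0, len(sentences) - 1, step)
    ]

doc = ("The treaty was signed on Monday. It is a short-sighted deal. "
       "Critics agree. Markets were flat.")
sentences = split_sentences(doc)
print(sentence_pairs(sentences, overlapping=True))
print(sentence_pairs(sentences, overlapping=False))
```

With overlapping pairs, every sentence except the first and last appears in two portions, so each sentence is seen alongside the context on both sides of it.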
Determining Opinion in Documents
The patent tells us that “opinionatedness” (i.e., the degree to which something is or conveys opinion) is somewhat subjective and extremely topic- and context-dependent.
And it tells us that for this reason, and because simpler methods have clear limitations, the systems and methods from the patent use a machine learning approach.
A drawback of existing approaches is that the use of a pre-defined lexicon does not sufficiently account for the context in which a portion is presented.
As an example, the term “short-sighted” is clearly an opinionated word in a political article but is probably not in a medical article.
And as another example, “dedicated” is opinionated when qualifying a person but not when qualifying an object.
So the basic use of a lexicon to identify opinionated portions does not appropriately capture or account for context.
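A small sketch can show why a lexicon alone falls short. The lexicon below is hypothetical and contains only the two words from the examples above, and the example sentences are my own. This lookup is not the patent's method; it is the kind of baseline the patent argues against.

```python
# Hypothetical subjectivity lexicon, built from the two words discussed above.
SUBJECTIVE_WORDS = {"short-sighted", "dedicated"}

def lexicon_is_opinion(sentence):
    """Flag a sentence as opinionated if it contains any lexicon word."""
    tokens = {tok.strip(".,!?;:\"'").lower() for tok in sentence.split()}
    return bool(tokens & SUBJECTIVE_WORDS)

examples = [
    "The senator's short-sighted plan will hurt voters.",  # political: opinion
    "Short-sighted patients may need corrective lenses.",  # medical: not opinion
    "She is a dedicated teacher.",                         # person: opinion
    "The server uses a dedicated power supply.",           # object: not opinion
]
for s in examples:
    print(lexicon_is_opinion(s), "-", s)
```

The lookup flags all four sentences, even though only the first and third express an opinion, because it never looks at the words around the match.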
The use of a machine-learned model as described herein provides superior results that evince contextual and/or topic-dependent understanding and classification.
As one example, the machine-learned opinion classification model can include one or more artificial neural networks (“neural networks”).
Some example neural networks can include:
- Feed-forward neural networks
- Recurrent neural networks
- Convolutional neural networks
- Other forms of neural networks
We are also told that neural networks can include layers of hidden neurons and can, in such instances, be referred to as deep neural networks.
And the machine-learned opinion classification model can include an embedding model that encodes a received portion of the document.
The embedding model can produce an embedding at its final layer, or at a layer that is close to final but is not the final layer of the model.
It can encode information about the portion of the document in an embedding dimensional space.
The machine-learned opinion classification model may also include a label prediction model generating a classification label based on the encoding or embedding.
The embedding model can be or include a recurrent neural network (e.g., a unidirectional or bidirectional long short-term memory network) while the label prediction model can be or include a feed-forward neural network (e.g., a shallow network with only a few layers).
The embedding model can be or include a convolutional neural network that has one-dimensional kernels designed over words.
The machine-learned opinion classification model can include or leverage:
- Sentence embeddings
- Bag-of-word models (e.g., unigram, bigram and/or trigram levels)
- Other forms of models
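As a rough illustration of the bag-of-words option, here is one way unigram, bigram, and trigram counts could be extracted from a portion. The function name and the whitespace tokenization are assumptions on my part; the patent does not describe its feature extraction at this level.

```python
from collections import Counter

def ngram_bag(text, max_n=3):
    """Count word n-grams for n = 1..max_n (unigrams, bigrams, trigrams)."""
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

features = ngram_bag("this deal is a bad deal")
print(features["deal"])      # unigram count
print(features["bad deal"])  # bigram count
print(features["is a bad"])  # trigram count
```

Counts like these could then be fed to a classifier as features. Bigrams and trigrams keep a little of the word order that plain unigram counts throw away.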
The opinion classification model could be a binary classifier, which means that it could produce a label of “Opinion” or “Not Opinion” for each portion of the document input into the model.
Or the opinion classification model can be a multi-class classification model.
For example, the classification model can output one of the following three classes:
- Not Opinion
- Reported Opinion (e.g., representing the opinion of a third party such as, for example, quoting someone’s opinion)
- Author Opinion (e.g., representing the opinion of the author of the document)
The portions that are classified as “Opinion” or “Author Opinion” may be considered for inclusion in an informational display (e.g., in the form of opinionated snippets).
It is possible that additional and/or different labels are used rather than those.
For example, additional labels may be used (e.g., in addition to “Opinion” or “Not Opinion”) that account for cases in which a portion of the document is challenging to classify (e.g., exists on the border between opinion and not-opinion) or contains a mix of fact and opinion.
These labels could be such things as:
- A “May Be Opinion” label
- A “Possible Author’s Perspective” label
- A “Mixed Fact and Opinion” label
The classification model might output a classification score and a label might then be generated based on the classification score (e.g., by comparing the score to a threshold).
The classification score might also be referred to as a confidence score.
Usually, a larger classification score indicates that the corresponding portion is more opinionated or more probably opinionated.
For example, a classification model may output a classification score ranging from 0 to 1, with 0 corresponding to wholly not opinion and 1 corresponding to wholly opinion.
Following that, a classification score of 0.5 may indicate a mix of opinion and not opinion.
Or, the classification model may output a single score and a label may be generated based on a single classification score (e.g., by comparing the score to one or more thresholds).
Or, the classification model may output a respective score for each available label and one or more labels may be applied to the portion based on the multiple scores (e.g., by comparing each respective score to a respective threshold and/or by selecting the label that received the highest score).
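The two ways of turning scores into labels can be sketched as follows. The threshold values and function names are illustrative assumptions; the patent says only that scores may be compared to one or more thresholds, or that the highest-scoring label may be selected.

```python
def label_from_score(score, low=0.4, high=0.6):
    """Map a single classification score in [0, 1] to a label.

    The two threshold values are assumptions chosen for illustration.
    """
    if score >= high:
        return "Opinion"
    if score <= low:
        return "Not Opinion"
    return "Mixed Fact and Opinion"

def label_from_scores(scores):
    """When the model scores every label, pick the label with the highest score."""
    return max(scores, key=scores.get)

print(label_from_score(0.92))
print(label_from_score(0.50))
print(label_from_scores({
    "Not Opinion": 0.10,
    "Reported Opinion": 0.25,
    "Author Opinion": 0.65,
}))
```

A single score with two thresholds leaves room for a borderline label, while one score per label makes it easy to separate author opinion from reported opinion.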
Additional features may also be used (e.g., provided as input alongside the document itself to the model or used separately as additional selection logic).
Examples of additional features could include:
- A lexicon
- Main article topic(s)
- Surrounding context
- Story context
- Document type (e.g., news article vs. medical academic paper)
- Context on publisher and/or journalist
- Other features
As an example of such selection logic, the system might select only portions of a document that are classified as opinion and that also contain at least two strongly subjective words according to a subjectivity lexicon.
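That selection rule could look something like the sketch below. The lexicon entries, the function names, and the example portions are all hypothetical; only the rule itself (classified as opinion, plus at least two strongly subjective words) comes from the patent.

```python
# Hypothetical lexicon of strongly subjective words (illustrative only).
STRONG_SUBJECTIVE = {"reckless", "disastrous", "brilliant", "shameful"}

def count_strong_words(text):
    """Count tokens that appear in the strongly subjective lexicon."""
    tokens = (tok.strip(".,!?;:\"'").lower() for tok in text.split())
    return sum(tok in STRONG_SUBJECTIVE for tok in tokens)

def select_portions(portions, min_strong_words=2):
    """Keep portions labeled "Opinion" by the model that also contain
    at least `min_strong_words` strongly subjective words."""
    return [
        text
        for text, label in portions
        if label == "Opinion" and count_strong_words(text) >= min_strong_words
    ]

# (text, label) pairs; the labels stand in for the model's output.
portions = [
    ("This reckless policy will be disastrous.", "Opinion"),
    ("The policy seems unwise.", "Opinion"),
    ("He called it a reckless, shameful vote.", "Not Opinion"),
]
print(select_portions(portions))
```

Here the lexicon acts as a second filter on top of the model rather than as the classifier itself, so only the first portion is kept.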
Opinion Training Datasets
The machine-learned opinion classification model may be trained based on many different training schemes or training datasets.
The patent tells us of two training datasets.
- A first training dataset can include opinion pieces from a news corpus, where opinion labels are applied at the document level
- And a second, better quality training dataset that includes documents with portions that have been individually and manually labeled using crowdsourcing
For example, the labels can be applied according to two classes:
- Sentence reflects the opinion of the author
- Everything else
Alternatively, the labels can be applied according to the three classes described above, which include a distinction between author opinion and reported opinion.
The first training dataset could be used to improve or seed the classification model (e.g., to learn an embedding, to leverage labeled but noisy data).
The second training dataset would enable the training of a higher precision classifier.
The machine-learned opinion classification model could be trained on only the second training dataset.
A pre-trained language processing model that has been trained on other tasks may be re-trained on the first and/or second training datasets to generate the opinion classification model.
This pre-trained language processing model could include the Word2vec group of models.
The first training dataset may be generated by identifying opinion articles by the application of various search rules.
This process could extract opinion and non-opinion articles from a news corpus, by looking at keywords such as “opinion” or “oped” in the URL or the body of the article.
For the first training dataset, all sentences from the identified articles may be extracted.
The label of each document may then be assigned to each portion (e.g., sentence) from that document. This would provide a relatively simple and fast technique to generate a large first training dataset.
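The construction of that first dataset can be sketched in a few lines. This is a simplified illustration under my own assumptions: it checks only the URL (the patent also mentions the article body), it uses a naive sentence splitter, and the example articles and URLs are made up.

```python
import re

OPINION_KEYWORDS = ("opinion", "oped")  # the keywords mentioned above

def is_opinion_article(url):
    """Rule-based document label: does the URL contain an opinion keyword?"""
    return any(keyword in url.lower() for keyword in OPINION_KEYWORDS)

def weakly_label(articles):
    """Copy each article's document-level label onto all of its sentences."""
    rows = []
    for article in articles:
        label = "Opinion" if is_opinion_article(article["url"]) else "Not Opinion"
        for sentence in re.split(r"(?<=[.!?])\s+", article["body"].strip()):
            if sentence:
                rows.append((sentence, label))
    return rows

# Made-up articles for illustration.
articles = [
    {"url": "https://news.example.com/opinion/tax-plan",
     "body": "The plan passed on Tuesday. It is a mistake."},
    {"url": "https://news.example.com/world/tax-plan",
     "body": "The plan passed on Tuesday. Lawmakers voted 52 to 48."},
]
for row in weakly_label(articles):
    print(row)
```

Notice that the factual sentence "The plan passed on Tuesday." is labeled "Opinion" when it comes from the opinion article, which shows how noisy document-level labels become once they are pushed down to sentences.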
This first training dataset would have a drawback due to the way it is constructed: the resulting classification model learns to predict whether the sentence is likely to be part of an opinion piece, rather than whether it expresses an opinion.
This is why training on the more fine-grained second training dataset would result in a significant improvement.
The patent tells us that sometimes training of the model may be performed only using the second training dataset and not the first.
Collecting the second training dataset may also involve gathering additional data on a number of related aspects:
- Whether the sentence is boilerplate (“Sign up to receive our awesome newsletter!”)
- Whether the opinion expressed is the author’s own (as opposed to reported opinions in, for example, quotes)
- Whether the sentence can stand on its own
- Whether the sentence can be understood only knowing the article title
The example training schemes described would enable the machine-learned opinion classification model to learn how opinions are expressed from a large annotated corpus.
This model would take the entire portion as input (e.g., models that include a recurrent neural network that takes each word as input sequentially), and the model would be trained to understand and leverage structural information included in the portions, including sentence structure.
The training system would obtain document data that includes opinion labels.
The training computing system may determine correlations between aspects of the document data such as:
- Sentence structure
- Word choice
- Document features
- Opinion classifications
This training system could iteratively update correlations on receiving new document data to form the opinion classification model.
Once the model is trained, the opinion classification model can be used to identify opinions within documents.
The machine-learned classification model would not be limited to a narrow domain (such as a specific dictionary or lexicon) but instead can process any language inputs.
The opinion classification model would be easily extensible to other languages (by generating training datasets for such other languages).
Displaying Opinion Information from Documents
According to another aspect of the patent, this system could generate a snippet or summary for the document based on classifications that have been generated by the opinion classification model.
The system may perform a ranking process to determine which sentence would serve best as a standalone snippet.
That selected sentence should be relevant to the story and read well without context.
The ranking can be generated based on:
- A summarization of each respective portion
- Identification of respective entities mentioned by each respective portion
- A respective classification score assigned to each respective portion by the machine-learned opinion classification model
Determining the importance can be done by either looking at the document only or by looking at clusters of documents.
Looking at the clusters allows the system to diversify the points of view that are highlighted in the snippets.
The system can perform one or more document summarization algorithms to rank portions of the document in selecting a snippet.
The document summarization algorithm may choose a set of candidate portions for each document, taking into account:
- Portion importance
- Portion opinionatedness (e.g., as reflected by the portion’s corresponding classification score)
- Snippet diversity across articles in the cluster (e.g., story)
The system can perform a standard document summarization algorithm, but it may restrict the input to the algorithm to only portions labelled as opinions by the classification model.
The summarization algorithm can discard all sentences unsuitable as snippets, and then select the most opinionated sentence, as could be indicated by the scores produced by the classification model.
A document summarization algorithm can also combine sentence importance and opinionatedness to re-rank sentences, for example by sorting sentences in decreasing order of importance_score * opinion_confidence.
If no sentence is deemed opinionated, a non-opinionated sentence may be returned.
This approach is the most flexible and allows additional heuristics (e.g., the snippet could be restricted to the top three portions according to the summarization algorithm).
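The re-ranking approach might be sketched like this. The dictionary keys, the 0.5 opinion threshold, and the choice to apply the top-three restriction by importance before re-ranking are my own reading of the description, not details confirmed by the patent.

```python
def pick_snippet(sentences, opinion_threshold=0.5, top_k=3):
    """Choose a snippet by re-ranking on importance_score * opinion_confidence.

    `sentences` is a list of dicts with hypothetical keys:
    "text", "importance_score", "opinion_confidence".
    """
    # Heuristic: only consider the top_k sentences by importance.
    candidates = sorted(
        sentences, key=lambda s: s["importance_score"], reverse=True
    )[:top_k]
    if not candidates:
        return None

    opinionated = [
        s for s in candidates if s["opinion_confidence"] >= opinion_threshold
    ]
    if not opinionated:
        # Nothing is deemed opinionated: fall back to the most important sentence.
        return candidates[0]["text"]

    best = max(
        opinionated,
        key=lambda s: s["importance_score"] * s["opinion_confidence"],
    )
    return best["text"]

# Made-up scores for illustration.
sentences = [
    {"text": "The bill passed 52 to 48.",
     "importance_score": 0.9, "opinion_confidence": 0.1},
    {"text": "This bill is a gift to lobbyists.",
     "importance_score": 0.7, "opinion_confidence": 0.9},
    {"text": "Frankly, nobody read it.",
     "importance_score": 0.4, "opinion_confidence": 0.8},
    {"text": "Sign up for our newsletter!",
     "importance_score": 0.1, "opinion_confidence": 0.6},
]
print(pick_snippet(sentences))
```

Multiplying the two scores means a sentence has to be both important and opinionated to win, and the fallback keeps the system from returning nothing for a purely factual article.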
So, the patented process can be used for selecting snippets that reflect the author’s opinion for opinion pieces in a news aggregation/presentation application.
Opinion pieces can be displayed:
- In opinion blocks (e.g., alongside additional opinion pieces for a given news story)
- Alongside non-opinion pieces
- Standing alone
One goal would be to provide snippets enticing the users to read an opinion piece.
Other goals would include:
- A way to filter sentences that cast a light on authors’ opinions
- Discarding factual sentences and other un-interesting sentences (e.g., quotes)
This opinion classification model also provides a way to select only neutral or factual sentences for non-opinion articles (e.g., by removing sentences labeled as opinion).
The respective opinion sentences in articles across an entire news story can be clustered to understand which articles share the same point of view.
This can provide a better understanding of individual articles and a comparison between articles on the same news event or overarching story.
By isolating the point-of-view/opinions of an author, the system can determine how perspectives are shared or differ between several authors and newspapers, to allow for clustering, diversification, and/or other tasks.
By clustering based on opinionated portions, a more nuanced understanding of different positions concerning a subject can be ascertained.
A typical sentiment analysis or stance detection may uncover a fixed and finite set of sentiments (e.g., generally “for”, “against”, or “neutral”).
Clustering based on opinion might reveal six or seven overlapping but distinct positions about a certain subject.
The opinion identification and snippet selection techniques described here do not require a fixed, finite set of positions to train on; instead, a more natural and nuanced clustering of opinions can be obtained.
Thus, the patented process can leverage machine learning to provide improved summaries or “snippets” of documents such as news articles to a searcher.
With snippets that better reflect actual opinionated content (e.g., rather than generic facts or quotes), the user can more quickly comprehend the true nature of the document and ascertain whether she is interested in reading the document in full.
A searcher can avoid loading and reading documents that she has no interest in reading.
By identifying and comparing portions of documents that have been classified as actual opinionated content (e.g., rather than generic facts or quotes), the patented process can provide informational displays with improved diversity, structure, and other features taking into account the actual content of the documents.
Searchers can more easily ascertain a more diverse representation of the different stances included in the documents and can avoid reading articles featuring redundant opinions.
Opinion News Takeaways
This patent provides more details about how a machine learning approach might be used to identify opinion news to potentially show with top stories or in Google News results.
I like the idea that such opinions are likely to be written by people who aren’t necessarily journalists but could be consumers or people in industries that are involved in the subject matter of a story (such as an employee of a company, a professional athlete, or a scientist).
If you are involved in writing about news on different topics, writing opinion pieces may be a way to have your content shared with a large group of people. People are interested in such opinions, and they are worth sharing.
It’s good seeing Google find a way to identify opinion news and incorporate those opinions into content such as top stories in the news.