Methodology

How Klyptra measures media bias.

This page discloses every methodological decision. Each layer is anchored in research, and every finding is traceable through verbatim evidence. It is written especially for readers who are initially skeptical of our results.

What does Klyptra measure?

Language and framing in the news — across 6 dimensions with 27 sub-patterns.

Who decides the values?

Three language models in parallel, with verbatim evidence for every finding.

How reliable are the values?

Three models judge independently; spread and confidence are openly reported per analysis.

The scale

0 is not zero. 10 is not perfect.

The six dimensions and the 0–10 scale are Klyptra's operationalization of the Media Bias Taxonomy (Spinde et al. 2023) — not a direct part of BABE, which annotates binarily (biased/neutral). The scale is calibrated against BABE-style expert benchmarks. The five bands are not equally wide: news-agency reporting (dpa, AFP, Reuters) typically hits 8–9 — a value of 10 would be pure fact-listing with no narrative selection at all, practically unreachable. Downward, by contrast, there is more room: the propaganda band (0–2.9), three points wide, is the largest.

9 – 10 · 1 point · sehr_objektiv
Near-neutral reporting. No evaluative adjectives, balanced sources, speculation clearly marked as such. News-agency level (dpa, AFP, Reuters).

7 – 8.9 · 2 points · objektiv
Solid journalistic standards. Occasional evaluations detectable, but transparently marked as opinion. Multiple perspectives represented.

5 – 6.9 · 2 points · moderat_biased
Clearly recognizable editorial line. Word choice with a tendency, one-sided source selection, but no systematic distortion.

3 – 4.9 · 2 points · stark_biased
Consistently one-sided presentation. Loaded words without labeling, omission of exculpatory facts, emotional charging.

0 – 2.9 · 3 points · propaganda
Facts are distorted, the other side not quoted or only as a straw man, sensational framing dominates. Widest band: there is more room downward than upward.

The labels in this table are exactly the strings the analyzer outputs in the JSON and the permalink UI (in German) — no UI mapping in between.
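As a minimal sketch, the mapping from a score to one of the five bands above could look like this. The function name `band_for_score` and the handling of values that fall between two printed ranges (for example 8.95) are assumptions made for illustration; only the label strings and the band boundaries come from the table.

```python
# Lower bounds of the five bands, highest first. The labels are the
# analyzer's output strings; everything else here is illustrative.
BANDS = [
    (9.0, "sehr_objektiv"),
    (7.0, "objektiv"),
    (5.0, "moderat_biased"),
    (3.0, "stark_biased"),
    (0.0, "propaganda"),
]


def band_for_score(score: float) -> str:
    """Return the band label for a score on the 0-10 scale."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score out of range: {score}")
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: 0.0 is the lowest bound")


print(band_for_score(8.4))  # objektiv
print(band_for_score(2.9))  # propaganda
```

Using lower bounds only means a value such as 8.95 is assigned to the band below 9; whether the analyzer rounds first is not stated on this page.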

The six dimensions — in depth

What each dimension measures and where it comes from.

Framing

Media Bias Taxonomy (Spinde et al. 2023) · Framing bias

Which perspective is declared the narrative norm? Who is subject, who is object?

Operationalization

Active/passive constructions with political asymmetry
Order in which actors are named
Implicit attribution of blame through verb choice

Word choice

Media Bias Taxonomy (Spinde et al. 2023) · Lexical bias

Which words carry judgments without marking them? Loaded language in the narrow sense.

Operationalization

Verbatim identification of evaluative terms
Comparison with neutral synonyms
Density per 1000 words
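
The density figure in the last item can be illustrated with a short sketch. It assumes that the evaluative terms have already been identified, that they are single words, and that words are split on non-word characters; the function name `density_per_1000_words` is hypothetical.

```python
import re


def density_per_1000_words(text: str, evaluative_terms: list[str]) -> float:
    """Occurrences of the given single-word terms per 1000 words of text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    terms = {term.lower() for term in evaluative_terms}
    hits = sum(1 for word in words if word in terms)
    return 1000 * hits / len(words)


sample = "The scandalous plan was a bold move by the minister"
print(density_per_1000_words(sample, ["scandalous", "bold"]))  # 200.0
```

Normalizing to 1000 words keeps short and long articles comparable: two loaded terms in a ten-word sentence weigh far more than two in a full-length report.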

Source diversity

Media Bias Taxonomy (Spinde et al. 2023) · Selection/Coverage

How many voices are quoted directly, how politically broad is the spectrum?

Operationalization

Number of directly quoted people / institutions
Political positioning of those quoted
Ratio of primary to secondary sources

In the book (Spinde 2025, Ch. 2), source/selection bias is a reporting-level construct that is, strictly speaking, measured across articles. Klyptra approximates it on the single text; the full cross-outlet analysis sits at Tier 2 (not in the score).

Fact / opinion

Media Bias Taxonomy (Spinde et al. 2023) · Epistemological bias

Is evaluation linguistically separated from observation — or sold as fact?

Operationalization

Marking of commentary (“claims”, “according to X”)
Forecasts vs. facts
Subjunctive discipline

Fact / opinion is a dimension of its own because German news language interweaves evaluation and observation especially tightly at the syntactic level (nominalization, modal verbs, subjunctive I/II); the English BABE annotation covers this only indirectly.

Completeness

Media Bias Taxonomy (Spinde et al. 2023) · Spin/Omission

What is left out? Which relevant background or counter-positions are missing?

Operationalization

Recognizably missing counter-positions to central claims
One-sided fact selection (cherry-picking)
Context gaps that distort the framing

Omission/spin bias is in part reporting-level in the book (what is missing across several articles). Klyptra assesses the gaps recognizable in the single text; the cross-article level is Tier 2.

Emotional balance

Media Bias Taxonomy (Spinde et al. 2023) · Phrasing/Sentiment

How strongly is it emotionally charged? Sensational or outrage language?

Operationalization

Exclamation-mark density in headline and lead
Escalation vocabulary (“scandal”, “quake”, “madness”)
Adjective intensity

27 sub-categories

Beneath every dimension lies a concrete pattern.

The six dimensions are the top-level axes. Beneath them, Klyptra maintains 27 specific bias patterns following the BiasScanner taxonomy (Menzner & Leidner 2024). Per article, 0–N patterns are identified — each with verbatim evidence, position in the text and a strength assessment.

Sub-categories are qualitative markers, not numeric sub-scores. If a top dimension such as “Completeness” is rated low, the sub-layer shows which concrete pattern carries the finding — e.g. Cherry-Picking or Whataboutism.

Word choice

word_choice

4 patterns

Lexical level — words that evaluate without marking the evaluation as such.

Word Choice Bias
Example: A “migrant” is consistently called an “intruder”.
Emotional Sensationalism
Example: “Nightmare scenario”, “shock diagnosis”, “mood of doom” as routine vocabulary.
Discrimination Bias
Example: Generalization about groups (“typical of …-migrants”), mentioning origin without relevance.
Smear / Praise Bias
Example: “scandalous attempt” for one party vs. “bold initiative” for the other, for the same kind of action.

Framing

framing

6 patterns

Narrative constructs — how a matter is framed in storytelling, independent of individual words.

Straw Man
Example: “The left wants every migrant to get a house immediately.” A caricature of the opposing position is attacked.
False Dichotomy
Example: “Either we cut taxes, or the country collapses.” Two options suggested where many exist.
False Analogy
Example: “Just like back in 1933 …” for a current political debate with a loose connection.
Insinuative Questioning
Example: “Why is the chancellor silent on the accusations?”, asked without the accusations themselves being substantiated.
Moving Goalposts
Example: “5% growth was expected, now it's 8%, so a failure.” The yardstick adjusted after the result.
In-Group / Out-Group Bias
Example: Consistent “we Germans” vs. “them”: collective assignment of guilt or virtue.

Source diversity

source_diversity

3 patterns

Source quality — who is heard, how they are quoted, whether the voices are classifiable.

Source Selection Bias
Example: Only one party's press office is quoted; the other side not at all or only paraphrased.
External Validation Bias
Example: A lobbyist is introduced as an “independent expert” without naming their interests.
Vague Attribution
Example: “Circles report …”, “according to insiders …” carry the central argument of the piece.

Fact–opinion separation

fact_opinion_separation

5 patterns

Linguistic discipline — whether evaluation is marked as evaluation or sold as fact.

Opinionated Bias
Example: “The government's catastrophic policy …”: an evaluative adjective in news mode.
Speculation Bias
Example: “This will undoubtedly end in disaster.” Forecast without subjunctive, without source.
Unsubstantiated Claims
Example: “Millions are affected”: a number without evidence, source or method.
Projection Bias
Example: “They only care about power.” Attribution of motive as a statement of fact.
Circular Reasoning
Example: “It is illegal because it breaks the law.” The justification repeats the claim.

Completeness

completeness

4 patterns

What is missing — relevant context, counter-arguments, evidence that goes unmentioned.

Cherry-Picking
Example: One study is cited; three methodologically comparable studies with the opposite result are not.
Anecdotal Evidence
Example: “Ms. M. from Hamburg says …” carries a trend finding; statistical data are missing.
Whataboutism
Example: Consistent deflection from the main accusation onto the behavior of other actors.
False Balance
Example: Climate scientists and climate deniers are presented as equivalent voices, although the evidence base is asymmetric.

Emotional balance

emotional_balance

5 patterns

Affective charge — how strongly and in which direction the text colors emotionally.

Ad Hominem
Example: “The incompetent minister …”: the person attacked instead of the argument refuted.
Causal Misunderstanding
Example: “Since X has governed, Y has fallen, so X is to blame.” Correlation as causation, without a mechanism.
Generalization
Example: “All politicians lie”, “the media” as a monolithic actor.
Commercial Bias
Example: A product report without distance; editorial content not separated from advertising.
Political Bias
Example: A consistent camp tendency across fact selection, word choice and sources.

Every sub-finding passes the same verbatim gate as the top-level analysis: findings without a quote that can be verified in the original text are discarded. In the JSON export of an analysis, the layer appears as sub_categories[] with parent_dimension, verbatim_quote, char_offset and bias_strength.
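
The verbatim gate for sub-findings can be sketched as a plain substring check. This is an illustration under assumptions, not the analyzer's code: the function name `verbatim_gate`, the choice of the first occurrence as `char_offset`, and the `bias_strength` values `"high"` and `"medium"` are hypothetical; only the four field names come from the export described above.

```python
def verbatim_gate(text: str, findings: list[dict]) -> list[dict]:
    """Keep only findings whose quote occurs literally in the text."""
    verified = []
    for finding in findings:
        quote = finding.get("verbatim_quote", "")
        offset = text.find(quote) if quote else -1
        if offset == -1:
            continue  # no literal match: the finding is discarded
        # Assumption: the offset of the first occurrence is recorded.
        verified.append({**finding, "char_offset": offset})
    return verified


article = "The government's catastrophic policy has failed, critics say."
sub_categories = [
    {"parent_dimension": "fact_opinion_separation",
     "verbatim_quote": "catastrophic policy", "bias_strength": "high"},
    {"parent_dimension": "word_choice",
     "verbatim_quote": "disastrous policy", "bias_strength": "medium"},
]

for kept in verbatim_gate(article, sub_categories):
    print(kept["parent_dimension"], kept["char_offset"])
```

The second finding paraphrases the article ("disastrous" instead of "catastrophic") and is therefore dropped, even though it describes the same passage.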

Actor analysis — PFA-light

Who is put in which light?

Person-Oriented Framing Analysis (Felix Hamborg 2023) extracts the named actors per article and describes how they are talked about. That is more concrete than any holistic score — and makes systematic asymmetries between actors visible.

Per actor

Each identified person (politician, scientist, citizen, …) gets four fields:

mentions_count
How often the actor appears — across all designations (see Coreference).
sentiment_score
Aggregated tone toward the actor on a scale from −1 (negative) to +1 (positive).
framing_devices
Up to five recurring stylistic devices per actor — e.g. “attribution of blame”, “hero narrative”, “victim staging”.
representative_quotes
Three to five verbatim quotes that carry the framing — the verbatim gate ensures each quote appears 1:1 in the text.

Cross-person analysis

From the individual actors, a distribution observation is computed — whether the article treats the people comparably in language or not.

sentiment_disparity

Difference between the most positive and most negative actor sentiment in the article — computed over all actors with mentions_count ≥ 2.

disparity = max(sentiments) − min(sentiments)

balance_assessment

A qualitative classification of the disparity: balanced / slightly_asymmetric / strongly_asymmetric. With fewer than 2 actors with sufficient mentions, the field returns not_applicable — instead of inventing a value.

Aggregation from models

cross_person_analysis is not taken from the language model but recomputed deterministically from the filtered actor data. This keeps the disparity metric consistent with the reported sentiment values, even if the model summarizes them differently internally.
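
A minimal sketch of this recomputation, assuming actor records with the `mentions_count` and `sentiment_score` fields described above. The two thresholds that separate the three balance classes are not published on this page; the values 0.3 and 0.8 below are purely hypothetical placeholders, as is returning `None` for the disparity when the field is not applicable.

```python
# Hypothetical thresholds; the real cut-offs are not documented here.
SLIGHT_THRESHOLD = 0.3
STRONG_THRESHOLD = 0.8


def cross_person_analysis(actors: list[dict]) -> dict:
    """Recompute disparity and balance from the filtered actor data."""
    sentiments = [
        actor["sentiment_score"]
        for actor in actors
        if actor["mentions_count"] >= 2
    ]
    if len(sentiments) < 2:
        # Too few actors with enough mentions: report that, invent nothing.
        return {"sentiment_disparity": None,
                "balance_assessment": "not_applicable"}
    disparity = max(sentiments) - min(sentiments)
    if disparity < SLIGHT_THRESHOLD:
        assessment = "balanced"
    elif disparity < STRONG_THRESHOLD:
        assessment = "slightly_asymmetric"
    else:
        assessment = "strongly_asymmetric"
    return {"sentiment_disparity": disparity,
            "balance_assessment": assessment}


actors = [
    {"mentions_count": 5, "sentiment_score": 0.5},
    {"mentions_count": 3, "sentiment_score": -0.25},
    {"mentions_count": 1, "sentiment_score": -0.9},  # ignored: one mention
]
print(cross_person_analysis(actors))
# {'sentiment_disparity': 0.75, 'balance_assessment': 'slightly_asymmetric'}
```

The third actor has the most negative sentiment but only a single mention, so it does not widen the disparity.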

Coreference documentation

The same person, three names.

Political texts reference the same entity in several ways — by name, by role, by pronoun. Klyptra documents these cross-references explicitly so that mention counts and actor sentiment are not distorted by mere synonymy.

What happens

Per article, a list coreference_documentation.entities[] is reported. Each entity has a canonical_name and a list of all all_mentions[] found in the text.

mention_count is then recomputed deterministically as the sum of non-overlapping substring matches of all mentions in the text — not taken from the language model.

Example

In a report on Ukraine policy:

canonical_name

Volodymyr Zelensky

all_mentions

“Zelensky”
“the Ukrainian president”
“the head of state in Kyiv”

mention_count

Without this resolution, the actor would land in three different buckets — and the sentiment aggregation in the PFA layer would be distorted.
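
The deterministic count can be sketched with a single regular expression over all mentions of an entity. The sample sentence is invented for illustration and is not the report referred to above; the function name `count_mentions`, the longest-first ordering and the case-sensitive matching are assumptions.

```python
import re


def count_mentions(text: str, all_mentions: list[str]) -> int:
    """Count non-overlapping literal matches of any mention in the text."""
    # Longest mention first, so that a longer mention wins over a shorter
    # one that starts at the same position.
    ordered = sorted(set(all_mentions), key=len, reverse=True)
    if not ordered:
        return 0
    pattern = re.compile("|".join(re.escape(m) for m in ordered))
    # finditer scans left to right and never returns overlapping matches.
    return sum(1 for _ in pattern.finditer(text))


# Hypothetical sample text, written for this sketch.
sample = (
    "Zelensky met EU leaders in Brussels. Afterwards, the Ukrainian president "
    "called for faster deliveries. Critics say the head of state in Kyiv is "
    "asking too much, but Zelensky disagrees."
)
mentions = ["Zelensky", "the Ukrainian president", "the head of state in Kyiv"]
print(count_mentions(sample, mentions))  # 4
```

Because each stretch of text is consumed by at most one match, a mention that contains another mention (a full name and a surname, for instance) is counted once rather than twice.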

Pipeline

From submitted text to substantiated result.

Every step is independently testable and logged. Anyone who questions a result can trace the chain back to the individual piece of evidence in the text.

Input

Submitted text

The article text to be checked is submitted directly (paste or file) — optionally with a title and source label. Klyptra analyzes exactly this text, not the outlet behind it.

200–50,000 characters · Title optional · Source optional
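
The input constraints can be expressed as a small validation step. This is a sketch under assumptions: the function name `validate_submission`, the returned dictionary shape, and counting characters with `len` on the raw text (without trimming whitespace) are all hypothetical; only the 200 and 50,000 limits and the optional title and source come from this page.

```python
from typing import Optional

MIN_CHARS = 200
MAX_CHARS = 50_000


def validate_submission(
    text: str,
    title: Optional[str] = None,
    source: Optional[str] = None,
) -> dict:
    """Check the length limits and return the submission as a dict."""
    length = len(text)
    if length < MIN_CHARS:
        raise ValueError(f"text too short: {length} < {MIN_CHARS} characters")
    if length > MAX_CHARS:
        raise ValueError(f"text too long: {length} > {MAX_CHARS} characters")
    # Title and source are optional labels; they do not affect the limits.
    return {"text": text, "title": title, "source": source}


submission = validate_submission("x" * 200, title="Example headline")
print(len(submission["text"]), submission["source"])  # 200 None
```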

Ensemble analysis

3 models in parallel

Three language models independently rate the same text on all six dimensions and extract the detail layers in parallel: sub-category findings, actor mentions with sentiment, and coreference clusters. A chain-of-verification (5 control questions) reduces hallucinations.

gpt-5.4-mini · mistral-large-2512 · deepseek-v4-flash

Aggregation

Median + agreement

Numeric scores: median across the three models. Labels: majority vote. The spread of the individual judgments is reported per dimension as model agreement — it stays visible instead of vanishing into the average.

Median · Majority vote · Agreement report

Verbatim gate & markup

Evidence checked literally

Every finding must carry a quote that appears exactly in the original text — otherwise the evidence is discarded (the assessment remains marked as model-based). Verified evidence is highlighted in the text with a character offset. The result is a permalink with a 30-day TTL.

exact string match · char_offset markup · Permalink 30 days

One analyzer, two uses

The same analyzer code runs in two contexts:

On-demand (via /analyse): the actual analysis of a submitted text — with a permalink, 30-day TTL.
Reference corpus (internal): a continuously co-analyzed corpus serves exclusively to calibrate the scale. It produces no public outlet profiles and does not feed into individual user analyses.

Ensemble

Three models, because none alone is trustworthy.

Language models have their own, model-specific biases. Klyptra picks three models from three different pre-training pipelines so that the blind spots of a single model become visible through the others. Aggregation is by median (numeric) and majority vote (labels) — all models carry equal weight, none is preferred.

GPT-5.4 mini

OpenAI (USA)

A lean, low-latency variant of the GPT-5.4 line — it carries the on-demand analysis without giving up the score consistency of the larger models.

Mistral Large 2512

Mistral AI (France / EU)

A European pre-training pipeline from an independent provider — it brings a different training basis than the US models.

DeepSeek V4 Flash

DeepSeek (China)

A third, independent pre-training corpus — if this model deviates from the consensus, the spread becomes visible as model agreement.

Aggregation

Numeric scores

Median across all 3 models. Robust against outliers of a single model.

Categorical labels

Majority vote. On a 1:1:1 split, the label falls back to the one attached to the score nearest the median.

Model agreement

The spread of the three judgments is reported per dimension as high, medium or low agreement — separate from the model's confidence.
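
The three aggregation rules can be combined in one short sketch for a single dimension. Several points are assumptions rather than documented behavior: the function name `aggregate_dimension`, the reading of the tie rule as "take the label of the model whose score is closest to the median", and the spread thresholds 1.0 and 2.5 for high, medium and low agreement.

```python
from collections import Counter
from statistics import median

# Hypothetical spread thresholds (in scale points) for the agreement report.
HIGH_AGREEMENT_MAX_SPREAD = 1.0
MEDIUM_AGREEMENT_MAX_SPREAD = 2.5


def aggregate_dimension(judgments: list[tuple[float, str]]) -> dict:
    """Aggregate three (score, label) judgments for one dimension."""
    if len(judgments) != 3:
        raise ValueError("expected exactly three model judgments")
    scores = [score for score, _ in judgments]
    labels = [label for _, label in judgments]
    med = median(scores)

    label, votes = Counter(labels).most_common(1)[0]
    if votes == 1:
        # 1:1:1 split: use the label of the judgment nearest the median.
        label = min(judgments, key=lambda j: abs(j[0] - med))[1]

    spread = max(scores) - min(scores)
    if spread <= HIGH_AGREEMENT_MAX_SPREAD:
        agreement = "high"
    elif spread <= MEDIUM_AGREEMENT_MAX_SPREAD:
        agreement = "medium"
    else:
        agreement = "low"
    return {"score": med, "label": label, "agreement": agreement}


print(aggregate_dimension(
    [(8.2, "objektiv"), (7.5, "objektiv"), (6.4, "moderat_biased")]
))
# {'score': 7.5, 'label': 'objektiv', 'agreement': 'medium'}
```

With three judgments the median is always one of the three scores, so a single outlying model can move the reported spread but not the reported score.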

Scientific foundations

A peer-reviewed foundation, one methodology.

Klyptra is not a homegrown construct. Every methodological layer references a peer-reviewed source with a DOI. What sets it apart from BiasScanner (our direct predecessor) is the three-model ensemble, the German-language specialization, the verbatim gate and the PFA layer.

Concept & 6 dimensions

Spinde et al. (2023) — Media Bias Taxonomy

The six dimensions operationalize the bias types of the Media Bias Taxonomy. BABE (Spinde et al. 2021, EMNLP-Findings) serves as an expert-annotated validation benchmark (binary biased/neutral) and IRR yardstick — the 0–10 scale itself is Klyptra's operationalization, not part of BABE.

Spinde et al. (2023) Media Bias Taxonomy, ACM Comput. Surv., arXiv:2312.16148 · BABE: EMNLP-Findings 2021, DOI: 10.18653/v1/2021.findings-emnlp.101 · consolidated in Spinde (2025), Springer, Open Access, DOI: 10.1007/978-3-658-47798-1

→ 6 dimensions

27 sub-patterns

Menzner & Leidner (2024) — BiasScanner taxonomy

The 27 specific bias categories are placed as a sub-layer beneath the six Klyptra dimensions. BiasScanner is Klyptra's direct scientific predecessor. Not part of Spinde's work — an independent foundation.

Menzner & Leidner (2024) “Improved Models for Media Bias Detection and Subcategorization”, NLDB 2024, pp. 181–196. DOI: 10.1007/978-3-031-70239-6_13

→ 27 sub-patterns

Actor layer

Felix Hamborg (2023) — Person-Oriented Framing Analysis

PFA-light extracts named actors with mention counts, sentiment and framing devices. The cross-person disparity makes systematic asymmetries visible.

Hamborg (2023) “Revealing Media Bias in News Articles”, Springer. DOI: 10.1007/978-3-031-17693-7

→ PFA section

Limits

What Klyptra does not do.

Methodological honesty requires naming the weak points. This list is not exhaustive — contributions are welcome.

LLMs as annotators

Language models have documented biases (Horych et al. 2025). The ensemble and the verbatim requirement reduce these biases but do not eliminate them. Klyptra is not an arbiter; it is a systematic, verifiable indicator.

A snapshot of one text

Klyptra rates exactly the submitted text — not the outlet, newsroom or author behind it. A single result is not a verdict on an outlet; no general “tendency” of a source can be derived from one analysis.

Language level, not factual accuracy

Klyptra measures language and framing — not whether claimed facts are correct. Fact-checking is a separate task (see Correctiv, dpa fact-check).

Evidence without literal coverage is discarded

The verbatim gate keeps only quotes that appear exactly in the text. Paraphrases are not output as evidence — some dimensions therefore deliberately appear without an evidence quote and are marked as a model-based assessment. Honesty over forced evidence.

Detail layers not yet ensemble-aggregated

The six top-level scores are aggregated across all three models (median) and reported with their spread. The detail layers (sub-categories, actor analysis, coreference) currently come from one of the three models — a true union/dedupe aggregation across all models is planned as the next stage. A deliberate MVP decision, not a bug.

German-language specialization

The three models are primarily pre-trained in English. Idiom, subjunctive discipline and irony detection can be thinner in German than in English. Few-shot examples compensate in part; a systematic German ground-truth evaluation is on the research roadmap.

Genre bias: commentary as news

The six dimensions are calibrated for news reporting. If a commentary is submitted, word choice and emotional balance swing strongly, as expected. Klyptra cannot yet reliably tell whether a text should be classified as a report or a commentary. Genre detection as a preliminary stage is documented in the methodology roadmap.

Bias annotation is constitutively subjective

Even trained experts reach an agreement of only Krippendorff α ≈ 0.40 on bias labels (Spinde 2025, Ch. 4). That is the documented maximum of the domain, not a sign of weakness. There is no strong “ground truth”; Klyptra's consistency aim targets this expert level, not objective truth.

Technically measurable is not socially relevant

Automated methods can report patterns that are statistically tangible but substantively irrelevant (Spinde 2025, Ch. 8). A score is a systematic indicator, not a final verdict on the significance of a text.

Your own standpoint colors perception

Readers perceive texts that contradict their position as more biased than comparable texts on their own side (hostile-media effect). Studies show that even bias visualizations do not resolve this effect (Spinde 2025, Ch. 7) — a result is read filtered through one's own stance.

Political classification is not a bias statement

Spinde (2025, Ch. 7) shows that a political classification does not increase bias perception — it communicates stance, not distortion. Klyptra's political, economic and social descriptors are a tendency indication and do not feed into the objectivity score.

Version history

When what changed.

Every single analysis carries a methodology_version tag. Existing analyses keep their version — methodology updates create no score drift in historical data.

v1.0

April 2026

6 top-level dimensions + verbatim-quote gate + multi-model ensemble. Scientific basis: Media Bias Taxonomy (Spinde et al. 2023), validated against BABE (2021).

v1.1

May 2026

Added: 27 bias sub-categories (BiasScanner), Person-Oriented Framing Analysis (PFA-light after Felix Hamborg), coreference documentation. Existing v1.0 analyses remain valid — the new layers are additive.

Reproducibility

Methodology, data, prompts.

Klyptra discloses its methodology: model versions, bias dimensions and the underlying research are documented on this page. Every analysis also carries a signature (system-prompt hash) that records its exact methodological state.
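
How such a signature might be derived can be sketched in a few lines. The hash algorithm is not named on this page; SHA-256, the function name `analysis_signature` and the field name `system_prompt_hash` are assumptions chosen for illustration, while `methodology_version` is the tag described in the version history.

```python
import hashlib


def analysis_signature(system_prompt: str, methodology_version: str) -> dict:
    """Build a signature that pins an analysis to its exact prompt text."""
    # Assumption: SHA-256 over the UTF-8 bytes of the system prompt.
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    return {
        "methodology_version": methodology_version,
        "system_prompt_hash": digest,
    }


first = analysis_signature("Rate the text on six dimensions.", "v1.1")
second = analysis_signature("Rate the text on six dimensions!", "v1.1")
print(first["system_prompt_hash"] == second["system_prompt_hash"])  # False
```

Any change to the prompt, even a single character, yields a different hash, which is what lets a stored analysis be matched to one exact methodological state.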

Methodology

Bias dimensions, scale and aggregation logic are described in full on this page.

Data

Every analysis is exportable as JSON and PDF for its creator.

Prompts

Versioned prompt templates incl. few-shot examples. Diff log for every change.
