Methodology

How Klyptra measures media bias.

This page discloses every methodological decision. Each layer is anchored in research, every finding traceable through verbatim evidence — especially for readers who are initially skeptical of our results.

The scale

0 is not zero. 10 is not perfect.

The six dimensions and the 0–10 scale are Klyptra's operationalization of the Media Bias Taxonomy (Spinde et al. 2023) — not a direct part of BABE, which annotates binarily (biased/neutral). The scale is calibrated against BABE-style expert benchmarks. The five bands are not equally wide: news-agency reporting (dpa, AFP, Reuters) typically hits 8–9 — a value of 10 would be pure fact-listing with no narrative selection at all, practically unreachable. Downward, by contrast, there is more room: the propaganda band (0–2.9), three points wide, is the largest.

9 – 10 · 1 point · sehr_objektiv
Near-neutral reporting. No evaluative adjectives, balanced sources, speculation clearly marked as such. News-agency level (dpa, AFP, Reuters).

7 – 8.9 · 2 points · objektiv
Solid journalistic standards. Occasional evaluations detectable, but transparently marked as opinion. Multiple perspectives represented.

5 – 6.9 · 2 points · moderat_biased
Clearly recognizable editorial line. Word choice with a tendency, one-sided source selection, but no systematic distortion.

3 – 4.9 · 2 points · stark_biased
Consistently one-sided presentation. Loaded words without labeling, omission of exculpatory facts, emotional charging.

0 – 2.9 · 3 points · propaganda
Facts are distorted, the other side not quoted or only as a straw man, sensational framing dominates. Widest band: there is more room downward than upward.

The labels in this table are exactly the strings the analyzer outputs in the JSON and the permalink UI (in German) — no UI mapping in between.
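The band boundaries in the table can be read as a simple lookup. The following is a minimal sketch, not Klyptra's actual code: the function name band_label is hypothetical, and treating each lower bound as inclusive is an assumption. Only the label strings and ranges come from the table.

```python
# Illustrative lookup. Lower bounds and labels are taken from the table;
# treating each lower bound as inclusive is an assumption.
BANDS = [
    (9.0, "sehr_objektiv"),
    (7.0, "objektiv"),
    (5.0, "moderat_biased"),
    (3.0, "stark_biased"),
    (0.0, "propaganda"),
]


def band_label(score: float) -> str:
    """Return the band label for a score on the 0-10 scale."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score must be within 0-10, got {score}")
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: 0.0 is the lowest bound")
```

Under these assumptions, a score of 8.5 maps to objektiv and a score of 2.9 to propaganda.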

The six dimensions — in depth

What each dimension measures and where it comes from.

Framing

Media Bias Taxonomy (Spinde et al. 2023) · Framing bias

Which perspective is declared the narrative norm? Who is subject, who is object?

Operationalization

  • Active/passive constructions with political asymmetry
  • Order in which actors are named
  • Implicit attribution of blame through verb choice

Word choice

Media Bias Taxonomy (Spinde et al. 2023) · Lexical bias

Which words carry judgments without marking them? Loaded language in the narrow sense.

Operationalization

  • Verbatim identification of evaluative terms
  • Comparison with neutral synonyms
  • Density per 1000 words
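The density figure in the last bullet is plain arithmetic. As a rough sketch, assuming the evaluative terms have already been identified and that words are counted by splitting on whitespace (both assumptions, as is the function name):

```python
def density_per_1000_words(text: str, evaluative_terms_found: int) -> float:
    """Evaluative terms per 1000 words; whitespace tokenization is an assumption."""
    word_count = len(text.split())
    if word_count == 0:
        return 0.0  # avoid division by zero on empty input
    return evaluative_terms_found * 1000 / word_count
```

Normalizing by length keeps a long article from looking more loaded than a short one merely because it contains more words.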

Source diversity

Media Bias Taxonomy (Spinde et al. 2023) · Selection/Coverage

How many voices are quoted directly, how politically broad is the spectrum?

Operationalization

  • Number of directly quoted people / institutions
  • Political positioning of those quoted
  • Ratio of primary to secondary sources

In the book (Spinde 2025, Ch. 2), source/selection bias is a reporting-level construct that is, strictly speaking, measured across articles. Klyptra approximates it on the single text; the full cross-outlet analysis sits at Tier 2 (not in the score).

Fact / opinion

Media Bias Taxonomy (Spinde et al. 2023) · Epistemological bias

Is evaluation linguistically separated from observation — or sold as fact?

Operationalization

  • Marking of commentary (“claims”, “according to X”)
  • Forecasts vs. facts
  • Subjunctive discipline

This is a dimension of its own because German news language interweaves evaluation and observation especially tightly at the syntactic level (nominalization, modal verbs, subjunctive I/II), which the English BABE annotation covers only indirectly.

Completeness

Media Bias Taxonomy (Spinde et al. 2023) · Spin/Omission

What is left out? Which relevant background or counter-positions are missing?

Operationalization

  • Recognizably missing counter-positions to central claims
  • One-sided fact selection (cherry-picking)
  • Context gaps that distort the framing

In the book, omission/spin bias is partly a reporting-level construct (what is missing across several articles). Klyptra assesses the gaps recognizable in the single text; the cross-article level is Tier 2.

Emotional balance

Media Bias Taxonomy (Spinde et al. 2023) · Phrasing/Sentiment

How strongly is the text emotionally charged? Does it use sensational or outrage language?

Operationalization

  • Exclamation-mark density in headline and lead
  • Escalation vocabulary (“scandal”, “quake”, “madness”)
  • Adjective intensity

27 sub-categories

Beneath every dimension lies a concrete pattern.

The six dimensions are the top-level axes. Beneath them, Klyptra maintains 27 specific bias patterns following the BiasScanner taxonomy (Menzner & Leidner 2024). Per article, 0–N patterns are identified — each with verbatim evidence, position in the text and a strength assessment.

Sub-categories are qualitative markers, not numeric sub-scores. If a top dimension such as “Completeness” is rated low, the sub-layer shows which concrete pattern carries the finding — e.g. Cherry-Picking or Whataboutism.

Word choice

word_choice
4 patterns

Lexical level — words that evaluate without marking the evaluation as such.

  • Word Choice Bias

    Example: A “migrant” is consistently called an “intruder”.

  • Emotional Sensationalism

    Example: “Nightmare scenario”, “shock diagnosis”, “mood of doom” as routine vocabulary.

  • Discrimination Bias

    Example: Generalization about groups (“typical of …-migrants”), mentioning origin without relevance.

  • Smear / Praise Bias

    Example: “scandalous attempt” for one party vs. “bold initiative” for the other, for the same kind of action.

Framing

framing
6 patterns

Narrative constructs — how a matter is framed in storytelling, independent of individual words.

  • Straw Man

    Example: “The left wants every migrant to get a house immediately.” A caricature of the opposing position is attacked.

  • False Dichotomy

    Example: “Either we cut taxes, or the country collapses.” Two options suggested where many exist.

  • False Analogy

    Example: “Just like back in 1933 …” for a current political debate with a loose connection.

  • Insinuative Questioning

    Example: “Why is the chancellor silent on the accusations?”, without the accusations themselves being substantiated.

  • Moving Goalposts

    Example: “5% growth was expected, now it's 8%, so a failure.” The yardstick adjusted after the result.

  • In-Group / Out-Group Bias

    Example: Consistent “we Germans” vs. “them”: collective assignment of guilt or virtue.

Source diversity

source_diversity
3 patterns

Source quality — who is heard, how they are quoted, whether the voices are classifiable.

  • Source Selection Bias

    Example: Only one party's press office is quoted; the other side not at all or only paraphrased.

  • External Validation Bias

    Example: A lobbyist is introduced as an “independent expert” without naming their interests.

  • Vague Attribution

    Example: “Circles report …”, “according to insiders …” carry the central argument of the piece.

Fact–opinion separation

fact_opinion_separation
5 patterns

Linguistic discipline — whether evaluation is marked as evaluation or sold as fact.

  • Opinionated Bias

    Example: “The government's catastrophic policy …”: an evaluative adjective in news mode.

  • Speculation Bias

    Example: “This will undoubtedly end in disaster.” Forecast without subjunctive, without source.

  • Unsubstantiated Claims

    Example: “Millions are affected”: a number without evidence, source or method.

  • Projection Bias

    Example: “They only care about power.” Attribution of motive as a statement of fact.

  • Circular Reasoning

    Example: “It is illegal because it breaks the law.” The justification repeats the claim.

Completeness

completeness
4 patterns

What is missing — relevant context, counter-arguments, evidence that goes unmentioned.

  • Cherry-Picking

    Example: One study is cited; three methodologically comparable studies with the opposite result are not.

  • Anecdotal Evidence

    Example: “Ms. M. from Hamburg says …” carries a trend finding; statistical data are missing.

  • Whataboutism

    Example: Consistent deflection from the main accusation onto the behavior of other actors.

  • False Balance

    Example: Climate scientists and climate deniers are presented as equivalent voices, although the evidence base is asymmetric.

Emotional balance

emotional_balance
5 patterns

Affective charge — how strongly and in which direction the text colors emotionally.

  • Ad Hominem

    Example: “The incompetent minister …”: the person attacked instead of the argument refuted.

  • Causal Misunderstanding

    Example: “Since X has governed, Y has fallen, so X is to blame.” Correlation as causation, without a mechanism.

  • Generalization

    Example: “All politicians lie”, “the media” as a monolithic actor.

  • Commercial Bias

    Example: A product report without distance; editorial content not separated from advertising.

  • Political Bias

    Example: A consistent camp tendency across fact selection, word choice and sources.

Every sub-finding passes the same verbatim gate as the top-level analysis: findings without a quote that can be verified in the original text are discarded. In the JSON export of an analysis, the layer appears as sub_categories[] with parent_dimension, verbatim_quote, char_offset and bias_strength.
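To make the export shape concrete, here is a hedged sketch of one sub_categories[] entry. The four field names parent_dimension, verbatim_quote, char_offset and bias_strength come from the paragraph above, and the six dimension keys from the sub-category headings. The field name pattern, the class name, the string type of bias_strength and the example values are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass

# Dimension keys as they appear in the sub-category headings of this page.
DIMENSIONS = (
    "word_choice",
    "framing",
    "source_diversity",
    "fact_opinion_separation",
    "completeness",
    "emotional_balance",
)


@dataclass(frozen=True)
class SubCategoryFinding:
    pattern: str            # hypothetical field name for the pattern label
    parent_dimension: str   # one of DIMENSIONS
    verbatim_quote: str     # must appear literally in the article text
    char_offset: int        # position of the quote in the article text
    bias_strength: str      # assumed to be a string; the value set is not documented here

    def __post_init__(self) -> None:
        if self.parent_dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.parent_dimension}")
        if self.char_offset < 0:
            raise ValueError("char_offset must be non-negative")


def export_sub_categories(findings: list[SubCategoryFinding]) -> str:
    """Serialize findings under the sub_categories key as JSON."""
    payload = {"sub_categories": [asdict(f) for f in findings]}
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

Rejecting unknown dimension keys at construction time is a design choice of this sketch; it keeps a malformed model output from reaching the export.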

Actor analysis — PFA-light

Who is put in which light?

Person-Oriented Framing Analysis (Felix Hamborg 2023) extracts the named actors per article and describes how they are talked about. That is more concrete than any holistic score — and makes systematic asymmetries between actors visible.

Per actor

Each identified person (politician, scientist, citizen, …) gets four fields:

  • mentions_count

    How often the actor appears — across all designations (see Coreference).

  • sentiment_score

    Aggregated tone toward the actor on a scale from −1 (negative) to +1 (positive).

  • framing_devices

    Up to five recurring stylistic devices per actor — e.g. “attribution of blame”, “hero narrative”, “victim staging”.

  • representative_quotes

    Three to five verbatim quotes that carry the framing — the verbatim gate ensures each quote appears 1:1 in the text.

Cross-person analysis

From the individual actors, a distribution observation is computed — whether the article treats the people comparably in language or not.

sentiment_disparity

Difference between the most positive and most negative actor sentiment in the article — computed over all actors with mentions_count ≥ 2.

disparity = max(sentiments) − min(sentiments)

balance_assessment

A qualitative classification of the disparity: balanced / slightly_asymmetric / strongly_asymmetric. With fewer than 2 actors with sufficient mentions, the field returns not_applicable — instead of inventing a value.

Deterministic aggregation

cross_person_analysis is not taken from the language model but recomputed deterministically from the filtered actor data. This keeps the disparity metric always consistent with the reported sentiment values — even if the model would summarize differently internally.
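The deterministic recomputation can be sketched as follows. The formula, the mentions_count ≥ 2 filter, the three balance labels and the not_applicable fallback come from this section. The numeric thresholds that separate the labels are not documented here, so the values 0.3 and 0.8 below are purely hypothetical placeholders, as are the function names.

```python
from typing import Optional

MIN_MENTIONS = 2  # actors below this count are excluded, as described above


def sentiment_disparity(actors: list[dict]) -> Optional[float]:
    """max(sentiments) - min(sentiments) over actors with enough mentions."""
    sentiments = [
        a["sentiment_score"] for a in actors if a["mentions_count"] >= MIN_MENTIONS
    ]
    if len(sentiments) < 2:
        return None  # fewer than 2 qualifying actors: no value is invented
    return max(sentiments) - min(sentiments)


def balance_assessment(
    disparity: Optional[float],
    slight_threshold: float = 0.3,   # hypothetical cut-off, not from the source
    strong_threshold: float = 0.8,   # hypothetical cut-off, not from the source
) -> str:
    """Map a disparity to the qualitative labels named in this section."""
    if disparity is None:
        return "not_applicable"
    if disparity < slight_threshold:
        return "balanced"
    if disparity < strong_threshold:
        return "slightly_asymmetric"
    return "strongly_asymmetric"
```

Because both functions read only the filtered actor data, the reported disparity cannot drift away from the per-actor sentiment values.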

Coreference documentation

The same person, three names.

Political texts reference the same entity in several ways — by name, by role, by pronoun. Klyptra documents these cross-references explicitly so that mention counts and actor sentiment are not distorted by mere synonymy.

What happens

Per article, a list coreference_documentation.entities[] is reported. Each entity has a canonical_name and a list all_mentions[] of every designation found in the text.

mention_count is then recomputed deterministically as the sum of non-overlapping substring matches of all mentions in the text — not taken from the language model.

Example

In a report on Ukraine policy:

canonical_name

Volodymyr Zelensky

all_mentions

  • “Zelensky”
  • “the Ukrainian president”
  • “the head of state in Kyiv”

mention_count

7

Without this resolution, the actor would land in three different buckets — and the sentiment aggregation in the PFA layer would be distorted.
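One way to count non-overlapping substring matches across all designations is sketched below. This is an illustration under stated assumptions, not the production routine: matching is case-sensitive, longer designations are matched first so that a short designation contained in a longer one is not counted twice, and the function name count_mentions is hypothetical.

```python
def count_mentions(text: str, all_mentions: list[str]) -> int:
    """Count non-overlapping occurrences of any designation in the text."""
    claimed: list[tuple[int, int]] = []  # character spans already counted
    # Longest first (then alphabetical, for a deterministic order).
    ordered = sorted(set(all_mentions), key=lambda m: (-len(m), m))
    for mention in ordered:
        if not mention:
            continue
        start = text.find(mention)
        while start != -1:
            end = start + len(mention)
            overlaps = any(start < e and end > s for s, e in claimed)
            if overlaps:
                start = text.find(mention, start + 1)
            else:
                claimed.append((start, end))
                start = text.find(mention, end)
    return len(claimed)
```

With the designations from the example above, each occurrence of “Zelensky” and “the Ukrainian president” adds one to the same entity's count instead of opening a new bucket.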

Pipeline

From submitted text to substantiated result.

Every step is independently testable and logged. Anyone who questions a result can trace the chain back to the individual piece of evidence in the text.

01

Input

Submitted text

The article text to be checked is submitted directly (paste or file) — optionally with a title and source label. Klyptra analyzes exactly this text, not the outlet behind it.

200–50,000 characters · Title optional · Source optional
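The input constraints can be checked before any model is called. A minimal sketch, assuming the limits are inclusive and counted on the raw submitted string (both assumptions, as are the function and key names):

```python
from typing import Optional

MIN_CHARS = 200
MAX_CHARS = 50_000


def validate_submission(
    text: str, title: Optional[str] = None, source: Optional[str] = None
) -> dict:
    """Reject texts outside the documented length range; title and source stay optional."""
    length = len(text)
    if not MIN_CHARS <= length <= MAX_CHARS:
        raise ValueError(
            f"text has {length} characters, expected {MIN_CHARS} to {MAX_CHARS}"
        )
    return {"text": text, "title": title, "source": source}
```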
02

Ensemble analysis

3 models in parallel

Three language models independently rate the same text on all six dimensions and extract the detail layers in parallel: sub-category findings, actor mentions with sentiment, and coreference clusters. A chain-of-verification (5 control questions) reduces hallucinations.

gpt-5.4-mini · mistral-large-2512 · deepseek-v4-flash
03

Aggregation

Median + agreement

Numeric scores: median across the three models. Labels: majority vote. The spread of the individual judgments is reported per dimension as model agreement — it stays visible instead of vanishing into the average.

Median · Majority vote · Agreement report
04

Verbatim gate & markup

Evidence checked literally

Every finding must carry a quote that appears exactly in the original text — otherwise the evidence is discarded (the assessment remains marked as model-based). Verified evidence is highlighted in the text with a character offset. The result is a permalink with a 30-day TTL.

exact string match · char_offset markup · Permalink 30 days
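The gate itself amounts to an exact substring check. The sketch below assumes that the first occurrence of a quote supplies the character offset and that no normalization (whitespace, quotation marks, case) is applied; the function name verbatim_gate is hypothetical.

```python
def verbatim_gate(article_text: str, quotes: list[str]) -> list[dict]:
    """Keep only quotes that occur literally in the text, with their offset."""
    verified = []
    for quote in quotes:
        # str.find returns -1 when the quote is not an exact substring.
        offset = article_text.find(quote) if quote else -1
        if offset == -1:
            continue  # paraphrased or invented evidence is discarded
        verified.append({"verbatim_quote": quote, "char_offset": offset})
    return verified
```

A paraphrase such as “disastrous policy” would be dropped if the article only says “catastrophic policy”, which is the behavior the gate is meant to enforce.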

One analyzer, two uses

The same analyzer code runs in two contexts:

  • On-demand (via /analyse): the actual analysis of a submitted text — with a permalink, 30-day TTL.
  • Reference corpus (internal): a continuously co-analyzed corpus serves exclusively to calibrate the scale. It produces no public outlet profiles and does not feed into individual user analyses.

Ensemble

Three models, because none alone is trustworthy.

Language models have their own, model-specific biases. Klyptra picks three models from three different pre-training pipelines so that the blind spots of a single model become visible through the others. Aggregation is by median (numeric) and majority vote (labels) — all models carry equal weight, none is preferred.

GPT-5.4 mini

OpenAI (USA)

A lean, low-latency variant of the GPT-5.4 line — it carries the on-demand analysis without giving up the score consistency of the larger models.

Mistral Large 2512

Mistral AI (France / EU)

A European pre-training pipeline from an independent provider — it brings a different training basis than the US models.

DeepSeek V4 Flash

DeepSeek (China)

A third, independent pre-training corpus — if this model deviates from the consensus, the spread becomes visible as model agreement.

Aggregation

Numeric scores

Median across all 3 models. Robust against outliers of a single model.

Categorical labels

Majority vote. On a 1:1:1 split, the label falls back to that of the model whose score lies nearest the median.

Model agreement

The spread of the three judgments is reported per dimension as high, medium or low agreement — separate from the model's confidence.
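The three aggregation rules can be combined in a few lines. This is a sketch under assumptions: spread is taken as the range of the three scores, and the cut-offs 1.0 and 2.5 for high, medium and low agreement are hypothetical placeholders, since the page does not state them. The function name is also hypothetical.

```python
from collections import Counter
from statistics import median


def aggregate_dimension(scores: list[float], labels: list[str]) -> dict:
    """Median score, majority label and agreement level for one dimension."""
    if len(scores) != 3 or len(labels) != 3:
        raise ValueError("expected exactly three scores and three labels")
    med = median(scores)
    label, votes = Counter(labels).most_common(1)[0]
    if votes == 1:
        # 1:1:1 split: take the label of the model whose score is nearest the median.
        nearest = min(range(3), key=lambda i: abs(scores[i] - med))
        label = labels[nearest]
    spread = max(scores) - min(scores)  # range as spread measure (assumption)
    if spread <= 1.0:        # hypothetical cut-off
        agreement = "high"
    elif spread <= 2.5:      # hypothetical cut-off
        agreement = "medium"
    else:
        agreement = "low"
    return {"score": med, "label": label, "agreement": agreement}
```

With three models the median is always one of the submitted scores, so a single outlier moves the reported spread but not the reported score.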

Scientific foundations

A peer-reviewed foundation, one methodology.

Klyptra is not a home-grown construct. Every methodological layer references a peer-reviewed source with a DOI. What distinguishes Klyptra from BiasScanner (our direct predecessor) is the three-model ensemble, the German-language specialization, the verbatim gate and the PFA layer.

Concept & 6 dimensions

Spinde et al. (2023) — Media Bias Taxonomy

The six dimensions operationalize the bias types of the Media Bias Taxonomy. BABE (Spinde et al. 2021, EMNLP-Findings) serves as an expert-annotated validation benchmark (binary biased/neutral) and IRR yardstick — the 0–10 scale itself is Klyptra's operationalization, not part of BABE.

Spinde et al. (2023) Media Bias Taxonomy, ACM Comput. Surv., arXiv:2312.16148 · BABE: EMNLP-Findings 2021, DOI: 10.18653/v1/2021.findings-emnlp.101 · consolidated in Spinde (2025), Springer, Open Access, DOI: 10.1007/978-3-658-47798-1

→ 6 dimensions

27 sub-patterns

Menzner & Leidner (2024) — BiasScanner taxonomy

The 27 specific bias categories are placed as a sub-layer beneath the six Klyptra dimensions. BiasScanner is Klyptra's direct scientific predecessor. Not part of Spinde's work — an independent foundation.

Menzner & Leidner (2024) “Improved Models for Media Bias Detection and Subcategorization”, NLDB 2024, pp. 181–196. DOI: 10.1007/978-3-031-70239-6_13

→ 27 sub-patterns

Actor layer

Felix Hamborg (2023) — Person-Oriented Framing Analysis

PFA-light extracts named actors with mention counts, sentiment and framing devices. The cross-person disparity makes systematic asymmetries visible.

Hamborg (2023) “Revealing Media Bias in News Articles”, Springer. DOI: 10.1007/978-3-031-17693-7

→ PFA section

Limits

What Klyptra does not do.

Methodological honesty requires naming the weak points. This list is not exhaustive — contributions are welcome.

LLMs as annotators

Language models have documented biases (Horych et al. 2025). Ensemble + verbatim requirement reduce this but do not eliminate it. Klyptra is not an arbiter — it is a systematic, verifiable indicator.

A snapshot of one text

Klyptra rates exactly the submitted text — not the outlet, newsroom or author behind it. A single result is not a verdict on an outlet; no general “tendency” of a source can be derived from one analysis.

Language level, not factual accuracy

Klyptra measures language and framing — not whether claimed facts are correct. Fact-checking is a separate task (see Correctiv, dpa fact-check).

Evidence without literal coverage is discarded

The verbatim gate keeps only quotes that appear exactly in the text. Paraphrases are not output as evidence — some dimensions therefore deliberately appear without an evidence quote and are marked as a model-based assessment. Honesty over forced evidence.

Detail layers not yet ensemble-aggregated

The six top-level scores are aggregated across all three models (median) and reported with their spread. The detail layers (sub-categories, actor analysis, coreference) currently come from one of the three models — a true union/dedupe aggregation across all models is planned as the next stage. A deliberate MVP decision, not a bug.

German-language specialization

The three models are primarily pre-trained in English. Idiom, subjunctive discipline and irony detection can be thinner in German than in English. Few-shot examples compensate in part; a systematic German ground-truth evaluation is on the research roadmap.

Genre bias: commentary as news

The six dimensions are calibrated for news reporting. If a commentary is submitted, word choice and emotional balance swing strongly, as expected — Klyptra cannot yet reliably tell whether a text should be classified as a report or a commentary. A genre detection as a pre-stage is documented in the methodology roadmap.

Bias annotation is constitutively subjective

Even trained experts reach an agreement of only Krippendorff α ≈ 0.40 on bias labels (Spinde 2025, Ch. 4). That is the documented maximum of the domain, not a sign of weakness. There is no strong “ground truth”; Klyptra's consistency aim targets this expert level, not objective truth.

Technically measurable does not mean socially relevant

Automated methods can report patterns that are statistically tangible but substantively irrelevant (Spinde 2025, Ch. 8). A score is a systematic indicator, not a final verdict on the significance of a text.

Your own standpoint colors perception

Readers perceive texts that contradict their position as more biased than comparable texts on their own side (hostile-media effect). Studies show that even bias visualizations do not resolve this effect (Spinde 2025, Ch. 7) — a result is read filtered through one's own stance.

Political classification is not a bias statement

Spinde (2025, Ch. 7) shows that a political classification does not increase bias perception — it communicates stance, not distortion. Klyptra's political, economic and social descriptors are a tendency indication and do not feed into the objectivity score.

Version history

When what changed.

Every single analysis carries a methodology_version tag. Existing analyses keep their version — methodology updates create no score drift in historical data.

v1.0
April 2026
6 top-level dimensions + verbatim-quote gate + multi-model ensemble. Scientific basis: Media Bias Taxonomy (Spinde et al. 2023), validated against BABE (2021).
v1.1
May 2026
Added: 27 bias sub-categories (BiasScanner), Person-Oriented Framing Analysis (PFA-light after Felix Hamborg), coreference documentation. Existing v1.0 analyses remain valid — the new layers are additive.

Reproducibility

Methodology, data, prompts.

Klyptra discloses its methodology: model versions, bias dimensions and the underlying research are documented on this page. Every analysis also carries a signature (system-prompt hash) that records its exact methodological state.
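A system-prompt hash of this kind could look like the sketch below. The page does not say which hash function is used or how the signature is stored, so SHA-256 and the key name system_prompt_hash are assumptions; methodology_version is the tag named in the version history.

```python
import hashlib


def prompt_signature(system_prompt: str, methodology_version: str) -> dict:
    """Fingerprint the exact prompt text; SHA-256 is an assumed choice."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    return {
        "methodology_version": methodology_version,
        "system_prompt_hash": digest,  # hypothetical key name
    }
```

Any change to the prompt, even a single character, yields a different digest, which is what allows an analysis to be tied to one methodological state.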

Methodology

Bias dimensions, scale and aggregation logic are described in full on this page.

Data

Every analysis is exportable as JSON and PDF for its creator.

Prompts

Versioned prompt templates incl. few-shot examples. Diff log for every change.