Frank Bültge

Method sheet — The Consensus

What this is

Every outlet named here really ran these words. Counting that echo is the point: apparent consensus is not independent confirmation. "47 outlets report X" often means one source, repeated 47 times. This counter-measurement counts what passes as independent reporting but is in fact copied.

1. Sources & licences

News comes from the GDELT DOC 2.0 API (GDELT, open and free to use). Eight broad beats (politics, economy, technology, health, science, business, sports, weather), English-language only. The cited "independent" sources are the domains that ran the sentence verbatim; they are listed by name on the work page.

https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/
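As a rough illustration of what one fetch per beat could look like, the sketch below only builds the request URLs; it does not contact the API. The endpoint and the query, mode, format, maxrecords and timespan parameters belong to the GDELT DOC 2.0 API, but the concrete values chosen here (a one-day window, 250 records, the bare beat name as the search term) and the helper name build_query_url are assumptions for illustration, not the pipeline's actual settings.

```python
from urllib.parse import urlencode

# Public GDELT DOC 2.0 endpoint; no API key is needed.
GDELT_DOC_ENDPOINT = "https://api.gdeltproject.org/api/v2/doc/doc"

# The eight beats named in the method sheet.
BEATS = ["politics", "economy", "technology", "health",
         "science", "business", "sports", "weather"]


def build_query_url(beat, timespan="1d", max_records=250):
    """Build an article-list request for one beat (hypothetical helper).

    Assumption: the beat name is used directly as the search term and
    restricted to English-language sources via the sourcelang operator.
    """
    params = {
        "query": f"{beat} sourcelang:english",
        "mode": "artlist",          # list of matching articles
        "format": "json",
        "maxrecords": max_records,  # assumed value
        "timespan": timespan,       # assumed value: last day
    }
    return f"{GDELT_DOC_ENDPOINT}?{urlencode(params)}"


urls = [build_query_url(beat) for beat in BEATS]
for url in urls:
    print(url)
```

One URL per beat gives eight requests in total, which can then be fetched with any HTTP client.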

2. Cadence

Daily. The machine selects on its own: the phrase with the widest spread across distinct source domains is the "headline of the day". Canonical artefact: versioned JSON in src/data/consensus/ — git is the archive.

3. Processing

Pool articles (dedupe by URL) → count verbatim 6-gram title phrases across distinct domains → the most replicated phrase is the headline. Echo index = share of pooled titles that belong to an echo spanning at least 3 domains. Symbolic provenance: the earliest timestamp marks the source candidate, and the order of the later timestamps gives the cascade.

→ pipelines/consensus
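The processing steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the code in pipelines/consensus: it assumes each article is a dict with url, title, domain and seendate keys, that titles are lowercased and split on whitespace, and that a title "belongs to an echo" when it contains at least one 6-gram seen on three or more domains. The function names and the sample data are made up for illustration.

```python
from collections import defaultdict


def ngrams(title, n=6):
    """Return the set of verbatim n-word phrases in a title."""
    words = title.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def analyse(articles, n=6, min_domains=3):
    # 1. Pool articles, keeping the first record seen for each URL.
    pool = {}
    for article in articles:
        pool.setdefault(article["url"], article)
    pool = list(pool.values())

    # 2. For every phrase, collect the distinct domains that ran it.
    domains_by_phrase = defaultdict(set)
    for article in pool:
        for phrase in ngrams(article["title"], n):
            domains_by_phrase[phrase].add(article["domain"])

    # 3. Headline: the phrase with the widest spread across domains
    #    (ties broken alphabetically, an arbitrary choice).
    headline = max(domains_by_phrase,
                   key=lambda p: (len(domains_by_phrase[p]), p),
                   default=None)

    # 4. Echo index: share of titles containing a phrase that
    #    appears on at least min_domains distinct domains.
    echoes = {p for p, d in domains_by_phrase.items() if len(d) >= min_domains}
    echoing = sum(1 for a in pool if ngrams(a["title"], n) & echoes)
    echo_index = echoing / len(pool) if pool else 0.0

    # 5. Provenance: the earliest carrier is the source candidate.
    source = None
    if headline is not None:
        carriers = [a for a in pool if headline in ngrams(a["title"], n)]
        source = min(carriers, key=lambda a: a["seendate"])
    return headline, echo_index, source


# Made-up sample data; seendate strings sort chronologically.
sample = [
    {"url": "https://a.example/1", "domain": "a.example",
     "title": "Central bank holds rates steady again", "seendate": "20240101T080000Z"},
    {"url": "https://b.example/1", "domain": "b.example",
     "title": "Central bank holds rates steady again", "seendate": "20240101T090000Z"},
    {"url": "https://c.example/1", "domain": "c.example",
     "title": "Central bank holds rates steady again today", "seendate": "20240101T100000Z"},
    {"url": "https://a.example/1", "domain": "a.example",
     "title": "Central bank holds rates steady again", "seendate": "20240101T110000Z"},
    {"url": "https://d.example/1", "domain": "d.example",
     "title": "Local team wins the regional cup final", "seendate": "20240101T120000Z"},
]

headline, echo_index, source = analyse(sample)
print(headline, echo_index, source["domain"])
```

Counting domains rather than articles is what keeps a single outlet that republishes its own story from inflating the spread.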

AI/ML — staged, checkable

The lab experiments with data AND AI. Implemented: v1 is the verbatim baseline; v2 uses TF-IDF/cosine similarity to catch paraphrased coordination (reworded wire copy that the verbatim count misses); v3 is a symbolic, rule-based classifier that uses the graph structure (TLD homogeneity, time window) to separate chain syndication from scattered placement. It is auditable, with no black-box model. Optional/future: deep embeddings and an LLM classifier verified against the graph (prompt disclosed). Condition throughout: every AI step is transparent, and every output is either verified or marked as an estimate.
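To illustrate why the v2 stage catches what the verbatim count misses, here is a small pure-Python sketch of TF-IDF weighting with cosine similarity. It is not the lab's implementation: the whitespace tokenisation, the smoothed IDF formula, the function names and the example titles are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each title into a sparse {term: weight} TF-IDF vector."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many titles each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency (one common variant).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({term: tf[term] * idf[term] for term in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up titles: the second rewords the first, the third is unrelated.
titles = [
    "central bank holds interest rates steady",
    "interest rates held steady by central bank",
    "local team wins regional cup final",
]

vectors = tfidf_vectors(titles)
sim_paraphrase = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
print(sim_paraphrase, sim_unrelated)
```

The first two titles share no 6-word phrase, so the verbatim baseline would treat them as unrelated, while their cosine similarity is clearly higher than that of the unrelated pair. Where to set a cut-off for "paraphrased" is a tuning decision this sketch leaves open.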

4. Limits of the method

5. Compute footprint

Eight lightweight HTTP fetches per day, no API key, no LLM in v1. The site itself is static.

6. Change log

→ To the experiment