Measuring Escalation in Political Speech with LLMs: A Proposed Approach
Can political rhetoric be measured systematically enough to detect when discourse is hardening before offline consequences become visible?
As elected officials and politicians rely more heavily on social media, the volume and frequency of their posts become a meaningful analytical signal. The core argument is simple: hate speech should not be treated as a binary label; it should be measured along a gradient of intensity.

Figure: A cross-topic view of rhetorical escalation over time for selected topics mentioned in tweets by US President Trump.
In practice, rhetoric often escalates by stages. Disagreement becomes accusation, accusation becomes systematic negative characterization, and that can harden further into dehumanization or explicit advocacy of harm.
A Useful Escalation Scale
Babak Bahador's intensity framework is especially useful because it captures that gradient. In this formulation:
- Levels 1-3 function as early-warning categories.
- Level 4 marks dehumanization.
- Levels 5 and 6 move into violence and death-related rhetoric.
That structure is analytically valuable because it helps identify escalation before the most extreme threshold is crossed.
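To make the rubric concrete, here is a minimal sketch of how the scale can be encoded for automated scoring. Only the three bands named above come from the framework as summarized here; any finer-grained labeling would need Bahador's full level definitions, which are not reproduced in this piece.

```python
# Minimal encoding of the six-level scale as summarized above.
INTENSITY_BANDS = {
    "early_warning": {1, 2, 3},    # levels 1-3: early-warning categories
    "dehumanization": {4},         # level 4: dehumanizing language
    "violence_death": {5, 6},      # levels 5-6: violence and death-related rhetoric
}

def band_for(level: int) -> str:
    """Map a 1-6 intensity score to its analytical band."""
    for band, levels in INTENSITY_BANDS.items():
        if level in levels:
            return band
    raise ValueError(f"intensity level must be 1-6, got {level}")
```

Grouping scores into bands like this is what makes early-warning monitoring possible: movement in the lower bands can be tracked without waiting for level-5 or level-6 content to appear.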
Why Use an LLM
Traditional keyword systems are useful for retrieval, but they are blunt instruments for interpretation. They struggle with tone, metaphor, implied targets, and the distinction between criticism of an institution and hostility directed at a broader group.
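A toy example makes the gap visible. The keyword list and posts below are invented for illustration; no real system or dataset is implied.

```python
import re

# Toy keyword filter; the term list and example posts are invented.
HOSTILE_TERMS = re.compile(r"\b(vermin|invasion|destroy)\b", re.IGNORECASE)

posts = [
    "The ministry's budget figures are indefensible.",   # institutional criticism
    "Those people are a disease on this country.",       # dehumanizing metaphor, no keyword
    "We must destroy the myth that this policy works.",  # keyword hit, benign idiom
]

for post in posts:
    print(bool(HOSTILE_TERMS.search(post)), "-", post)
# Prints False, False, True: the filter misses the dehumanizing metaphor
# and flags the benign idiom, which is exactly the gap a rubric-guided
# model is meant to close.
```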
Used with a fixed rubric, a large language model can operate more like a trained annotator than a keyword filter. It can assign an intensity score, identify the target, and explain why a statement belongs at one level rather than another.
The point is not to outsource judgment to a model. It is to make judgment more consistent and more scalable.
In the methodology used here, the model is constrained by a defined scale, locked and logged model settings, versioned prompts, and an explicit validation protocol. Human validation is built in through a stratified sample reviewed by independent coders, with agreement metrics used to test whether the model performs as a credible classifier rather than an impressionistic one.
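As a sketch of what that setup looks like in code, the snippet below scores a single post against a fixed rubric with locked, logged settings. The rubric wording, model name, version tag, and output schema are all assumptions for illustration, not the study's actual prompt.

```python
import json
from openai import OpenAI

PROMPT_VERSION = "bahador-rubric-v1"               # hypothetical version tag
SETTINGS = {"model": "gpt-4o", "temperature": 0}   # locked and logged with every score

# Illustrative rubric prompt; the study's actual wording is not reproduced here.
RUBRIC = (
    "You are annotating political posts on a 1-6 intensity scale: "
    "levels 1-3 are early-warning negative characterization, level 4 is "
    "dehumanization, levels 5-6 are violence and death-related rhetoric. "
    'Respond in JSON: {"level": 1-6, "target": "...", "rationale": "..."}'
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_post(text: str) -> dict:
    """Score one post against the fixed rubric; log prompt version and settings."""
    resp = client.chat.completions.create(
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},   # constrain output to parseable JSON
        **SETTINGS,
    )
    record = json.loads(resp.choices[0].message.content)
    record.update(prompt_version=PROMPT_VERSION, **SETTINGS)
    return record
```

Logging the prompt version and settings alongside every score is what makes the results auditable later: any classification can be traced back to the exact rubric and configuration that produced it.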
Case Study: Venezuela
To see the scale in practice, I applied it to Donald J. Trump's Venezuela-related social media activity in the run-up to the January 2026 military operation that culminated in Nicolas Maduro's capture.

Figure: Venezuela-related posts scored with the Bahador intensity scale.
One of the most important findings is the visible build-up in the months before the first strikes: clusters of level-3 and level-4 statements begin appearing as early as 2024. The most useful signal is not simply that intensity rises. It is that the method provides a structured way to monitor rhetorical hardening over time, before rhetoric reaches the violence-related levels of the scale.
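A monitoring view like this can be built directly on the scored output. The sketch below assumes a simple table of dated posts with model-assigned levels; the column names and data are illustrative, not the study's schema.

```python
import pandas as pd

# Illustrative data: dated posts with model-assigned intensity levels.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-01", "2024-03-09", "2024-04-02", "2024-04-20"]),
    "level": [2, 3, 4, 3],
})

g = df.set_index("date")["level"].resample("MS")   # monthly buckets (month start)
monthly = pd.DataFrame({
    "mean_level": g.mean(),
    "share_level3_plus": g.apply(lambda s: (s >= 3).mean()),
})
print(monthly)
# A sustained rise in either column flags rhetorical hardening well
# before level-5/6 content appears.
```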
Key Takeaways
The result is straightforward: LLMs make it possible to read large volumes of political language with far more nuance than traditional keyword systems. For domain experts, the more important point is methodological. Pre-existing frameworks like Bahador's can be adapted and applied at scale for a fraction of the cost of full human labeling.
The complete Venezuela analysis cost less than $2 in model usage and ran in minutes. Even after factoring in human review, the approach is orders of magnitude cheaper and faster than full human labeling.
Limits
There are, of course, limits. Every generative AI model reflects biases from its training data and learned weights. Language nuance also matters: different models may interpret the same words differently, especially across translation boundaries. Setting temperature to 0 helps, but it does not fully eliminate hallucinations or run-to-run variance. These systems should be treated as disciplined classifiers inside a research design, not as neutral arbiters of meaning.
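One practical discipline that follows from this caveat is to quantify run-to-run drift directly. The sketch below rescores the same sample twice under identical locked settings and reports agreement; score_post is the hypothetical annotator from the earlier sketch.

```python
from sklearn.metrics import cohen_kappa_score

def stability_check(sample_posts: list[str]) -> dict:
    """Score the same sample twice under identical settings and report agreement."""
    run_a = [score_post(p)["level"] for p in sample_posts]
    run_b = [score_post(p)["level"] for p in sample_posts]
    exact = sum(a == b for a, b in zip(run_a, run_b)) / len(sample_posts)
    return {
        "exact_match": exact,
        "cohen_kappa": cohen_kappa_score(run_a, run_b),  # chance-corrected agreement
    }
```

The same cohen_kappa_score call serves the human-validation protocol described earlier: substitute independent coders' levels for one of the runs on the stratified sample.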
Downloads
A public version of the underlying Trump posts dataset is being made available through 3DL's resources work, alongside the broader effort to build open, policy-relevant data infrastructure and non-partisan analytical tools.