Mapping implicit knowledge graph structures in FEC data to straw donor schemes

Straw donor schemes are a prevalent form of election fraud that disguise the true source of a contribution to evade campaign finance laws. These schemes remain difficult to detect because tracing money requires k-hop joins over as many as 10⁸ records in flat tables (the aggregate FEC individual contribution data, the primary source), where the search space grows combinatorially with k. We propose a novel approach to detecting these schemes: natively encoding FEC individual contribution data as a directed property graph, then using LLMs to convert natural language questions into Cypher that can be executed against the knowledge graph. This setup enables multi-hop graph traversals that surface implicit relational structures (configurations not explicitly represented as nodes or edges in the schema, but that emerge through compositional querying) which are indicative of straw donor conduits and larger networks.

Last updated: January 2026

Code: github.com/YangJeffrey/kg-campaign-finance

Introduction

Under federal election law, individuals may contribute up to a statutory limit per candidate per election. A straw donor scheme circumvents this cap by routing money through intermediaries who make contributions in their own names while the true source of the funds remains hidden. The motive may be either to exceed contribution limits or to conceal which committees or policies a donor is quietly supporting. These conduit arrangements violate 52 U.S.C. § 30122, which prohibits making a contribution in the name of another person.

Because detection depends on surfacing patterns across many transactions (shared addresses, synchronized donation timing, employer clustering), the tabular format obscures exactly the signals investigators need. We introduce FEC Graph, a system that reframes the detection problem by modeling contribution data as a property graph and using LLM-generated Cypher queries to expose the implicit relational structures that characterize straw donor networks.

Related Work

Natural language to SQL (NL2SQL) has emerged as the standard approach to making structured data accessible to non-technical users: an LLM translates a plain-English question into SQL, which is then executed against a relational database. DataTalk, developed by Stanford and Columbia, applies this pattern directly to FEC records. However, this model still imposes a fundamental ceiling. Traversing a path from donor to employer to PAC to candidate requires composing a chain of JOIN operations across normalized tables that encode no notion of adjacency. Each additional hop introduces another JOIN, so the query text grows with path length while the set of intermediate rows the database must explore can grow combinatorially. This complexity also affects the LLM generation step: deeper join chains require the model to attend over more tables, columns, and foreign-key relationships, which can increase latency and error rates.
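To make the contrast concrete, the sketch below builds the SQL needed to walk k hops and the Cypher pattern for the same walk. It assumes a single hypothetical edge table `edges(src, dst)` and a hypothetical relationship type `LINKED`; neither is the schema of DataTalk or FEC Graph.

```python
def khop_sql(k: int, table: str = "edges") -> str:
    """Build SQL that follows k hops using k-1 self-joins on a hypothetical edges(src, dst) table."""
    if k < 1:
        raise ValueError("k must be >= 1")
    lines = [f"SELECT e1.src, e{k}.dst", f"FROM {table} e1"]
    # Every hop after the first adds one more JOIN clause.
    for i in range(2, k + 1):
        lines.append(f"JOIN {table} e{i} ON e{i}.src = e{i - 1}.dst")
    return "\n".join(lines)


def khop_cypher(k: int, rel_type: str = "LINKED") -> str:
    """Build the Cypher pattern for the same k-hop walk; only the hop count changes."""
    if k < 1:
        raise ValueError("k must be >= 1")
    return f"MATCH (a)-[:{rel_type}*{k}]->(b) RETURN a, b"


if __name__ == "__main__":
    print(khop_sql(3))
    print(khop_cypher(3))
```

The SQL gains one JOIN clause per hop, while the Cypher text stays a single pattern. This only illustrates query length; it says nothing about execution cost on either engine.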

A parallel line of work, Text2Cypher, applies the same natural language-to-query paradigm to graph databases. Rather than generating SQL, the LLM produces Cypher (a declarative query language for graph databases) which natively supports variable-length path traversals, pattern matching, and relationship-type filtering. Recent benchmarks have shown that LLMs can generate Cypher with accuracy comparable to NL2SQL on equivalent tasks, while offering strictly greater expressiveness for multi-hop and structural queries. FEC Graph builds on this foundation by combining this approach with a domain-specific graph schema purpose-built for individual contributions in campaign finance.

FEC Graph

FEC Graph models all contribution data up front as a directed property graph so that multi-hop relationships become native path queries for LLMs. This structure is better suited to detecting patterns like straw/conduit donations, PTEs, and employer reimbursement rings because structural motifs such as cycles, fan-outs, and shared-address clusters are directly queryable. It also simplifies the LLM's generation task: a Cypher traversal references only the node labels, relationship types, and properties along the query path, rather than a full schema of normalized tables and foreign keys. Against this graph, the LLM can compile plain-English investigative questions into deterministic traversals (e.g., "find employers with 2 or more employees that donated to committee X" → MATCH (d:Donor)-[:DONATED]->(:Committee {committee_id: "X"}) WITH d.employer AS employer, COUNT(DISTINCT d) AS n WHERE n >= 2 RETURN employer, n; the committee_id property name here is illustrative).
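The employer question above can be written as a parameterized query so that the committee ID and threshold are passed as data rather than spliced into the Cypher text. This is a minimal sketch: the `Donor`, `DONATED`, `Committee`, and `employer` names come from the example above, while the `committee_id` property name and the `employer_cluster_query` helper are assumptions made for illustration.

```python
from typing import Any, Dict, Tuple


def employer_cluster_query(committee_id: str, min_donors: int = 2) -> Tuple[str, Dict[str, Any]]:
    """Return (cypher, params) for employers with at least min_donors distinct donors to one committee."""
    if min_donors < 1:
        raise ValueError("min_donors must be >= 1")
    cypher = (
        # committee_id is an assumed property name on Committee nodes.
        "MATCH (d:Donor)-[:DONATED]->(:Committee {committee_id: $committee_id})\n"
        "WHERE d.employer IS NOT NULL\n"
        # Aggregated expressions must be aliased in WITH; d is out of scope afterwards.
        "WITH d.employer AS employer, count(DISTINCT d) AS n\n"
        "WHERE n >= $min_donors\n"
        "RETURN employer, n\n"
        "ORDER BY n DESC"
    )
    return cypher, {"committee_id": committee_id, "min_donors": min_donors}


if __name__ == "__main__":
    query, params = employer_cluster_query("C00677492")
    print(query)
    print(params)
```

The returned pair can be handed to any Cypher driver that accepts a query string plus a parameter map.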

LPG construction: Raw individual contribution records are pulled from the FEC bulk data API, normalized, and loaded into a Neo4j property graph. The schema follows a bipartite structure: Donor → Donation → Committee, where a donor may have multiple donation edges to the same or different committees, preserving transactional multiplicity. Donor nodes carry name, city, state, ZIP code, employer, and occupation; committee nodes carry committee ID; donation edges carry amount, date, transaction type, entity type, and additional FEC metadata.
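The normalization step can be sketched as follows, assuming the input is the pipe-delimited FEC bulk individual contributions file with its documented 21-column layout. The property names, the choice of (name, ZIP, employer) as the donor merge key, and the `LOAD_CYPHER` statement are illustrative assumptions, not the project's actual loader.

```python
from datetime import datetime
from typing import Any, Dict

# Column order of the FEC bulk individual contributions file.
FIELDS = [
    "CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM",
    "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE",
    "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT",
    "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID",
]

# Assumed load statement: one Donor node per (name, zip, employer), one DONATED edge per record.
LOAD_CYPHER = """
MERGE (d:Donor {name: $donor.name, zip: $donor.zip, employer: $donor.employer})
  ON CREATE SET d.city = $donor.city, d.state = $donor.state, d.occupation = $donor.occupation
MERGE (c:Committee {committee_id: $committee_id})
CREATE (d)-[:DONATED {amount: $donation.amount, date: $donation.date,
                      transaction_type: $donation.transaction_type,
                      entity_type: $donation.entity_type,
                      transaction_id: $donation.transaction_id}]->(c)
"""


def parse_record(line: str) -> Dict[str, Any]:
    """Split one pipe-delimited record into donor, committee, and donation properties."""
    values = line.rstrip("\r\n").split("|")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    row = {k: v.strip() for k, v in zip(FIELDS, values)}
    raw_date = row["TRANSACTION_DT"]  # MMDDYYYY in the bulk file; may be empty
    date = datetime.strptime(raw_date, "%m%d%Y").date().isoformat() if raw_date else None
    return {
        "committee_id": row["CMTE_ID"],
        "donor": {
            # Merge-key fields stay as (possibly empty) strings, never None.
            "name": row["NAME"].upper(), "zip": row["ZIP_CODE"][:5],
            "employer": row["EMPLOYER"].upper(), "city": row["CITY"].upper(),
            "state": row["STATE"].upper(), "occupation": row["OCCUPATION"].upper(),
        },
        "donation": {
            "amount": float(row["TRANSACTION_AMT"]), "date": date,
            "transaction_type": row["TRANSACTION_TP"],
            "entity_type": row["ENTITY_TP"], "transaction_id": row["TRAN_ID"],
        },
    }
```

Using CREATE rather than MERGE for the edge keeps one edge per transaction, which matches the transactional multiplicity described above. The output of `parse_record` is shaped to be passed directly as the parameter map for `LOAD_CYPHER`.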

Enrichment & indexing: Committee nodes are enriched with party affiliation and filing information via the FEC Committee Master File. We create indexes on donor name, committee ID, and transaction ID to support fast lookup, merge, and deduplication operations at scale.
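The three indexes named above might be declared as follows. This is a sketch under assumptions: the property names (`name`, `committee_id`, `transaction_id`) and the placement of the transaction ID on the `DONATED` relationship are guesses at the schema, and `create_indexes` is a hypothetical helper that takes any callable able to run one Cypher statement.

```python
from typing import Callable, List

# Index statements in Neo4j's CREATE INDEX ... IF NOT EXISTS form (property names assumed).
INDEX_STATEMENTS: List[str] = [
    "CREATE INDEX donor_name IF NOT EXISTS FOR (d:Donor) ON (d.name)",
    "CREATE INDEX committee_id IF NOT EXISTS FOR (c:Committee) ON (c.committee_id)",
    "CREATE INDEX donation_transaction_id IF NOT EXISTS "
    "FOR ()-[r:DONATED]-() ON (r.transaction_id)",
]


def create_indexes(run: Callable[[str], object]) -> int:
    """Send each index statement to `run` (e.g. a driver session's run method); return the count."""
    for statement in INDEX_STATEMENTS:
        run(statement)
    return len(INDEX_STATEMENTS)


if __name__ == "__main__":
    # Dry run: print the statements instead of executing them.
    create_indexes(print)
```

Because each statement uses IF NOT EXISTS, re-running the setup should leave existing indexes alone.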

LLM-based KGQA: When a natural-language question arrives, the system retrieves the graph schema and conversation context and passes both to an LLM (Claude Opus 4.5 via LangChain), which generates a Cypher query. That query executes deterministically against the graph database: the language model only compiles intent into a query; it does not reason about campaign finance or produce probabilistic judgments. Results are reformatted into natural language and rendered in a chat interface exposed via a REST API, making the system accessible to journalists and investigators. The core pipeline is built on LangChain, whose modular chain and agent abstractions make it straightforward to swap LLM providers, inject additional retrieval steps, or extend the system with new tools in future work.
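The question-to-query flow described above can be sketched without committing to any particular framework. In this minimal version the LLM and the database are injected as plain callables, so it does not reproduce the LangChain pipeline; the prompt wording, the `answer` function, and the keyword-based read-only guard are all assumptions added for illustration.

```python
import re
from typing import Any, Callable, Dict, List, Sequence

# Conservative guard: reject any generated query containing a write or procedure keyword.
# It can produce false positives (e.g. a keyword inside a string literal).
_WRITE_KEYWORDS = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV|CALL)\b", re.IGNORECASE
)


def is_read_only(cypher: str) -> bool:
    """Return True if the query contains none of the blocked keywords."""
    return _WRITE_KEYWORDS.search(cypher) is None


def answer(
    question: str,
    schema: str,
    generate_cypher: Callable[[str], str],
    run_query: Callable[[str], List[Dict[str, Any]]],
    history: Sequence[str] = (),
) -> Dict[str, Any]:
    """Build a prompt from schema and context, get Cypher from the LLM, then execute it."""
    context = "\n".join(history)
    prompt = (
        f"Graph schema:\n{schema}\n\n"
        f"Conversation so far:\n{context}\n\n"
        f"Question: {question}\n"
        "Return a single read-only Cypher query and nothing else."
    )
    cypher = generate_cypher(prompt).strip()
    if not is_read_only(cypher):
        raise ValueError("generated query is not read-only")
    # The database, not the model, produces the rows.
    return {"cypher": cypher, "rows": run_query(cypher)}
```

Keeping the model behind a `generate_cypher` callable is one way to make the provider swappable, and returning the generated Cypher alongside the rows lets a reviewer inspect exactly what was executed.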

Experiment & Results

In November 2025, a federal grand jury indicted Rep. Sheila Cherfilus-McCormick on 15 counts including theft of millions in FEMA funds, money laundering conspiracy, and straw donor contributions. The DOJ's case centers on activity from 2021, but if similar patterns persisted into her more recent donors, a graph should surface them. We pointed FEC Graph at her committee's 2025 donations to find out.

Setup: Cherfilus-McCormick (SCM) is from Miramar, Florida, and the indictment describes recruiting friends and family as straw donors. All of the data below is restricted to 2025 FEC filings. The results below are intended to demonstrate FEC Graph's capabilities and are not necessarily a claim of wrongdoing. As such, raw results have been anonymized to protect the privacy of the individuals involved.

Find donors to C00677492 who share the same address.

Surfaced 12 donors across 4 Florida locations, including a Pembroke Pines cluster of 4 donors at a single ZIP code.
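The grouping behind a shared-location result like this one can be illustrated in memory. The sketch assumes that "address" is approximated by the city, state, and 5-digit ZIP carried on donor nodes; the `shared_location_clusters` function is hypothetical and is not the query FEC Graph generated.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple

Location = Tuple[str, str, str]


def shared_location_clusters(
    donors: Iterable[Mapping[str, str]], min_size: int = 2
) -> Dict[Location, List[str]]:
    """Group distinct donor names by (city, state, 5-digit ZIP); keep groups of min_size or more."""
    groups: Dict[Location, set] = defaultdict(set)
    for d in donors:
        key = (d["city"].strip().upper(), d["state"].strip().upper(), d["zip"].strip()[:5])
        # A set collapses repeat donations by the same person.
        groups[key].add(d["name"].strip().upper())
    return {loc: sorted(names) for loc, names in groups.items() if len(names) >= min_size}


if __name__ == "__main__":
    # Made-up records for illustration only.
    demo = [
        {"name": "Donor A", "city": "Springfield", "state": "XX", "zip": "00001"},
        {"name": "Donor B", "city": "springfield", "state": "xx", "zip": "00001-1234"},
        {"name": "Donor C", "city": "Shelbyville", "state": "XX", "zip": "00002"},
    ]
    print(shared_location_clusters(demo))
```

A shared ZIP code is a weak signal on its own, since many unrelated people share one; it is a starting point for review rather than evidence.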

Find all donors to C00677492 from Miramar, Florida.

Prosecutors said straw donors concentrated in Miramar. We surfaced 3 donors, with one of them sharing the exact ZIP code of the family home cited in the indictment. Their surname is phonetically similar to Cherfilus.

Find employers with 2 or more employees who donated to C00677492.

Two employers (Florida Crystals Corporation and Miami Dade County) each had multiple employees donating to SCM's committee.

What committees have donors to C00677492 also donated to? Order from highest crossover to lowest.

188 committees, with one committee having 41 shared donors.
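A crossover ranking of this kind comes down to counting distinct shared donors per other committee. The sketch below shows one possible Cypher formulation next to an in-memory equivalent; the `committee_id` property name, the `CROSSOVER_CYPHER` text, and the `committee_crossover` function are illustrative assumptions rather than the query the system actually produced.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# One possible Cypher form of the question (property names assumed).
CROSSOVER_CYPHER = """
MATCH (d:Donor)-[:DONATED]->(:Committee {committee_id: $target})
MATCH (d)-[:DONATED]->(other:Committee)
WHERE other.committee_id <> $target
RETURN other.committee_id AS committee, count(DISTINCT d) AS shared_donors
ORDER BY shared_donors DESC
"""


def committee_crossover(
    donations: Iterable[Tuple[str, str]], target: str
) -> List[Tuple[str, int]]:
    """Rank other committees by how many distinct donors they share with `target`.

    `donations` is an iterable of (donor, committee_id) pairs, one per transaction.
    """
    donations = list(donations)
    target_donors = {donor for donor, cmte in donations if cmte == target}
    # A set of pairs counts each donor at most once per committee.
    shared_pairs = {
        (donor, cmte)
        for donor, cmte in donations
        if donor in target_donors and cmte != target
    }
    return Counter(cmte for _, cmte in shared_pairs).most_common()


if __name__ == "__main__":
    # Made-up transactions for illustration only.
    demo = [("A", "T"), ("B", "T"), ("A", "X"), ("A", "X"), ("B", "X"), ("B", "Y"), ("C", "X")]
    print(committee_crossover(demo, "T"))
```

Counting distinct donors rather than transactions keeps a single prolific donor from inflating a committee's crossover score.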

Discussion

With no prior knowledge of the indictment's specifics, FEC Graph independently surfaced geographic clusters, shared-address groupings, employer overlap patterns, and cross-committee donation patterns consistent with the straw donor conduct described by federal prosecutors. That these signals emerged from a single committee's 2025 filings, years after the alleged scheme began, suggests that structural motifs in contribution data can persist well beyond their originating cycle and remain detectable through graph traversal.