Follow the Money
Free appendix from The Briefing: Tom Emmer (R-MN-06). Download as Markdown
This document is a set of directions for building a database of United States congressional campaign finance data from public federal sources. Every dollar raised, every dollar spent, every donor, every recipient — all of it is public record, published by the Federal Election Commission.
You can give these directions to any AI assistant and it will generate a working implementation in your preferred language. Or you can build it yourself. The data belongs to you. It always did.
What you will build
A relational database containing:
- Individual contributions — every reported donation from a person to a federal committee
- Committee contributions — every reported donation from one committee (PAC, party, etc.) to a candidate
- Transfers — every reported transfer between committees
- Operating expenditures — every reported disbursement by a committee
- Committee master file — every registered federal committee
- Candidate master file — every registered federal candidate
- Candidate-committee linkage — which committees belong to which candidates
Coverage varies by file type: individual contributions (indiv), committee contributions (pas2), and committee/candidate master files go back to the 1980 cycle. Transfers (oth) begin at the 2004 cycle. Operating expenditures (oppexp) begin at the 2014 cycle. Candidate-committee linkage (ccl) begins at the 2000 cycle. All are updated continuously by the FEC.
Where the data lives
All data is published by the Federal Election Commission at:
https://www.fec.gov/data/browse-data/?tab=bulk-data
Download these bulk data files:
| File prefix | Contents | FEC description |
|---|---|---|
cm | Committee master | All registered committees |
cn | Candidate master | All registered candidates |
ccl | Candidate-committee linkage | Committee-to-candidate mapping |
indiv | Individual contributions | Schedule A — contributions from persons |
pas2 | Committee contributions | Schedule A — contributions from committees to candidates |
oth | Transfers | Schedule A — transfers between committees |
oppexp | Operating expenditures | Schedule B — disbursements |
Each file type has one ZIP per 2-year election cycle (e.g., indiv24.zip for the 2024 cycle). Download all cycles you want to cover.
The FEC publishes column header files at: https://www.fec.gov/files/bulk-downloads/ (look for *_header_file.csv)
Files are pipe-delimited (|), no quoting, UTF-8 with occasional encoding errors.
Schema
Design your tables with these columns. Adapt types to your database engine. You may optionally add a surrogate id INTEGER PRIMARY KEY AUTOINCREMENT column to the transaction tables for internal row identification — sub_id remains the FEC-assigned deduplication key regardless.
committees
committee_id TEXT NOT NULL -- e.g., C00545749
cycle INTEGER NOT NULL -- election cycle
committee_name TEXT
treasurer_name TEXT
city TEXT
state TEXT
zip TEXT
designation TEXT -- A/B/D/J/P/U
committee_type TEXT -- H/N/P/Q/S/etc.
party TEXT
org_type TEXT
connected_org TEXT
candidate_id TEXT
PRIMARY KEY (committee_id, cycle)
candidates
candidate_id TEXT NOT NULL -- e.g., H4MN06087
cycle INTEGER NOT NULL -- election cycle
candidate_name TEXT
party TEXT
election_year INTEGER
state TEXT
office TEXT -- H=House, S=Senate, P=President
district TEXT
incumbent_challenger TEXT
status TEXT
principal_committee TEXT -- primary campaign committee ID
PRIMARY KEY (candidate_id, cycle)
candidate_committees
candidate_id TEXT NOT NULL
candidate_election_year INTEGER
fec_election_year INTEGER -- not the same as cycle
committee_id TEXT NOT NULL
committee_type TEXT
committee_designation TEXT
linkage_id INTEGER -- for dedup
PRIMARY KEY (candidate_id, committee_id, fec_election_year)
contributions_individual
cycle INTEGER NOT NULL
committee_id TEXT NOT NULL -- receiving committee
amendment_indicator TEXT -- N=new, A=amendment, T=termination
report_type TEXT -- YE, Q1, M7, 30G, etc.
transaction_type TEXT -- 15, 15E, 24T, etc.
entity_type TEXT -- IND, COM, ORG, etc.
name TEXT
city TEXT
state TEXT
zip5 TEXT
zip4 TEXT
employer TEXT
occupation TEXT
transaction_date TEXT -- normalize to YYYY-MM-DD
amount REAL NOT NULL DEFAULT 0
other_id TEXT -- linked committee (earmarks)
memo_cd TEXT -- X=memo/sub-item
memo_text TEXT
file_num INTEGER -- FEC filing number
sub_id INTEGER NOT NULL UNIQUE -- FEC record ID
contributions_committee
Same schema as contributions_individual, plus:
candidate_id TEXT -- receiving candidate
transfers
Same schema as contributions_individual.
disbursements
The disbursements table has a different structure from the other three transaction tables. The FEC oppexp file has its own column layout.
cycle INTEGER NOT NULL
committee_id TEXT NOT NULL
amendment_indicator TEXT -- N=new, A=amendment, T=termination
report_year INTEGER -- calendar year of report
report_type TEXT -- YE, Q1, M7, 30G, etc.
line_number TEXT
form_type TEXT
schedule_type TEXT -- SB, SE, etc.
name TEXT -- payee name
city TEXT
state TEXT
zip5 TEXT
zip4 TEXT
transaction_date TEXT -- YYYY-MM-DD; source is MM/DD/YYYY
amount REAL NOT NULL DEFAULT 0
transaction_pgi TEXT -- primary/general indicator
purpose TEXT -- description of expenditure
category TEXT
category_description TEXT
memo_cd TEXT -- X=memo/sub-item
memo_text TEXT
entity_type TEXT
file_num INTEGER -- FEC filing number
sub_id INTEGER NOT NULL UNIQUE
Note: disbursements is the only table with report_year. This matters for amendment handling (see below). It also lacks several columns present in the other tables (employer, occupation, other_id, transaction_type), and its source file uses MM/DD/YYYY date format instead of MMDDYYYY.
Ingestion process
Step 1: Create tables
Create all tables without secondary indexes. Indexes slow bulk loading dramatically. You will add them after ingestion.
Include a UNIQUE constraint on sub_id for all four transaction tables. This is your primary deduplication key — the FEC assigns a globally unique identifier to every transaction record.
Step 2: Load committee, candidate, and linkage files
These are small relative to the transaction tables (thousands to tens of thousands of rows per cycle). Load cm, cn, and ccl files first. Use INSERT OR REPLACE keyed on the composite natural key ((committee_id, cycle) for committees, (candidate_id, cycle) for candidates) since the same entity appears across multiple cycle files with potentially different metadata.
Step 3: Load transaction files
For each of the four transaction types (indiv, pas2, oth, oppexp):
- Extract the ZIP file
- Parse the pipe-delimited text file using the FEC header definitions
- Normalize dates to
YYYY-MM-DD. The source format varies:indiv,pas2, andothfiles useMMDDYYYY;oppexp(disbursements) usesMM/DD/YYYY - Normalize ZIP codes: first 5 digits to
zip5, digits 6-9 tozip4 - Insert with
INSERT OR IGNOREonsub_id— duplicates are silently dropped - Commit in batches (10,000-50,000 rows) for performance
CRITICAL — by_date duplication (indiv files only): Starting with the 2012 cycle, FEC indiv ZIP files contain BOTH a complete data file (itcont.txt) AND the same data split into date-range chunks in a by_date/ subdirectory. If you read all files in the ZIP, every record is processed twice. The UNIQUE constraint on sub_id prevents data corruption, but you waste half your processing time. Only read the main file (itcont.txt). Skip any files in by_date/ subdirectories. The other three file types (pas2, oth, oppexp) do not have this problem — their ZIPs contain only the main data file.
Step 4: Create indexes
After all data is loaded, create indexes on frequently queried columns:
committee_id(all four tables)cycle(all four tables)candidate_id(contributions_committee)transaction_date(all four tables)amount(all four tables)state,zip5,name(contributions_individual)other_id(transfers)memo_cd(all four tables)
On a 280-million-row table, each index takes 10-30 minutes. Plan accordingly.
Step 5: Validate
Run these checks after ingestion:
- Every
sub_idshould be unique (enforced by constraint, but verify) - No NULL
sub_idvalues - All dates should be
YYYY-MM-DDformat (10 characters) or NULL - ZIP codes should be 5 digits or NULL
- Row counts per cycle should be stable across runs (same source = same count)
Data quality issues you must handle
This section describes problems that are not obvious from the FEC documentation. If you skip these, your numbers will be wrong.
1. Amendment double-counting
The problem: When a committee amends a financial report, the FEC publishes new records alongside the originals. Both the original (amendment_indicator = 'N') and the amendment (amendment_indicator = 'A') remain in the bulk data. If you sum all records naively, you count every amended transaction twice.
Approximately 37-54% of all records across the four tables are amendments. This is not a minor issue.
Why you cannot simply delete originals: The intuitive approach — delete N records wherever A records exist for the same committee, cycle, and report type — is wrong. Report types like YE (year-end), Q1-Q3 (quarterly), and M2-M12 (monthly) repeat within a 2-year election cycle. A committee files YE twice per cycle: once for each calendar year. The grouping (committee_id, cycle, report_type) collapses two distinct filing periods into one bucket. If year one has an amendment and year two has an original, deleting the original destroys valid data from a different report period.
We confirmed this empirically: in the transfers table, over 98% of groupings containing both N and A records are cross-period collisions, not true amendments.
Why you cannot use file_num ordering: FEC file_num (filing number) is higher for more recent filings. But N records frequently have higher file_nums than A records within the same grouping — because they belong to a later report period, not because they are a later version of the same report. In the transfers table, 50% of mixed N/A groupings have N records with higher file_nums than A records — because those N records belong to a later report period, not because they are later corrections.
The correct approach: Keep all records. Filter at query time:
- For disbursements (which has a
report_yearcolumn), group by(committee_id, report_type, report_year)instead of(committee_id, cycle, report_type). This resolves 98.5% of collisions. - For the other three tables (no
report_year), use one of these strategies:- For a specific candidate's campaign committee: Within each
(committee_id, cycle, report_type), if A records exist, use them; otherwise use N. Small campaign committees rarely have cross-period collisions. - For large conduits (WinRed, ActBlue) or party committees: Check whether N and A records have overlapping or distinct date ranges before deciding. Distinct ranges mean both are valid.
- For maximum accuracy: Use
file_numto identify individual FEC filings within a grouping, and take only the records from the highestfile_numper filing period.
- For a specific candidate's campaign committee: Within each
2. Earmark double-counting
The problem: When a donor gives through a conduit like WinRed or ActBlue, the contribution appears twice in the data:
- Once as a recipient-side record on the candidate's committee (transaction type
15E— “earmarked contribution received”). This is the real contribution. - Once as a conduit-side record on WinRed/ActBlue (transaction type
24T— “earmarked contribution passed through”). This is the conduit's report of the same money.
If you sum both, you double-count every earmarked dollar.
The wrong filter: memo_cd = 'X' marks memo/sub-item records, but it covers only 0.5% of individual contribution records. Meanwhile, 48% of records contain “EARMARKED” in memo_text. The memo_cd field does not reliably identify all earmark duplicates.
The correct filter: Exclude transaction_type = '24T' (conduit pass-throughs) from candidate contribution totals. The 15E records on the candidate's own committee are the ones that count.
3. Memo entries in transfers
80% of records in the transfers table have memo_cd = 'X'. These are itemized sub-transactions that break down a parent transaction. When summing transfer amounts, filter to memo_cd IS NULL or memo_cd != 'X', or you will massively overcount.
4. Negative amounts
Negative amounts are legitimate: refunds, redesignations, reattributions, and corrections. They appear in all four tables (0.9-8.3% of records depending on the table). Include them for net totals. Exclude them (amount > 0) only for gross inflow analysis.
5. Amendment-only cycles
Some committees have cycles where only A records exist and no N records. For example, a candidate's first election cycle may consist entirely of amendment records if every original filing was subsequently amended. Any query that filters to amendment_indicator = 'N' will silently return zero for these cycles. Always use the amendment-preference approach described above rather than filtering to N-only.
Example: checking your work
After building the database, verify your understanding by querying a specific representative. Pick any member of Congress. Look up their campaign committee ID at fec.gov.
Here is an example using Representative Tom Emmer (R-MN-06), campaign committee C00545749:
-- Individual contributions to Emmer's campaign, by cycle
-- Excludes: pass-throughs (24T), memos (X), negatives
SELECT
cycle,
COUNT(*) AS contributions,
ROUND(SUM(amount), 2) AS total
FROM contributions_individual
WHERE committee_id = 'C00545749'
AND transaction_type != '24T'
AND (memo_cd IS NULL OR memo_cd != 'X')
AND amount > 0
GROUP BY cycle
ORDER BY cycle;
You should see contributions starting in the 2014 cycle (Emmer's first election) and growing over time, with a notable increase starting in 2022 as WinRed adoption drives up small-dollar transaction counts.
To see the WinRed effect:
-- WinRed vs direct contributions
SELECT
cycle,
SUM(CASE WHEN memo_text LIKE '%WINRED%'
THEN amount ELSE 0 END) AS via_winred,
SUM(CASE WHEN memo_text NOT LIKE '%WINRED%'
THEN amount ELSE 0 END) AS direct
FROM contributions_individual
WHERE committee_id = 'C00545749'
AND transaction_type != '24T'
AND (memo_cd IS NULL OR memo_cd != 'X')
AND amount > 0
GROUP BY cycle
ORDER BY cycle;
For Emmer, WinRed contributions should be zero through the 2020 cycle (WinRed launched mid-2019 but Emmer's committee didn't adopt it immediately), first appearing in the 2022 cycle and growing to nearly half of all individual contributions by 2024 (approximately 47% by dollar amount).
Now pick your own representative. Run the same queries. See who funds them.
Other public data sources
Campaign finance is one layer. The full picture of congressional behavior requires additional data. All of these are free and public.
Voting records
VoteView — https://voteview.com/data
Roll call votes for every member of Congress, every vote, back to the 1st Congress (1789). Published by UCLA. Download the CSV files (named HSall_members.csv, HSall_votes.csv, etc., or per-congress like H119_members.csv). This lets you calculate party loyalty scores, find which votes a member broke ranks on, and compare voting patterns across members.
Legislative activity
BILLSTATUS (Congress.gov bulk data) — https://github.com/unitedstates/congress
XML files for every bill introduced since the 108th Congress (2003). Includes sponsors, cosponsors, committee assignments, actions, and roll call linkages. Published by the Library of Congress. This lets you see what a member sponsors, how far their bills get, and what policy areas they focus on.
Lobbying
Senate Lobbying Disclosure — https://lda.senate.gov/filings/public/filing/search/
Every registered lobbyist, their clients, and how much they spend. Quarterly filings. Cross-reference with campaign contributions to see the relationship between lobbying expenditure and political donations to specific members.
Federal spending
USAspending.gov — https://www.usaspending.gov/download_center/award_data_archive
Every federal contract, grant, loan, and direct payment. Searchable by congressional district. This lets you see what federal money flows into a district and whether the representative voted for or against the programs that fund it.
Economic and demographic data
Census Bureau American Community Survey — https://data.census.gov
District-level demographics: income, employment, education, insurance coverage, veteran status, industry composition. Use table codes starting with DP (data profiles) or S (subject tables). Geography filter: congressional districts (116th-119th).
Bureau of Labor Statistics — https://www.bls.gov/data/
Employment, wages, CPI, industry data. National and state level.
FRED (Federal Reserve Economic Data) — https://fred.stlouisfed.org
Thousands of economic time series. Useful for contextualizing policy votes against economic conditions.
Health care
KFF (Kaiser Family Foundation) — https://www.kff.org/statedata/
State and congressional district health data: insurance coverage, Medicaid enrollment, marketplace premiums. Cross-reference with health care votes.
Congressional metadata
Congress.gov API — https://api.congress.gov
The official legislative data API from the Library of Congress. Members, bills, amendments, committee reports, nominations, treaties. Free API key required.
GovTrack — https://www.govtrack.us/developers
Legislative tracking with additional analysis layers. Ideology scores, leadership metrics, bill prognosis. Bulk data downloads available.
What to do with it
Pick a representative — yours. Download the data. Build the database. Run the queries.
See where their money comes from. See how they vote. See whether those two things are related.
Then do it for the next one. And the next one. Both parties. Every district.
The data is public. The pattern is there. You just have to look.