← Helyn Research

Follow the Money

Free appendix from The Briefing: Tom Emmer (R-MN-06). Download as Markdown

This document is a set of directions for building a database of United States congressional campaign finance data from public federal sources. Every dollar raised, every dollar spent, every donor, every recipient — all of it is public record, published by the Federal Election Commission.

You can give these directions to any AI assistant and it will generate a working implementation in your preferred language. Or you can build it yourself. The data belongs to you. It always did.


What you will build

A relational database containing:

Coverage varies by file type: individual contributions (indiv), committee contributions (pas2), and committee/candidate master files go back to the 1980 cycle. Transfers (oth) begin at the 2004 cycle. Operating expenditures (oppexp) begin at the 2014 cycle. Candidate-committee linkage (ccl) begins at the 2000 cycle. All are updated continuously by the FEC.


Where the data lives

All data is published by the Federal Election Commission at:

https://www.fec.gov/data/browse-data/?tab=bulk-data

Download these bulk data files:

File prefixContentsFEC description
cmCommittee masterAll registered committees
cnCandidate masterAll registered candidates
cclCandidate-committee linkageCommittee-to-candidate mapping
indivIndividual contributionsSchedule A — contributions from persons
pas2Committee contributionsSchedule A — contributions from committees to candidates
othTransfersSchedule A — transfers between committees
oppexpOperating expendituresSchedule B — disbursements

Each file type has one ZIP per 2-year election cycle (e.g., indiv24.zip for the 2024 cycle). Download all cycles you want to cover.

The FEC publishes column header files at: https://www.fec.gov/files/bulk-downloads/ (look for *_header_file.csv)

Files are pipe-delimited (|), no quoting, UTF-8 with occasional encoding errors.


Schema

Design your tables with these columns. Adapt types to your database engine. You may optionally add a surrogate id INTEGER PRIMARY KEY AUTOINCREMENT column to the transaction tables for internal row identification — sub_id remains the FEC-assigned deduplication key regardless.

committees

committee_id        TEXT NOT NULL  -- e.g., C00545749
cycle               INTEGER NOT NULL  -- election cycle
committee_name      TEXT
treasurer_name      TEXT
city                TEXT
state               TEXT
zip                 TEXT
designation         TEXT  -- A/B/D/J/P/U
committee_type      TEXT  -- H/N/P/Q/S/etc.
party               TEXT
org_type            TEXT
connected_org       TEXT
candidate_id        TEXT
PRIMARY KEY (committee_id, cycle)

candidates

candidate_id        TEXT NOT NULL  -- e.g., H4MN06087
cycle               INTEGER NOT NULL  -- election cycle
candidate_name      TEXT
party               TEXT
election_year       INTEGER
state               TEXT
office              TEXT  -- H=House, S=Senate, P=President
district            TEXT
incumbent_challenger TEXT
status              TEXT
principal_committee TEXT  -- primary campaign committee ID
PRIMARY KEY (candidate_id, cycle)

candidate_committees

candidate_id            TEXT NOT NULL
candidate_election_year INTEGER
fec_election_year       INTEGER  -- not the same as cycle
committee_id            TEXT NOT NULL
committee_type          TEXT
committee_designation   TEXT
linkage_id              INTEGER  -- for dedup
PRIMARY KEY (candidate_id, committee_id, fec_election_year)

contributions_individual

cycle               INTEGER NOT NULL
committee_id        TEXT NOT NULL  -- receiving committee
amendment_indicator TEXT  -- N=new, A=amendment, T=termination
report_type         TEXT  -- YE, Q1, M7, 30G, etc.
transaction_type    TEXT  -- 15, 15E, 24T, etc.
entity_type         TEXT  -- IND, COM, ORG, etc.
name                TEXT
city                TEXT
state               TEXT
zip5                TEXT
zip4                TEXT
employer            TEXT
occupation          TEXT
transaction_date    TEXT  -- normalize to YYYY-MM-DD
amount              REAL NOT NULL DEFAULT 0
other_id            TEXT  -- linked committee (earmarks)
memo_cd             TEXT  -- X=memo/sub-item
memo_text           TEXT
file_num            INTEGER  -- FEC filing number
sub_id              INTEGER NOT NULL UNIQUE  -- FEC record ID

contributions_committee

Same schema as contributions_individual, plus:

candidate_id        TEXT               -- receiving candidate

transfers

Same schema as contributions_individual.

disbursements

The disbursements table has a different structure from the other three transaction tables. The FEC oppexp file has its own column layout.

cycle               INTEGER NOT NULL
committee_id        TEXT NOT NULL
amendment_indicator TEXT  -- N=new, A=amendment, T=termination
report_year         INTEGER  -- calendar year of report
report_type         TEXT  -- YE, Q1, M7, 30G, etc.
line_number         TEXT
form_type           TEXT
schedule_type       TEXT  -- SB, SE, etc.
name                TEXT  -- payee name
city                TEXT
state               TEXT
zip5                TEXT
zip4                TEXT
transaction_date    TEXT  -- YYYY-MM-DD; source is MM/DD/YYYY
amount              REAL NOT NULL DEFAULT 0
transaction_pgi     TEXT  -- primary/general indicator
purpose             TEXT  -- description of expenditure
category            TEXT
category_description TEXT
memo_cd             TEXT  -- X=memo/sub-item
memo_text           TEXT
entity_type         TEXT
file_num            INTEGER  -- FEC filing number
sub_id              INTEGER NOT NULL UNIQUE

Note: disbursements is the only table with report_year. This matters for amendment handling (see below). It also lacks several columns present in the other tables (employer, occupation, other_id, transaction_type), and its source file uses MM/DD/YYYY date format instead of MMDDYYYY.


Ingestion process

Step 1: Create tables

Create all tables without secondary indexes. Indexes slow bulk loading dramatically. You will add them after ingestion.

Include a UNIQUE constraint on sub_id for all four transaction tables. This is your primary deduplication key — the FEC assigns a globally unique identifier to every transaction record.

Step 2: Load committee, candidate, and linkage files

These are small relative to the transaction tables (thousands to tens of thousands of rows per cycle). Load cm, cn, and ccl files first. Use INSERT OR REPLACE keyed on the composite natural key ((committee_id, cycle) for committees, (candidate_id, cycle) for candidates) since the same entity appears across multiple cycle files with potentially different metadata.

Step 3: Load transaction files

For each of the four transaction types (indiv, pas2, oth, oppexp):

  1. Extract the ZIP file
  2. Parse the pipe-delimited text file using the FEC header definitions
  3. Normalize dates to YYYY-MM-DD. The source format varies: indiv, pas2, and oth files use MMDDYYYY; oppexp (disbursements) uses MM/DD/YYYY
  4. Normalize ZIP codes: first 5 digits to zip5, digits 6-9 to zip4
  5. Insert with INSERT OR IGNORE on sub_id — duplicates are silently dropped
  6. Commit in batches (10,000-50,000 rows) for performance

CRITICAL — by_date duplication (indiv files only): Starting with the 2012 cycle, FEC indiv ZIP files contain BOTH a complete data file (itcont.txt) AND the same data split into date-range chunks in a by_date/ subdirectory. If you read all files in the ZIP, every record is processed twice. The UNIQUE constraint on sub_id prevents data corruption, but you waste half your processing time. Only read the main file (itcont.txt). Skip any files in by_date/ subdirectories. The other three file types (pas2, oth, oppexp) do not have this problem — their ZIPs contain only the main data file.

Step 4: Create indexes

After all data is loaded, create indexes on frequently queried columns:

On a 280-million-row table, each index takes 10-30 minutes. Plan accordingly.

Step 5: Validate

Run these checks after ingestion:


Data quality issues you must handle

This section describes problems that are not obvious from the FEC documentation. If you skip these, your numbers will be wrong.

1. Amendment double-counting

The problem: When a committee amends a financial report, the FEC publishes new records alongside the originals. Both the original (amendment_indicator = 'N') and the amendment (amendment_indicator = 'A') remain in the bulk data. If you sum all records naively, you count every amended transaction twice.

Approximately 37-54% of all records across the four tables are amendments. This is not a minor issue.

Why you cannot simply delete originals: The intuitive approach — delete N records wherever A records exist for the same committee, cycle, and report type — is wrong. Report types like YE (year-end), Q1-Q3 (quarterly), and M2-M12 (monthly) repeat within a 2-year election cycle. A committee files YE twice per cycle: once for each calendar year. The grouping (committee_id, cycle, report_type) collapses two distinct filing periods into one bucket. If year one has an amendment and year two has an original, deleting the original destroys valid data from a different report period.

We confirmed this empirically: in the transfers table, over 98% of groupings containing both N and A records are cross-period collisions, not true amendments.

Why you cannot use file_num ordering: FEC file_num (filing number) is higher for more recent filings. But N records frequently have higher file_nums than A records within the same grouping — because they belong to a later report period, not because they are a later version of the same report. In the transfers table, 50% of mixed N/A groupings have N records with higher file_nums than A records — because those N records belong to a later report period, not because they are later corrections.

The correct approach: Keep all records. Filter at query time:

2. Earmark double-counting

The problem: When a donor gives through a conduit like WinRed or ActBlue, the contribution appears twice in the data:

If you sum both, you double-count every earmarked dollar.

The wrong filter: memo_cd = 'X' marks memo/sub-item records, but it covers only 0.5% of individual contribution records. Meanwhile, 48% of records contain “EARMARKED” in memo_text. The memo_cd field does not reliably identify all earmark duplicates.

The correct filter: Exclude transaction_type = '24T' (conduit pass-throughs) from candidate contribution totals. The 15E records on the candidate's own committee are the ones that count.

3. Memo entries in transfers

80% of records in the transfers table have memo_cd = 'X'. These are itemized sub-transactions that break down a parent transaction. When summing transfer amounts, filter to memo_cd IS NULL or memo_cd != 'X', or you will massively overcount.

4. Negative amounts

Negative amounts are legitimate: refunds, redesignations, reattributions, and corrections. They appear in all four tables (0.9-8.3% of records depending on the table). Include them for net totals. Exclude them (amount > 0) only for gross inflow analysis.

5. Amendment-only cycles

Some committees have cycles where only A records exist and no N records. For example, a candidate's first election cycle may consist entirely of amendment records if every original filing was subsequently amended. Any query that filters to amendment_indicator = 'N' will silently return zero for these cycles. Always use the amendment-preference approach described above rather than filtering to N-only.


Example: checking your work

After building the database, verify your understanding by querying a specific representative. Pick any member of Congress. Look up their campaign committee ID at fec.gov.

Here is an example using Representative Tom Emmer (R-MN-06), campaign committee C00545749:

-- Individual contributions to Emmer's campaign, by cycle
-- Excludes: pass-throughs (24T), memos (X), negatives
SELECT
    cycle,
    COUNT(*) AS contributions,
    ROUND(SUM(amount), 2) AS total
FROM contributions_individual
WHERE committee_id = 'C00545749'
  AND transaction_type != '24T'
  AND (memo_cd IS NULL OR memo_cd != 'X')
  AND amount > 0
GROUP BY cycle
ORDER BY cycle;

You should see contributions starting in the 2014 cycle (Emmer's first election) and growing over time, with a notable increase starting in 2022 as WinRed adoption drives up small-dollar transaction counts.

To see the WinRed effect:

-- WinRed vs direct contributions
SELECT
    cycle,
    SUM(CASE WHEN memo_text LIKE '%WINRED%'
        THEN amount ELSE 0 END) AS via_winred,
    SUM(CASE WHEN memo_text NOT LIKE '%WINRED%'
        THEN amount ELSE 0 END) AS direct
FROM contributions_individual
WHERE committee_id = 'C00545749'
  AND transaction_type != '24T'
  AND (memo_cd IS NULL OR memo_cd != 'X')
  AND amount > 0
GROUP BY cycle
ORDER BY cycle;

For Emmer, WinRed contributions should be zero through the 2020 cycle (WinRed launched mid-2019 but Emmer's committee didn't adopt it immediately), first appearing in the 2022 cycle and growing to nearly half of all individual contributions by 2024 (approximately 47% by dollar amount).

Now pick your own representative. Run the same queries. See who funds them.


Other public data sources

Campaign finance is one layer. The full picture of congressional behavior requires additional data. All of these are free and public.

Voting records

VoteViewhttps://voteview.com/data

Roll call votes for every member of Congress, every vote, back to the 1st Congress (1789). Published by UCLA. Download the CSV files (named HSall_members.csv, HSall_votes.csv, etc., or per-congress like H119_members.csv). This lets you calculate party loyalty scores, find which votes a member broke ranks on, and compare voting patterns across members.

Legislative activity

BILLSTATUS (Congress.gov bulk data)https://github.com/unitedstates/congress

XML files for every bill introduced since the 108th Congress (2003). Includes sponsors, cosponsors, committee assignments, actions, and roll call linkages. Published by the Library of Congress. This lets you see what a member sponsors, how far their bills get, and what policy areas they focus on.

Lobbying

Senate Lobbying Disclosurehttps://lda.senate.gov/filings/public/filing/search/

Every registered lobbyist, their clients, and how much they spend. Quarterly filings. Cross-reference with campaign contributions to see the relationship between lobbying expenditure and political donations to specific members.

Federal spending

USAspending.govhttps://www.usaspending.gov/download_center/award_data_archive

Every federal contract, grant, loan, and direct payment. Searchable by congressional district. This lets you see what federal money flows into a district and whether the representative voted for or against the programs that fund it.

Economic and demographic data

Census Bureau American Community Surveyhttps://data.census.gov

District-level demographics: income, employment, education, insurance coverage, veteran status, industry composition. Use table codes starting with DP (data profiles) or S (subject tables). Geography filter: congressional districts (116th-119th).

Bureau of Labor Statisticshttps://www.bls.gov/data/

Employment, wages, CPI, industry data. National and state level.

FRED (Federal Reserve Economic Data)https://fred.stlouisfed.org

Thousands of economic time series. Useful for contextualizing policy votes against economic conditions.

Health care

KFF (Kaiser Family Foundation)https://www.kff.org/statedata/

State and congressional district health data: insurance coverage, Medicaid enrollment, marketplace premiums. Cross-reference with health care votes.

Congressional metadata

Congress.gov APIhttps://api.congress.gov

The official legislative data API from the Library of Congress. Members, bills, amendments, committee reports, nominations, treaties. Free API key required.

GovTrackhttps://www.govtrack.us/developers

Legislative tracking with additional analysis layers. Ideology scores, leadership metrics, bill prognosis. Bulk data downloads available.


What to do with it

Pick a representative — yours. Download the data. Build the database. Run the queries.

See where their money comes from. See how they vote. See whether those two things are related.

Then do it for the next one. And the next one. Both parties. Every district.

The data is public. The pattern is there. You just have to look.


Download this appendix as Markdown