# Follow the Money

*This appendix is available for free download at [helyn.com/follow-the-money](https://helyn.com/follow-the-money)*

This document is a set of directions for building a database of United States congressional campaign finance data from public federal sources. Every dollar raised, every dollar spent, every donor, every recipient — all of it is public record, published by the Federal Election Commission.

You can give these directions to any AI assistant and it will generate a working implementation in your preferred language. Or you can build it yourself. The data belongs to you. It always did.

---

## What you will build

A relational database containing:
- **Individual contributions** — every reported donation from a person to a federal committee
- **Committee contributions** — every reported donation from one committee (PAC, party, etc.) to a candidate
- **Transfers** — every reported transfer between committees
- **Operating expenditures** — every reported disbursement by a committee
- **Committee master file** — every registered federal committee
- **Candidate master file** — every registered federal candidate
- **Candidate-committee linkage** — which committees belong to which candidates

Coverage varies by file type: individual contributions (`indiv`), committee contributions (`pas2`), and committee/candidate master files go back to the 1980 cycle. Transfers (`oth`) begin at the 2004 cycle. Operating expenditures (`oppexp`) begin at the 2014 cycle. Candidate-committee linkage (`ccl`) begins at the 2000 cycle. All are updated continuously by the FEC.

---

## Where the data lives

All data is published by the Federal Election Commission at:

**https://www.fec.gov/data/browse-data/?tab=bulk-data**

Download these bulk data files:

| File prefix | Contents | FEC description |
|---|---|---|
| `cm` | Committee master | All registered committees |
| `cn` | Candidate master | All registered candidates |
| `ccl` | Candidate-committee linkage | Committee-to-candidate mapping |
| `indiv` | Individual contributions | Schedule A — contributions from persons |
| `pas2` | Committee contributions | Schedule A — contributions from committees to candidates |
| `oth` | Transfers | Schedule A — transfers between committees |
| `oppexp` | Operating expenditures | Schedule B — disbursements |

Each file type has one ZIP per 2-year election cycle (e.g., `indiv24.zip` for the 2024 cycle). Download all cycles you want to cover.

The FEC publishes column header files at:
**https://www.fec.gov/files/bulk-downloads/** (look for `*_header_file.csv`)

Files are pipe-delimited (`|`), no quoting, UTF-8 with occasional encoding errors.

---

## Schema

Design your tables with these columns. Adapt types to your database engine. You may optionally add a surrogate `id INTEGER PRIMARY KEY AUTOINCREMENT` column to the transaction tables for internal row identification — `sub_id` remains the FEC-assigned deduplication key regardless.

### committees
```
committee_id        TEXT NOT NULL  -- e.g., C00545749
cycle               INTEGER NOT NULL  -- election cycle
committee_name      TEXT
treasurer_name      TEXT
city                TEXT
state               TEXT
zip                 TEXT
designation         TEXT  -- A/B/D/J/P/U
committee_type      TEXT  -- H/N/P/Q/S/etc.
party               TEXT
org_type            TEXT
connected_org       TEXT
candidate_id        TEXT
PRIMARY KEY (committee_id, cycle)
```

### candidates
```
candidate_id        TEXT NOT NULL  -- e.g., H4MN06087
cycle               INTEGER NOT NULL  -- election cycle
candidate_name      TEXT
party               TEXT
election_year       INTEGER
state               TEXT
office              TEXT  -- H=House, S=Senate, P=President
district            TEXT
incumbent_challenger TEXT
status              TEXT
principal_committee TEXT  -- primary campaign committee ID
PRIMARY KEY (candidate_id, cycle)
```

### candidate_committees
```
candidate_id            TEXT NOT NULL
candidate_election_year INTEGER
fec_election_year       INTEGER  -- not the same as cycle
committee_id            TEXT NOT NULL
committee_type          TEXT
committee_designation   TEXT
linkage_id              INTEGER  -- for dedup
PRIMARY KEY (candidate_id, committee_id, fec_election_year)
```

### contributions_individual
```
cycle               INTEGER NOT NULL
committee_id        TEXT NOT NULL  -- receiving committee
amendment_indicator TEXT  -- N=new, A=amendment, T=termination
report_type         TEXT  -- YE, Q1, M7, 30G, etc.
transaction_type    TEXT  -- 15, 15E, 24T, etc.
entity_type         TEXT  -- IND, COM, ORG, etc.
name                TEXT
city                TEXT
state               TEXT
zip5                TEXT
zip4                TEXT
employer            TEXT
occupation          TEXT
transaction_date    TEXT  -- normalize to YYYY-MM-DD
amount              REAL NOT NULL DEFAULT 0
other_id            TEXT  -- linked committee (earmarks)
memo_cd             TEXT  -- X=memo/sub-item
memo_text           TEXT
file_num            INTEGER  -- FEC filing number
sub_id              INTEGER NOT NULL UNIQUE  -- FEC record ID
```

### contributions_committee
Same schema as contributions_individual, plus:
```
candidate_id        TEXT               -- receiving candidate
```

### transfers
Same schema as contributions_individual.

### disbursements
The disbursements table has a **different structure** from the other three transaction tables. The FEC `oppexp` file has its own column layout.
```
cycle               INTEGER NOT NULL
committee_id        TEXT NOT NULL
amendment_indicator TEXT  -- N=new, A=amendment, T=termination
report_year         INTEGER  -- calendar year of report
report_type         TEXT  -- YE, Q1, M7, 30G, etc.
line_number         TEXT
form_type           TEXT
schedule_type       TEXT  -- SB, SE, etc.
name                TEXT  -- payee name
city                TEXT
state               TEXT
zip5                TEXT
zip4                TEXT
transaction_date    TEXT  -- YYYY-MM-DD; source is MM/DD/YYYY
amount              REAL NOT NULL DEFAULT 0
transaction_pgi     TEXT  -- primary/general indicator
purpose             TEXT  -- description of expenditure
category            TEXT
category_description TEXT
memo_cd             TEXT  -- X=memo/sub-item
memo_text           TEXT
entity_type         TEXT
file_num            INTEGER  -- FEC filing number
sub_id              INTEGER NOT NULL UNIQUE
```

**Note:** `disbursements` is the only table with `report_year`. This matters for amendment handling (see below). It also lacks several columns present in the other tables (`employer`, `occupation`, `other_id`, `transaction_type`), and its source file uses `MM/DD/YYYY` date format instead of `MMDDYYYY`.

---

## Ingestion process

### Step 1: Create tables

Create all tables without secondary indexes. Indexes slow bulk loading dramatically. You will add them after ingestion.

Include a `UNIQUE` constraint on `sub_id` for all four transaction tables. This is your primary deduplication key — the FEC assigns a globally unique identifier to every transaction record.

### Step 2: Load committee, candidate, and linkage files

These are small relative to the transaction tables (thousands to tens of thousands of rows per cycle). Load `cm`, `cn`, and `ccl` files first. Use `INSERT OR REPLACE` keyed on the composite natural key (`(committee_id, cycle)` for committees, `(candidate_id, cycle)` for candidates) since the same entity appears across multiple cycle files with potentially different metadata.

### Step 3: Load transaction files

For each of the four transaction types (`indiv`, `pas2`, `oth`, `oppexp`):
1. Extract the ZIP file
2. Parse the pipe-delimited text file using the FEC header definitions
3. Normalize dates to `YYYY-MM-DD`. The source format varies: `indiv`, `pas2`, and `oth` files use `MMDDYYYY`; `oppexp` (disbursements) uses `MM/DD/YYYY`
4. Normalize ZIP codes: first 5 digits to `zip5`, digits 6-9 to `zip4`
5. Insert with `INSERT OR IGNORE` on `sub_id` — duplicates are silently dropped
6. Commit in batches (10,000-50,000 rows) for performance

**CRITICAL — by_date duplication (indiv files only):** Starting with the 2012 cycle, FEC `indiv` ZIP files contain BOTH a complete data file (`itcont.txt`) AND the same data split into date-range chunks in a `by_date/` subdirectory. If you read all files in the ZIP, every record is processed twice. The `UNIQUE` constraint on `sub_id` prevents data corruption, but you waste half your processing time. **Only read the main file** (`itcont.txt`). Skip any files in `by_date/` subdirectories. The other three file types (`pas2`, `oth`, `oppexp`) do not have this problem — their ZIPs contain only the main data file.

### Step 4: Create indexes

After all data is loaded, create indexes on frequently queried columns:
- `committee_id` (all four tables)
- `cycle` (all four tables)
- `candidate_id` (contributions_committee)
- `transaction_date` (all four tables)
- `amount` (all four tables)
- `state`, `zip5`, `name` (contributions_individual)
- `other_id` (transfers)
- `memo_cd` (all four tables)

On a 280-million-row table, each index takes 10-30 minutes. Plan accordingly.

### Step 5: Validate

Run these checks after ingestion:
- Every `sub_id` should be unique (enforced by constraint, but verify)
- No NULL `sub_id` values
- All dates should be `YYYY-MM-DD` format (10 characters) or NULL
- ZIP codes should be 5 digits or NULL
- Row counts per cycle should be stable across runs (same source = same count)

---

## Data quality issues you must handle

This section describes problems that are not obvious from the FEC documentation. If you skip these, your numbers will be wrong.

### 1. Amendment double-counting

**The problem:** When a committee amends a financial report, the FEC publishes new records alongside the originals. Both the original (`amendment_indicator = 'N'`) and the amendment (`amendment_indicator = 'A'`) remain in the bulk data. If you sum all records naively, you count every amended transaction twice.

Approximately 37-54% of all records across the four tables are amendments. This is not a minor issue.

**Why you cannot simply delete originals:** The intuitive approach — delete N records wherever A records exist for the same committee, cycle, and report type — is wrong. Report types like `YE` (year-end), `Q1`-`Q3` (quarterly), and `M2`-`M12` (monthly) repeat within a 2-year election cycle. A committee files `YE` twice per cycle: once for each calendar year. The grouping `(committee_id, cycle, report_type)` collapses two distinct filing periods into one bucket. If year one has an amendment and year two has an original, deleting the original destroys valid data from a different report period.

We confirmed this empirically: in the `transfers` table, over 98% of groupings containing both N and A records are cross-period collisions, not true amendments.

**Why you cannot use file_num ordering:** FEC `file_num` (filing number) is higher for more recent filings. But N records frequently have *higher* file_nums than A records within the same grouping — because they belong to a later report period, not because they are a later version of the same report. In the transfers table, 50% of mixed N/A groupings have N records with higher file_nums than A records — because those N records belong to a later report period, not because they are later corrections.

**The correct approach:** Keep all records. Filter at query time:
- For **disbursements** (which has a `report_year` column), group by `(committee_id, report_type, report_year)` instead of `(committee_id, cycle, report_type)`. This resolves 98.5% of collisions.
- For the other three tables (no `report_year`), use one of these strategies:
  - **For a specific candidate's campaign committee:** Within each `(committee_id, cycle, report_type)`, if A records exist, use them; otherwise use N. Small campaign committees rarely have cross-period collisions.
  - **For large conduits (WinRed, ActBlue) or party committees:** Check whether N and A records have overlapping or distinct date ranges before deciding. Distinct ranges mean both are valid.
  - **For maximum accuracy:** Use `file_num` to identify individual FEC filings within a grouping, and take only the records from the highest `file_num` per filing period.

### 2. Earmark double-counting

**The problem:** When a donor gives through a conduit like WinRed or ActBlue, the contribution appears twice in the data:
- Once as a **recipient-side record** on the candidate's committee (transaction type `15E` — "earmarked contribution received"). This is the real contribution.
- Once as a **conduit-side record** on WinRed/ActBlue (transaction type `24T` — "earmarked contribution passed through"). This is the conduit's report of the same money.

If you sum both, you double-count every earmarked dollar.

**The wrong filter:** `memo_cd = 'X'` marks memo/sub-item records, but it covers only 0.5% of individual contribution records. Meanwhile, 48% of records contain "EARMARKED" in `memo_text`. The `memo_cd` field does not reliably identify all earmark duplicates.

**The correct filter:** Exclude `transaction_type = '24T'` (conduit pass-throughs) from candidate contribution totals. The `15E` records on the candidate's own committee are the ones that count.

### 3. Memo entries in transfers

80% of records in the `transfers` table have `memo_cd = 'X'`. These are itemized sub-transactions that break down a parent transaction. When summing transfer amounts, filter to `memo_cd IS NULL` or `memo_cd != 'X'`, or you will massively overcount.

### 4. Negative amounts

Negative amounts are legitimate: refunds, redesignations, reattributions, and corrections. They appear in all four tables (0.9-8.3% of records depending on the table). Include them for net totals. Exclude them (`amount > 0`) only for gross inflow analysis.

### 5. Amendment-only cycles

Some committees have cycles where only A records exist and no N records. For example, a candidate's first election cycle may consist entirely of amendment records if every original filing was subsequently amended. Any query that filters to `amendment_indicator = 'N'` will silently return zero for these cycles. Always use the amendment-preference approach described above rather than filtering to N-only.

---

## Example: checking your work

After building the database, verify your understanding by querying a specific representative. Pick any member of Congress. Look up their campaign committee ID at fec.gov.

Here is an example using Representative Tom Emmer (R-MN-06), campaign committee `C00545749`:

```sql
-- Individual contributions to Emmer's campaign, by cycle
-- Excludes: pass-throughs (24T), memos (X), negatives
SELECT
    cycle,
    COUNT(*) AS contributions,
    ROUND(SUM(amount), 2) AS total
FROM contributions_individual
WHERE committee_id = 'C00545749'
  AND transaction_type != '24T'
  AND (memo_cd IS NULL OR memo_cd != 'X')
  AND amount > 0
GROUP BY cycle
ORDER BY cycle;
```

You should see contributions starting in the 2014 cycle (Emmer's first election) and growing over time, with a notable increase starting in 2022 as WinRed adoption drives up small-dollar transaction counts.

To see the WinRed effect:

```sql
-- WinRed vs direct contributions
SELECT
    cycle,
    SUM(CASE WHEN memo_text LIKE '%WINRED%'
        THEN amount ELSE 0 END) AS via_winred,
    SUM(CASE WHEN memo_text NOT LIKE '%WINRED%'
        THEN amount ELSE 0 END) AS direct
FROM contributions_individual
WHERE committee_id = 'C00545749'
  AND transaction_type != '24T'
  AND (memo_cd IS NULL OR memo_cd != 'X')
  AND amount > 0
GROUP BY cycle
ORDER BY cycle;
```

For Emmer, WinRed contributions should be zero through the 2020 cycle (WinRed launched mid-2019 but Emmer's committee didn't adopt it immediately), first appearing in the 2022 cycle and growing to nearly half of all individual contributions by 2024 (approximately 47% by dollar amount).

Now pick your own representative. Run the same queries. See who funds them.

---

## Other public data sources

Campaign finance is one layer. The full picture of congressional behavior requires additional data. All of these are free and public.

### Voting records
**VoteView** — https://voteview.com/data
Roll call votes for every member of Congress, every vote, back to the 1st Congress (1789). Published by UCLA. Download the CSV files (named `HSall_members.csv`, `HSall_votes.csv`, etc., or per-congress like `H119_members.csv`). This lets you calculate party loyalty scores, find which votes a member broke ranks on, and compare voting patterns across members.

### Legislative activity
**BILLSTATUS (Congress.gov bulk data)** — https://github.com/unitedstates/congress
XML files for every bill introduced since the 108th Congress (2003). Includes sponsors, cosponsors, committee assignments, actions, and roll call linkages. Published by the Library of Congress. This lets you see what a member sponsors, how far their bills get, and what policy areas they focus on.

### Lobbying
**Senate Lobbying Disclosure** — https://lda.senate.gov/filings/public/filing/search/
Every registered lobbyist, their clients, and how much they spend. Quarterly filings. Cross-reference with campaign contributions to see the relationship between lobbying expenditure and political donations to specific members.

### Federal spending
**USAspending.gov** — https://www.usaspending.gov/download_center/award_data_archive
Every federal contract, grant, loan, and direct payment. Searchable by congressional district. This lets you see what federal money flows into a district and whether the representative voted for or against the programs that fund it.

### Economic and demographic data
**Census Bureau American Community Survey** — https://data.census.gov
District-level demographics: income, employment, education, insurance coverage, veteran status, industry composition. Use table codes starting with `DP` (data profiles) or `S` (subject tables). Geography filter: congressional districts (116th-119th).

**Bureau of Labor Statistics** — https://www.bls.gov/data/
Employment, wages, CPI, industry data. National and state level.

**FRED (Federal Reserve Economic Data)** — https://fred.stlouisfed.org
Thousands of economic time series. Useful for contextualizing policy votes against economic conditions.

### Health care
**KFF (Kaiser Family Foundation)** — https://www.kff.org/statedata/
State and congressional district health data: insurance coverage, Medicaid enrollment, marketplace premiums. Cross-reference with health care votes.

### Congressional metadata
**Congress.gov API** — https://api.congress.gov
The official legislative data API from the Library of Congress. Members, bills, amendments, committee reports, nominations, treaties. Free API key required.

**GovTrack** — https://www.govtrack.us/developers
Legislative tracking with additional analysis layers. Ideology scores, leadership metrics, bill prognosis. Bulk data downloads available.

---

## What to do with it

Pick a representative — yours. Download the data. Build the database. Run the queries.

See where their money comes from. See how they vote. See whether those two things are related.

Then do it for the next one. And the next one. Both parties. Every district.

The data is public. The pattern is there. You just have to look.

---

*Download this appendix at [helyn.com/follow-the-money](https://helyn.com/follow-the-money)*
