
Pardon the repost, but this is worth sharing. Sorry some links did not xfer.

---------- Forwarded message ----------
From: "Luke Rosiak"
Date: Jan 6, 2014 10:24 AM
Subject: [NICAR-L] Announcing, free and fully text-searchable Form 990s

My repository of 10 years of 990s, millions of which I have OCR'd and made fully text-searchable, has reached a certain level of maturity, and I think many of you will find it useful as a drop-in replacement for Guidestar and for much more, including an API. You can just search someone's name and find out which nonprofits pay them or list them as a board member, or search the name of a nonprofit to find out what other nonprofits are giving to it.

A lengthier launch announcement follows.

In return for not paying taxes, nonprofits in the U.S. file detailed financial disclosures to the IRS, listing how much of their money goes to certain categories, how much they pay their top people and what groups they give money to. But even though large nonprofits submit structured electronic data, the IRS takes pains to convert it into paper copies and doesn’t make them available publicly at all, instead directing interested parties to request a copy from the organization itself.

Recently, tech pioneer Carl Malamud’s Public.Resource.Org began successfully filing Freedom of Information Act requests for all disclosures--990s, as they are called--and paying the IRS on a monthly basis for reams of DVDs with TIFF images. Some are scanned paper filings; for others, the IRS went out of its way to turn structured data into a mere image. None has an embedded text layer.

The information is invaluable for philanthropists, journalists and competitors--and the universe of nonprofits is enormous, including the major sports leagues, political groups, hospitals and universities and quasi-public institutions.

So I began an enormous OCRing spree using open-source tools and home-built software, and put the results in ElasticSearch and PostgreSQL on a free site. The effort, half the funding for which came from a Sunlight Foundation OpenGov grant of $5,000, is called CitizenAudit.

Three or four years of nonprofit disclosures, out of 7 million PDFs, are now fully and instantly text-searchable, and that number will continue to grow. You can not only pull up 990s by typing in an organization’s name a lot faster (and freer) than on Guidestar, but you can also search across the full text of all the filings.

Because nonprofits often disclose who they give to, but not who they get from, searching an organization’s name turns up the filings of other, seemingly unaffiliated groups--essentially uncovering the previously secret donors to the first organization. You can also type a person’s name to see what boards they’re on and what groups they’re drawing a salary from. And a simple CTRL-F can navigate you to the part of the document you’re interested in, as opposed to reading through dozens or even hundreds of pages.

I’ve also pieced together more usable databases from poorly-documented and obscure IRS files, which are downloadable in SQL dumps, as is an index of the 7 million disclosures.

And there’s an API where you can pass an organization’s IRS-assigned ID for some structured data and extracted text -- it’s as simple as requesting a URL that ends in [EIN]/[API KEY].
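As a rough illustration of the [EIN]/[API KEY] pattern described above, here is a minimal Python sketch that assembles such a URL. The base URL is a placeholder (the real address did not survive the forward), and the function names, along with the assumption that the EIN is passed as nine bare digits, are hypothetical rather than taken from the site's documentation.

```python
import re
from urllib.parse import quote


def normalize_ein(ein):
    """Strip punctuation from an EIN such as '52-1234567' and check its length.

    Assumption: the API wants the nine digits without the hyphen.
    """
    digits = re.sub(r"\D", "", ein)
    if len(digits) != 9:
        raise ValueError("an EIN has exactly nine digits, got %r" % ein)
    return digits


def build_api_url(base_url, ein, api_key):
    """Join a (hypothetical) base URL with the EIN and API key path segments."""
    return "{}/{}/{}".format(
        base_url.rstrip("/"),
        normalize_ein(ein),
        quote(api_key, safe=""),  # escape anything unsafe in a URL path
    )


if __name__ == "__main__":
    # example.invalid is a stand-in, not the real endpoint.
    print(build_api_url("https://example.invalid/api", "52-1234567", "MY-KEY"))
```

The resulting URL could then be fetched with any HTTP client; what the response looks like depends on the actual API.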

The next stage of this project is to use regular expressions to extract structured data, where possible, from the text. (A more ambitious goal is to use the hOCR files, which give the bounding boxes of words, to deal with cases where we need to know exactly where the text was in a complicated page layout.) If you’re interested in either and have some familiarity with programming/regular expressions, please contact me.
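For anyone wondering what those two approaches might look like, here is a small Python sketch. It is not the project's code: the regular expressions and the "Total revenue" label are illustrative guesses at what a filing's OCR text contains, and the hOCR part assumes the usual tesseract output, in which each word is a span with class ocrx_word and a title attribute holding "bbox x0 y0 x1 y1".

```python
import re
from html.parser import HTMLParser

# Illustrative patterns only; real filings will need many more variants.
EIN_RE = re.compile(r"\b(\d{2})-?(\d{7})\b")
REVENUE_RE = re.compile(r"Total revenue\D{0,40}?(\d[\d,]*)", re.IGNORECASE)
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def extract_fields(text):
    """Pull a few structured values out of plain OCR text."""
    fields = {}
    ein = EIN_RE.search(text)
    if ein:
        fields["ein"] = ein.group(1) + ein.group(2)
    revenue = REVENUE_RE.search(text)
    if revenue:
        fields["total_revenue"] = int(revenue.group(1).replace(",", ""))
    return fields


class HocrWords(HTMLParser):
    """Collect (word, (x0, y0, x1, y1)) pairs from hOCR markup."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word span we are inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or ""):
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(int(n) for n in match.groups()) if match else None

    def handle_data(self, data):
        if self._bbox and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


def hocr_words(hocr):
    """Parse an hOCR string and return its words with bounding boxes."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

With the bounding boxes in hand, one could, for example, keep only the words whose boxes fall inside the region of the page where a given form field is printed.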

It began at a civic hackathon in Washington, D.C., where Amazon had offered each participant $100 in AWS credits. I built a system where people could run a command against their credits that would spin up an EC2 instance, set up tesseract (an open-source OCR library), connect to an SQS queue and upload the results to S3. We managed to piece together several thousand dollars of free computing time, with up to 1,000 EC2 instances running.
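The worker loop described above can be sketched in a few lines of Python. This is a simplified stand-in, not the original system: an in-memory queue.Queue takes the place of SQS and a plain dict takes the place of S3, and the function names are made up for illustration. The only external piece is the tesseract command-line tool, called here as "tesseract IMAGE stdout", which prints the recognized text.

```python
import queue
import subprocess


def ocr_with_tesseract(image_path):
    """Run the tesseract CLI on one image and return the recognized text.

    Requires tesseract to be installed and on the PATH.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def run_worker(jobs, results, ocr=ocr_with_tesseract):
    """Drain the job queue, OCR each path, and store the text in results.

    jobs stands in for the SQS queue and results for the S3 bucket.
    Returns the number of documents processed successfully.
    """
    done = 0
    while True:
        try:
            path = jobs.get_nowait()
        except queue.Empty:
            return done
        try:
            results[path] = ocr(path)
            done += 1
        except Exception as exc:  # keep the worker alive on a bad file
            results[path + ".error"] = str(exc)
        finally:
            jobs.task_done()
```

Because every worker only pulls from the queue and writes its own results, any number of machines can run the same loop in parallel, which is what made it possible to pool many people's credits.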

But there are 7 million disclosures, some of them hundreds of pages long, OCRing is extremely computationally expensive, and EC2 instances are quite weak. Worse, it turned out that the S3 costs were atrocious: the site has over three terabytes of data. So I built a top-of-the-line computer that will churn through documents for years, with power the main operating cost. I pieced together a 6-core, 14-terabyte machine with an overclocked, water-cooled Intel 3930K. It uses tesseract, PostgreSQL and a little Redis to manage its workload.

The text output is pushed out to an ElasticSearch server and accessed through a Django site. One challenge is whether the supply of PDFs that CitizenAudit relies on will continue to be reliable: Public.Resource.Org’s ability to continue to obtain them is unclear. Having a free and open repository of all nonprofits’ 990s in PDF form is more important than, and a precondition to, CitizenAudit itself, and funding for continuing to FOIA the forms from the IRS must be secured.
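On the indexing step mentioned above, pushing OCR text to the search server, here is a hedged Python sketch of how a batch of documents might be prepared. It builds the newline-delimited body that Elasticsearch's _bulk endpoint accepts (an action line followed by a source line for each document). The index name "filings" and the field names are hypothetical, and the sketch stops short of sending anything over the network.

```python
import json


def bulk_payload(docs, index="filings"):
    """Build a newline-delimited _bulk request body from a list of dicts.

    Each doc must carry an "id" key, which becomes the document ID;
    the remaining keys are indexed as the document source.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_id": doc["id"]}}
        source = {key: value for key, value in doc.items() if key != "id"}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    if not lines:
        return ""
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [{"id": "521234567-2012", "ein": "521234567", "text": "Total revenue 1,234,567"}]
    print(bulk_payload(sample), end="")
```

Batching documents this way keeps the number of HTTP requests small, which matters when the corpus runs to millions of filings.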

In all, this was a fairly simple process executed at a large scale--at least for a noncommercial side project in the public interest--but also one that endeavored mightily merely to reverse-engineer the devastation the IRS wreaked on the valuable structured data that filers submitted.

If the IRS is going to exempt certain--often enormous--groups from paying taxes, we need to know why. And the IRS needs to stop actively taking steps to make data less useful. This isn’t even a case of a legacy system--they have electronic data and won’t release it.

The IRS needs to put my site out of business and start providing bulk downloads of the structured-data form 990s that groups give it.

Originally posted to dadadata on Thu Jan 09, 2014 at 05:55 PM PST.

Also republished by Maryland Kos.


  •  Tip Jar (10+ / 0-)

    Thump! Bang. Whack-boing. It's dub!

    by dadadata on Thu Jan 09, 2014 at 05:55:10 PM PST

  •  Okay, ddd (1+ / 0-)
    Recommended by:

    'splain what we're looking at and what we can mine from this.

  •  Terrific resource for backgrounding corporations (3+ / 0-)
    Recommended by:
    eyo, dadadata, willyr

    990s are the tax forms corporations and nonprofits are supposed to fill out with income and expenditures and what they pay top honchos. 990s are supposed to list board members and top administrators. In the case of charities, they are supposed to report how much they take in and how much they pay out, especially to the cause for which they raise money. You can obtain quite a bit of information about a corporation or nonprofit from these tax forms, which are public. Luke Rosiak, who established the site, has done us all a big service by taking forms that were previously on paper, OCRing them and putting them on the web in a searchable form.

    What Is IRS Form 990?
    By Kelcey Lehrich, eHow Contributor

    The Internal Revenue Service requires that all corporations in the United States file an income tax return, this includes non-profit corporations. Form 990 is the tax return form that non-profits use to report their charitable receipts for the year. A form 990 is to be used by any 501(c) organization. The IRS also has variations of form 990 such as 990-EZ based on the complexity and size of the charitable organization and their donations for the year.

    further details here

    On a 990,
  •  oooooh I lurrrve looking at 990's (1+ / 0-)
    Recommended by:

    Some are harder to find than others. I recently ran across a local foundation that donates hundreds of thousands of dollars to our local SPN affiliates, Marriage Law Foundation, Ruth Institute, Heritage Foundation, etc.: the organizations fighting marriage equality in Utah.
