A collection of interesting data sets to learn from - 1.7 billion Reddit comments, how America injures itself and many others!

Data Is Plural — Structured Archive

2015.10.21 1 Every place name in the United States. Sometimes, bureaucracy creates poetry. Since 1890, the U.S. Board on Geographic Names has been cataloguing, standardizing, and promulgating official names for the places we hike, swim, work, and call home. Along the way, it began publishing the Geographic Names Information System (GNIS), a searchable and downloadable database containing all of its domestic nomenclature. In Alaska alone, the database lists names for 167 dams, 303 post offices, 666 glaciers, 2,704 capes, and 9,575 streams. My favorite: Confusion Creek. [h/t @emilymbadger]

2015.10.21 2 “There’s finally federal data on low-income college graduation rates—but it’s wrong.” The Hechinger Report casts doubt on the Pell grant graduation numbers contained in the Department of Education’s recently-released College Scorecard. Why the discrepancy? “[W]hile schools are required by law to provide the graduation rates of Pell recipients to any applicants who ask, a loophole protects them from having to report the same figures to the government.” Oof.

2015.10.21 3 What police-related data does your city publish? The Police Open Data Census, created by Code for America fellows in Indianapolis, is tracking “currently available open datasets about police interactions with citizens in the US,” including officer-involved shootings, use of force, and citizen complaints. The census currently covers 36 police departments. Related: The NYPD says it will start tracking all officer use-of-force incidents (not just gunfire) next year, the New York Times reports.

2015.10.21 4 How often do Wikipedia editors edit? The Wikimedia Foundation has published a dataset enumerating monthly revision counts for every editor, across all of its wikis. The foundation is asking for help investigating a few perplexing trends. For example: Why has the number of “very active editors” (those with 100+ edits per month) increased while the number of merely “active” editors has plateaued?
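One way to start on that question is to count editors per month by activity tier. The sketch below is in Python and assumes the dataset has been reduced to hypothetical (editor, month, revisions) rows; the real file layout may differ. The 100-edit threshold comes from the description above, while the 5-edit floor for "active" is an assumption made for illustration.

```python
from collections import Counter

VERY_ACTIVE_MIN = 100  # from the item above: 100+ edits per month
ACTIVE_MIN = 5         # assumption: illustrative floor for "active"


def tier(revisions):
    """Classify one editor-month by its revision count."""
    if revisions >= VERY_ACTIVE_MIN:
        return "very active"
    if revisions >= ACTIVE_MIN:
        return "active"
    return "occasional"


def count_tiers(rows):
    """Count editors per tier for each month.

    rows: iterable of (editor, month, revisions) tuples (hypothetical layout).
    Returns a dict mapping month -> Counter of tier -> number of editors.
    """
    counts = {}
    for _editor, month, revisions in rows:
        counts.setdefault(month, Counter())[tier(revisions)] += 1
    return counts


# Made-up sample rows, not real Wikimedia data.
sample_rows = [
    ("editor_a", "2015-09", 250),
    ("editor_b", "2015-09", 12),
    ("editor_c", "2015-09", 1),
    ("editor_a", "2015-10", 140),
    ("editor_b", "2015-10", 101),
]
tiers_by_month = count_tiers(sample_rows)
```

Plotting the "very active" and "active" counts month by month would show whether the two lines diverge the way the foundation describes.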

2015.10.21 5 Four years of rejected license plates. WNYC, through a freedom-of-information request to the New York DMV, obtained a list of vanity plate approvals and denials from late 2010 to late 2014. Among the denials: “RUBMYDUB,” “S5SS5S5S,” “RFLMAO,” and “CBSNEWS.” (Strangely, “NBC4” was approved. Go figure.) The files and related story were published in August, but the data are timeless. [h/t @veltman]

2015.10.28 1 Data-shaming the robocallers. If you can’t beat ’em, post spreadsheets about ’em. Earlier this month, the Federal Communications Commission started publishing a dataset of complaints against telemarketers and robocalls. The FCC says the file will be updated weekly. It’s already being put to use: A clever programmer has crammed all the offending numbers into a single phone “contact” so that you can block them all at once. [h/t Shale Craig]

2015.10.28 2 The demographics of traffic stops. This weekend, the New York Times published a front-page article on “the disproportionate risk of driving while black.” Among other findings: “officers were more likely to conduct [searches] when the driver was black, even though they consistently found drugs, guns or other contraband more often if the driver was white.” The investigation drew on several statewide traffic-stop datasets that track the race and gender of stopped drivers. The “seven states with the most sweeping reporting requirements,” in order of how easy it seems (to me) to get detailed data: Connecticut, North Carolina, Missouri, Nebraska, Maryland, Illinois, and Rhode Island.

2015.10.28 3 Where do Americans spend their days? Most population numbers tell you where people live. But legions of Americans commute for work across city, county, and state lines. The Census Bureau’s Commuter-Adjusted Daytime Population Data accounts for these daily migrations. Manhattan’s (non-tourist) population doubles from 1.5 million to 3 million, by far the largest influx by raw numbers. But Lake Buena Vista, Fla., takes the percentage-growth prize. The city’s entire resident population could fit in two sedans, but its “daytime population” includes 33,000 workers, including a not-insubstantial number dressed as Mickey Mouse. [h/t Steven Romalewski]
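The commuter adjustment is simple arithmetic. A rough sketch in Python, assuming the daytime estimate is the resident population plus the workers who work in an area minus the workers who live there; the function names are hypothetical and the figures are made-up round numbers, not Census values.

```python
def daytime_population(residents, workers_working_in_area, workers_living_in_area):
    """Commuter-adjusted estimate: add people who work in the area,
    subtract resident workers (who are already counted as residents)."""
    return residents + workers_working_in_area - workers_living_in_area


def percent_change(residents, daytime):
    """Percentage growth from resident population to daytime population."""
    return (daytime - residents) / residents * 100


# Illustrative numbers only: a town of 1,000 residents, 400 of whom work,
# with 900 jobs located inside the town.
example_daytime = daytime_population(1000, 900, 400)
example_change = percent_change(1000, example_daytime)
```

A small resident base with many jobs produces an enormous percentage change, which is why a tiny city can beat Manhattan on growth rate while losing on raw numbers.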

2015.10.28 4 Finally, free access to detailed U.S. import/export data. Prior to October 15th, the Census Bureau’s USA Trade Online tool cost $300/year. No longer. The newly-free dataset covers more than 17,000 commodities, including a category for “magic tricks, practical joke articles; parts and accessories.” [h/t Noah Veltman]

2015.10.28 5 Porn. is on a mission: “to contribute to human sexuality understanding through a Big Data approach.” Last year, the site posted detailed metadata on 800,000 adult videos, including titles, descriptions, view counts, and tags. It powers Porngram, an only-kinda-safe-for-work charting tool.

2015.11.04 1 Maternity leave policies at hundreds of American companies. The 600+ entries in this searchable, sortable database range from 3M to Amazon to Zynga, and list both paid and unpaid leave. The database, run by the women-in-the-workplace website, culls from published policies and employee tips. An introductory blog post provides more information.

2015.11.04 2 MoMA, mo’ data. This July, the Museum of Modern Art published a dataset containing 120,000 artworks from its catalog, joining the UK’s Tate, the Smithsonian’s Cooper Hewitt, and other forward-thinking museums. The MoMA data contains the names of the artwork and artist, the dates created and acquired, and the medium — but no images. Related: Artist Jer Thorp encourages you to “perform” the data. Also related: Every museum in the United States. [h/t Nadja Popovich]

2015.11.04 3 All licensed firearm dealers since 2010. The Bureau of Alcohol, Tobacco, Firearms, and Explosives publishes a searchable and downloadable licensing database. License-holders fall into eleven categories. Among them: run-of-the-mill dealers, ammunition manufacturers, collectors of “curios and relics,” pawnbrokers, and importers of “destructive devices.” The ATF’s website contains monthly and state-by-state archives. [h/t Marc DaCosta] [Correction, 2015-11-04: There are only nine categories of license-holders. The published ATF data includes only eight of them; it does not include "Collector of Curios and Relics." Thanks to @MikeStucka for flagging this mistake.]

2015.11.04 4 One thousand ways to say “dog.” Trans-New Guinea is the world’s third-largest language family. But it’s also among the poorest-studied. An online database launched in 2013 is trying to change that. It now contains more than 1,000 New Guinea languages and lists 145,000 word translations, including 1,065 entries for “dog.” It even has an API. A recent PLOS ONE journal article provides additional background and statistics. [h/t Simon J. Greenhill]

