Eventually, this will be a decent directory of datasets and sources available via the Internet. Here are some starters:
- NYC OpenData: https://opendata.cityofnewyork.us/
- Revolution Analytics' Finding Data on the Internet
- Medicare Provider Utilization and Payment Data: Physician and Other Supplier
- Hubway Dataset
- Not free and not "big", but decent-looking person-based sample data: http://www.briandunning.com/sample-data/
- open.whitehouse.gov
- AWS: http://aws.amazon.com/publicdatasets/
- US gov: http://www.data.gov
- UK gov: http://www.data.gov.uk
- UCI repository: http://archive.ics.uci.edu/ml/
- InfoChimps: http://www.infochimps.com/datasets
- NASA: http://data.nasa.gov
- http://census.gov
- Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets.html
- NYC Taxi Trips: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
- FAA Flight Statistics: http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
The goal is to consume these datasets with Hadoop and other big data tools.
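
As a rough illustration of what consuming one of these datasets with Hadoop might look like, here is a minimal sketch of a Hadoop Streaming style mapper and reducer that count CSV rows per key. Nothing here is specific to any dataset listed above: the `mapper` and `reducer` names, the `key_column` parameter, and the sample rows are all hypothetical placeholders, and the local `sorted()` call only stands in for the sort that Hadoop performs between the map and reduce phases.

```python
import csv
from itertools import groupby


def mapper(lines, key_column=0):
    """Emit 'key<TAB>1' for each CSV row.

    key_column is a hypothetical choice; pick whichever column of the
    downloaded file you want to count by.
    """
    for row in csv.reader(lines):
        if len(row) > key_column and row[key_column].strip():
            yield f"{row[key_column].strip()}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key.

    Input must already be sorted by key, which is how a streaming
    reducer receives it after the shuffle phase.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Made-up placeholder rows, not taken from any real dataset.
    sample = ["A,x\n", "B,y\n", "A,z\n"]
    for line in reducer(sorted(mapper(sample))):
        print(line)
```

In an actual job the two functions would most likely live in separate scripts that read from standard input, passed to Hadoop Streaming through its `-mapper` and `-reducer` options; the single-file layout here just makes the logic easy to try locally first.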