
Where can I find large datasets open to the public?

Answer Wiki

Here are many of the links mentioned so far:
Cross-disciplinary data repositories, data collections and data search engines:
  1. https://www.kaggle.com/datasets
  2. http://usgovxml.com
  3. http://aws.amazon.com/datasets
  4. http://databib.org
  5. http://datacite.org
  6. http://figshare.com
  7. http://linkeddata.org
  8. http://reddit.com/r/datasets
  9. http://thewebminer.com/
  10. http://thedatahub.org alias http://ckan.net
  11. http://quandl.com
  12. Social Network Analysis Interactive Dataset Library (Social Network Datasets)
  13. Datasets for Data Mining
  14. http://enigma.io
  15. http://www.ufindthem.com/
  16. http://NetworkRepository.com - The First Interactive Network Data Repository
  17. http://MLvis.com
Single datasets and data repositories:
  1. http://archive.ics.uci.edu/ml/
  2. http://crawdad.org/
  3. http://data.austintexas.gov
  4. http://data.cityofchicago.org
  5. http://data.govloop.com
  6. http://data.gov.uk/
  7. data.gov.in
  8. http://data.medicare.gov
  9. http://data.seattle.gov
  10. http://data.sfgov.org
  11. http://data.sunlightlabs.com
  12. https://datamarket.azure.com/
  13. http://developer.yahoo.com/geo/g...
  14. http://econ.worldbank.org/datasets
  15. http://en.wikipedia.org/wiki/Wik...
  16. http://factfinder.census.gov/ser...
  17. http://ftp.ncbi.nih.gov/
  18. http://gettingpastgo.socrata.com
  19. http://googleresearch.blogspot.c...
  20. http://books.google.com/ngrams/
  21. http://medihal.archives-ouvertes.fr
  22. http://public.resource.org/
  23. http://rechercheisidore.fr
  24. http://snap.stanford.edu/data/in...
  25. http://timetric.com/public-data/
  26. https://wist.echo.nasa.gov/~wist...
  27. http://www2.jpl.nasa.gov/srtm
  28. http://www.archives.gov/research...
  29. http://www.bls.gov/
  30. http://www.crunchbase.com/
  31. http://www.dartmouthatlas.org/
  32. http://www.data.gov/
  33. http://www.datakc.org
  34. http://dbpedia.org
  35. http://www.delicious.com/jbaldwi...
  36. http://www.faa.gov/data_research/
  37. http://www.factual.com/
  38. http://research.stlouisfed.org/f...
  39. http://www.freebase.com/
  40. http://www.google.com/publicdata...
  41. http://www.guardian.co.uk/news/d...
  42. http://www.infochimps.com
  43. http://www.kaggle.com/
  44. http://build.kiva.org/
  45. http://www.nationalarchives.gov....
  46. http://www.nyc.gov/html/datamine...
  47. http://www.ordnancesurvey.co.uk/...
  48. http://www.philwhln.com/how-to-g...
  49. http://www.imdb.com/interfaces
  50. http://imat-relpred.yandex.ru/en...
  51. http://www.dados.gov.pt/pt/catal...
  52. http://knoema.com
  53. http://daten.berlin.de/
  54. http://www.qunb.com
  55. http://databib.org/
  56. http://datacite.org/
  57. http://data.reegle.info/
  58. http://data.wien.gv.at/
  59. http://data.gov.bc.ca
  60. https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)
  61. http://www.icpsr.umich.edu/icpsrweb/CPES/ - Collaborative Psychiatric Epidemiology Surveys (a collection of three national surveys, one focused on each of the major ethnic groups, studying psychiatric illnesses and health services use)
  62. http://www.dati.gov.it
  63. http://dati.trentino.it
  64. http://www.databagg.com/
  65. http://networkrepository.com - Network/ML data repository w/ visual interactive analytics
  66. UNEP GRID-Geneva home page (United Nations Environment Programme GRID-Geneva; many GIS datasets)
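
Most of the repositories above ultimately serve plain CSV (or CSV-like) files, so a few lines of pandas are usually enough to pull something down and start exploring. Below is a minimal sketch, assuming the classic Iris file is still hosted at its long-standing path in the UCI Machine Learning Repository (item 1 in the list just above); treat the URL and column names as placeholders rather than a definitive recipe.

    # Minimal sketch: read one small CSV-style dataset from the UCI repository
    # into a pandas DataFrame. The URL and column names are assumptions based
    # on the classic Iris file layout; adjust them for whichever dataset you pick.
    import pandas as pd

    IRIS_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

    df = pd.read_csv(IRIS_URL, header=None, names=COLUMNS)
    print(df.shape)                      # (150, 5) if the file is unchanged
    print(df["species"].value_counts())  # roughly 50 rows per species

The same pattern (read_csv pointed at a direct download link) covers a surprising share of the government and open-data portals listed above.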
Bret Taylor, CEO of Quip. Ex-CTO of Facebook, co-founder FriendFeed, co-creator Google Maps.
I did a blog post about open data a long time ago (http://bret.appspot.com/entry/we...), and ReadWriteWeb did a nice roundup based on all the comments from the blog post: http://www.readwriteweb.com/arch....

Since that post, there have been a lot more comments on the blog (105 and counting), so you may want to comb through the comments for any links the RWW post missed.
Alex K. Chen

Some others:


[1] The NLSY79 Geocode data can only be made available to users who have successfully completed a geocode application and signed a confidentiality agreement with the U.S. Bureau of Labor Statistics. If interested in gaining access to the NLSY79 Geocode data, please review the information at http://stats.bls.gov/nls/nlsgeo7....
Jeff Hammerbacher, Professor at Hammer Lab, founder at Cloudera, investor at Techammer
I'll try to restrict my answer to datasets greater than 1 GB in size, and order them by size.
More than 1 TB
  • The 1000 Genomes project makes 260 TB of human genome data available [13]
  • The Internet Archive is making an 80 TB web crawl available for research [17]
  • The TREC conference made the ClueWeb09 [3] dataset available a few years back. You'll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the sneakernet data transfer. The data is about 5 TB compressed.
  • ClueWeb12 [21] is now available, as are the Freebase annotations, FACC1 [22]
  • CNetS at Indiana University makes a 2.5 TB click dataset available [19]
  • ICWSM made a large corpus of blog posts available for their 2011 conference [2]. You'll have to register (an actual form, not an online form), but it's free. It's about 2.1 TB compressed.
  • The Yahoo News Feed dataset is 1.5 TB compressed, 13.5 TB uncompressed
  • The Proteome Commons makes several large datasets available. The largest, the Personal Genome Project [11], is 1.1 TB in size. There are several others over 100 GB in size.
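
With datasets at this scale it pays to list and sample before committing to a transfer. Here is a minimal sketch for browsing the 1000 Genomes mirror on S3; the bucket name (1000genomes) and anonymous, unsigned access are assumptions based on the AWS open-data listing, not details stated above.

    # Minimal sketch: list a few objects in the public 1000 Genomes S3 bucket
    # without downloading anything. Bucket name and anonymous access are
    # assumptions; requires boto3 (pip install boto3).
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    resp = s3.list_objects_v2(Bucket="1000genomes", MaxKeys=10)
    for obj in resp.get("Contents", []):
        print(f"{obj['Size']:>15,}  {obj['Key']}")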
More than 1 GB
  • The Reference Energy Disaggregation Data Set [12] has data on home energy use; it's about 500 GB compressed.
  • The Tiny Images dataset [10] has 227 GB of image data and 57 GB of metadata.
  • The ImageNet dataset [18] is pretty big.
  • The MOBIO dataset [14] is about 135 GB of video and audio data
  • The Yahoo! Webscope program [7] makes several 1 GB+ datasets available to academic researchers, including an 83 GB data set of Flickr image features and the dataset used for the 2011 KDD Cup [9], from Yahoo! Music, which is a bit over 1 GB.
  • Google made a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]. The dataset is about 10 GB compressed.
  • Yandex has recently made a very large web search click dataset available [1]. You'll have to register online for the contest to download. It's about 5.6 GB compressed.
  • Freebase makes regular data dumps available [5]. The largest is their Quad dump [4], which is about 3.6 GB compressed.
  • The Open American National Corpus [8] is about 4.8 GB uncompressed.
  • Wikipedia made a dataset containing information about edits available for a recent Kaggle competition [6]. The training dataset is about 2.0 GB uncompressed.
  • The Research and Innovative Technology Administration (RITA) has made available a dataset about the on-time performance of domestic flights operated by large carriers. The ASA compressed this dataset and makes it available for download [16].
  • The wiki-links data made available by Google is about 1.75 GB total [20].
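
Many of the dumps above arrive as multi-gigabyte gzip files, and it is often enough to stream them line by line rather than decompress them to disk first. A minimal sketch follows; the DUMP_URL is a hypothetical placeholder, not one of the links above.

    # Minimal sketch: stream a gzipped text dump line by line, keeping neither
    # the compressed nor the decompressed data fully in memory or on disk.
    # DUMP_URL is a hypothetical placeholder; substitute a real download link.
    import gzip
    import urllib.request

    DUMP_URL = "https://example.org/path/to/large-dump.tsv.gz"  # placeholder

    with urllib.request.urlopen(DUMP_URL) as resp:
        with gzip.open(resp, mode="rt", encoding="utf-8") as lines:
            for i, line in enumerate(lines):
                if i >= 5:              # peek at the first few records only
                    break
                print(line.rstrip())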
Shimonee Shah, Quorious, Eccentric, Free spirited
Here is a useful link.
Finding Data on the Internet
By RevoJoe, October 6, 2011

The following list of data sources has been modified as of 8/19/13. Most of the data sets listed below are free; however, some are not.

If an (R) appears after a source, the data are already in R format or there are R commands for importing the data directly from within R. (See examples :: intro for some code.) Otherwise, I have limited the list to data sources for which there is a reasonably simple process for importing CSV files. What follows is a list of data sources organized into categories that are not mutually exclusive but which reflect what's out there.

Economics
Finance
Government
Health Care
Machine Learning
Public Domain Collections
Science
Social Sciences
Time Series
Universities
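
As a concrete instance of the "reasonably simple process for importing CSV files" mentioned in that post, here is a minimal sketch against FRED (research.stlouisfed.org, which also appears in the list near the top of this page). The fredgraph.csv endpoint and the UNRATE series id are assumptions to swap for whatever series or portal you actually need.

    # Minimal sketch: fetch one FRED series as CSV and parse it with the
    # standard library. The endpoint and series id (UNRATE) are assumptions.
    import csv
    import io
    import urllib.request

    URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?id=UNRATE"

    with urllib.request.urlopen(URL) as resp:
        text = resp.read().decode("utf-8")

    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = list(reader)
    print(header)           # a date column plus the series column
    print(rows[:3])         # earliest observations
    print(len(rows), "rows")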
Nitin Madnani, I science computers.
Here are some big corpora we use in NLP in addition to the ones already mentioned:

  • ukWaC: a 2-billion-word corpus constructed from the Web by limiting the crawl to the .uk domain and using medium-frequency words from the BNC as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger. There's also a parsed version called PukWaC. Get both at: http://wacky.sslmit.unibo.it/dok...
  • WaCkypedia: a 2009 dump of the English Wikipedia (about 800 million tokens), including part-of-speech/lemma information as well as a full syntactic parse. The texts were extracted from the dump and cleaned using the Wikipedia extractor. Get it at the same URL as ukWaC: http://wacky.sslmit.unibo.it/dok...
  • USENET corpus: a collection of public USENET postings. This corpus was collected between Oct 2005 and Jan 2011, and covers 47,860 English-language, non-binary-file newsgroups. Get it at: http://www.psych.ualberta.ca/~we... [CAVEAT: it's huge!]
  • The collection of data that comes with the Natural Language Toolkit (NLTK). It's probably not as large as the others but it's a good set. See descriptions at: http://nltk.googlecode.com/svn/t...
  • Europarl: Proceedings of the European Parliament in 13 languages. Cleaned and pre-processed for machine translation research. Get it at: http://www.statmt.org/europarl [FYI, NLTK has a built-in interface to access this corpus.]
  • The Google Books Ngram corpus: Pretty big. Get it at: http://books.google.com/ngrams/d...
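
For the NLTK collection mentioned above, corpora are downloaded on demand and then read through a uniform corpus-reader API. A minimal sketch, using the small Brown corpus purely as an illustration; the same calls apply to the larger bundled corpora.

    # Minimal sketch: fetch one NLTK-bundled corpus and inspect it.
    # Brown is used only as a small example of the corpus-reader API.
    import nltk

    nltk.download("brown", quiet=True)   # one-time download into ~/nltk_data
    from nltk.corpus import brown

    print(len(brown.words()))            # total tokens (roughly 1.16 million)
    print(brown.words()[:10])            # first few tokens
    print(brown.categories()[:5])        # Brown is organized by genre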