I'll try to restrict this answer to datasets larger than 1 GB, ordered from largest to smallest.
More than 1 TB
- The 1000 Genomes project makes 260 TB of human genome data available [13].
- The Internet Archive is making an 80 TB web crawl available for research [17].
- The TREC conference made the ClueWeb09 dataset [3] available a few years back. You'll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the sneakernet data transfer. The data is about 5 TB compressed (see the streaming sketch at the end of this list).
- ClueWeb12 [21] is now available, as are the Freebase annotations (FACC1) [22].
- CNetS at Indiana University makes a 2.5 TB click dataset available [19].
- ICWSM made a large corpus of blog posts available for their 2011 conference [2]. You'll have to register (a paper form, not an online one), but it's free. It's about 2.1 TB compressed.
- The Yahoo News Feed dataset is 1.5 TB compressed and 13.5 TB uncompressed.
- The Proteome Commons makes several large datasets available. The largest, the Personal Genome Project [11], is 1.1 TB; several others exceed 100 GB.
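
A practical aside on the web-crawl corpora above: ClueWeb09 and ClueWeb12 ship as many gzip-compressed WARC files, so you'll want to stream records rather than decompress terabytes to disk. Here's a minimal sketch using the open-source warcio library; the file path is hypothetical, and in practice you'd loop over every .warc.gz file in the distribution.

```python
# A minimal sketch for streaming records out of a ClueWeb-style .warc.gz
# file without unpacking it first. Requires the warcio package
# (pip install warcio); the file path below is hypothetical.
from warcio.archiveiterator import ArchiveIterator

def iter_page_urls(warc_path):
    """Yield the target URL of every response record in one WARC file."""
    with open(warc_path, 'rb') as stream:
        # ArchiveIterator transparently handles the gzip compression.
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                yield record.rec_headers.get_header('WARC-Target-URI')

if __name__ == '__main__':
    for url in iter_page_urls('en0000-00.warc.gz'):  # hypothetical filename
        print(url)
```
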
More than 1 GB
- The Reference Energy Disaggregation Data Set [12] has data on home energy use; it's about 500 GB compressed.
- The Tiny Images dataset [10] has 227 GB of image data and 57 GB of metadata.
- The ImageNet dataset [18] is also large; the full image collection runs well over 100 GB.
- The MOBIO dataset [14] is about 135 GB of video and audio data.
- The Yahoo! Webscope program [7] makes several datasets over 1 GB available to academic researchers, including an 83 GB set of Flickr image features and the Yahoo! Music dataset used for the 2011 KDD Cup [9], which is a bit over 1 GB.
- Google released a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]; it's about 10 GB compressed.
- Yandex recently made a very large web search click dataset available [1]. You'll have to register online for the contest to download it. It's about 5.6 GB compressed.
- Freebase makes regular data dumps available [5]. The largest is their Quad dump [4], which is about 3.6 GB compressed (see the parsing sketch after this list).
- The Open American National Corpus [8] is about 4.8 GB uncompressed.
- For a recent Kaggle competition [6], Wikipedia made available a dataset containing information about edits; the training set is about 2.0 GB uncompressed.
- The Research and Innovative Technology Administration (RITA) has made available a dataset on the on-time performance of domestic flights operated by large carriers. The ASA compressed this dataset and makes it available for download [16] (see the pandas sketch after this list).
- The wiki-links data made available by Google is about 1.75 GB total [20].
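
On the Freebase Quad dump mentioned above, here's a sketch of how you might stream it. I'm assuming, as the name suggests, a gzip-compressed text file of tab-separated quads (source, property, destination, value); the exact filename and field layout are assumptions, so check the dump's documentation.

```python
# A minimal sketch for streaming the Freebase quad dump without fully
# decompressing it. Assumes a gzip-compressed file of tab-separated
# quads (source, property, destination, value); filename is hypothetical.
import gzip

def iter_quads(path):
    """Yield each line of the dump as a tuple of tab-separated fields."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if len(fields) >= 3:  # skip blank or malformed lines
                yield tuple(fields)

if __name__ == '__main__':
    for i, quad in enumerate(iter_quads('freebase-quad-dump.tsv.gz')):
        print(quad)
        if i >= 9:  # just peek at the first ten quads
            break
```
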
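And for the RITA flight data hosted by the ASA: the files are published as one bzip2-compressed CSV per year, so something like the following works without ever extracting them. The filename and the column names ('Month', 'ArrDelay') follow the published schema as I remember it, but treat them as assumptions.

```python
# A minimal sketch: compute average arrival delay by month for one year of
# the on-time flight data, reading the bzip2-compressed CSV in chunks so
# memory use stays flat. Filename and column names are assumptions.
import pandas as pd

sums = pd.Series(dtype=float)
counts = pd.Series(dtype=float)
for chunk in pd.read_csv('2008.csv.bz2', compression='bz2',
                         usecols=['Month', 'ArrDelay'], chunksize=500_000):
    grouped = chunk.groupby('Month')['ArrDelay']
    sums = sums.add(grouped.sum(), fill_value=0)      # NaNs (cancellations) are skipped
    counts = counts.add(grouped.count(), fill_value=0)

print(sums / counts)  # mean arrival delay (minutes) per month
```
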