A list of free and open datasets for machine learning or data research, arranged by topic. Many entries are taken from https://github.com/awesomedata/awesome-public-datasets.
- Hyperspectral benchmark dataset on soil moisture
- U.S. Department of Agriculture's Nutrient Database
- U.S. Department of Agriculture's PLANTS Database - The Complete PLANTS
- 1000 Genomes - The 1000 Genomes Project ran between 2008 and 2015,
- American Gut (Microbiome Project)
- Broad Bioimage Benchmark Collection (BBBC) - The Broad Bioimage Benchmark
- Broad Cancer Cell Line Encyclopedia (CCLE)
- Cell Image Library
- Complete Genomics Public Data - A diverse data set of whole human genomes
- EBI ArrayExpress - ArrayExpress Archive of Functional Genomics Data
- EBI Protein Data Bank in Europe - The Electron Microscopy Data Bank
- ENCODE project - The Encyclopedia of DNA Elements (ENCODE) Consortium
- Electron Microscopy Pilot Image Archive (EMPIAR) - EMPIAR, the Electron
- Ensembl Genomes
- Gene Expression Omnibus (GEO) - GEO is a public functional genomics data
- Gene Ontology (GO) - GO annotation files
- Global Biotic Interactions (GloBI)
- Harvard Medical School (HMS) LINCS Project
- Human Genome Diversity Project
- Human Microbiome Project (HMP)
- ICOS PSP Benchmark
- International HapMap Project
- Journal of Cell Biology DataViewer
- KEGG - KEGG is a database resource for understanding high-level functions
- MIT Cancer Genomics Data
- NCBI - The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information
- NCI Genomic Data Commons - The GDC Data Portal
- OpenSNP genotypes data
- Pathguide - Protein-Protein Interactions Catalog
- Protein Data Bank - This resource is powered by the Protein Data Bank
- Psychiatric Genomics Consortium - The purpose of the Psychiatric Genomics
- PubChem Project
- PubGene (now Coremine Medical)
- Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)
- Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)
- Sequence Read Archive (SRA) - The Sequence Read Archive (SRA) stores raw
- Stanford Microarray Data
- Stowers Institute Original Data Repository
- Systems Science of Biological Dynamics (SSBD) Database - Systems Science
- The Cancer Genome Atlas (TCGA), available via Broad GDAC
- The Catalogue of Life - The Catalogue of Life is a quality-assured
- The Personal Genome Project - The Personal Genome Project, initiated in
- UCSC Public Data
- UniGene
- Universal Protein Resource (UniProt) - The Universal Protein Resource
- Rfam - The Rfam database is a collection of RNA families
- Plant databases
- Actuaries Climate Index
- Australian Weather
- Aviation Weather Center - Consistent, timely and accurate weather data
- Brazilian Weather - Historical data (In Portuguese)
- Canadian Meteorological Centre
- Climate Data from UEA (updated monthly)
- Dutch Weather - The KNMI Data Center (KDC)
- European Climate Assessment & Dataset
- Global Climate Data Since 1929
- Charting The Global Climate Change News Narrative 2009-2020
- NASA Global Imagery Browse Services
- NOAA Bering Sea Climate
- NOAA Climate Datasets
- NOAA Realtime Weather Models
- NOAA SURFRAD Meteorology and Radiation Datasets
- The World Bank Open Data Resources for Climate Change
- UEA Climatic Research Unit
- WU Historical Weather Worldwide
- WorldClim - Global Climate Data
- AMiner Citation Network Dataset
- CrossRef DOI URLs
- DBLP Citation dataset
- DIMACS Road Networks Collection
- NBER Patent Citations
- NIST complex networks data collection
- Network Repository with Interactive Exploratory Analysis Tools
- Protein-protein interaction network
- PyPI and Maven Dependency Network
- Scopus Citation Database
- Small Network Data
- Stanford GraphBase
- Stanford Large Network Dataset Collection
- Stanford Longitudinal Network Data Sources
- The Koblenz Network Collection
- The Laboratory for Web Algorithmics (UNIMI)
- UCI Network Data Repository
- UFL sparse matrix collection
- WSU Graph Database
- 3.5B Web Pages from CommonCrawl 2012
- 53.5B Web clicks of 100K users in Indiana Univ.
- CAIDA Internet Datasets
- CRAWDAD Wireless datasets from Dartmouth Univ.
- ClueWeb09 - 1B web pages
- ClueWeb12 - 733M web pages
- CommonCrawl Web Data over 7 years
- Criteo click-through data
- Internet-Wide Scan Data Repository
- MIRAGE-2019 - MIRAGE-2019 is a human-generated dataset for mobile traffic
- OONI: Open Observatory of Network Interference - Internet censorship data
- Open Mobile Data by MobiPerf
- The Peer-to-Peer Trace Archive - Real-world measurements play a key role
- Rapid7 Sonar Internet Scans
- UCSD Network Telescope, IPv4 /8 net
- Open-source datasets behind the graphics, interactives, and analyses at Google Trends
- Awesome Datasets About Datacenter
- Bruteforce Database
- Challenges in Machine Learning
- CrowdANALYTIX dataX
- D4D Challenge of Orange
- DrivenData Competitions for Social Good
- ICWSM Data Challenge (since 2009)
- KDD Cup by Tencent 2012
- Kaggle Competition Data
- Localytics Data Visualization Challenge
- Netflix Prize
- Space Apps Challenge
- Telecom Italia Big Data Challenge
- TravisTorrent Dataset - MSR'2017 Mining Challenge
- TunedIT - Data mining & machine learning data sets, algorithms, challenges
- Yelp Dataset Challenge
- 38-Cloud (Cloud Detection) - Contains 38 Landsat 8 scene images and their
- AQUASTAT - Global water resources and uses
- BODC - marine data of ~22K vars
- EOSDIS - NASA's earth observing system data
- Earth Models
- Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements
- Marinexplore - Open Oceanographic Data
- Alabama Real-Time Coastal Observing System
- National Estuarine Research Reserves System-Wide Monitoring Program -
- Oil and Gas Authority Open Data - The dataset covers 12,500 offshore
- Smithsonian Institution Global Volcano and Eruption Database
- USGS Earthquake Archives
- ICGEM – Hosts gravity field spherical harmonic models and provides a webservice for generating grids of gravity functionals
- TerraNubis – The new Open Seismic Repository, includes the classic F3 and Penobscot seismic volumes
- awesome-satellite-imagery-datasets - List of satellite image training datasets with annotations for computer vision and deep learning
- Awesome Remote Sensing Change Detection - List of datasets, codes, papers, and contests related to remote sensing change detection
- Earth Online - Earth Observation information discovery platform
- DATA.NASA.GOV - NASA's clearinghouse site for open data provided to the public
- Geopedia
- American Economic Association (AEA)
- EconData from UMD
- Economic Freedom of the World Data
- Historical MacroEconomic Statistics
- INFORUM - Interindustry Forecasting at the University of Maryland
- DBnomics – the world's economic database - Aggregates hundreds of
- International Trade Statistics
- Internet Product Code Database
- Joint External Debt Data Hub
- Jon Haveman International Trade Data Links
- Long-Term Productivity Database - The Long-Term Productivity database was
- OpenCorporates Database of Companies in the World
- Our World in Data
- SciencesPo World Trade Gravity Datasets
- The Atlas of Economic Complexity
- The Center for International Data
- The Observatory of Economic Complexity
- UN Commodity Trade Statistics
- UN Human Development Reports
- SIMBAD Astronomical Database - provides basic data, cross-identifications, bibliography and measurements for astronomical objects outside the solar system
- The Yelp dataset - a subset of Yelp's businesses, reviews, and user data for use in personal, educational, and academic purposes.
- College Scorecard Data
- New York State Education Department Data - The New York State Education
- Student Data from Free Code Camp
- Stanford Open Data Portal
- AMPds - The Almanac of Minutely Power dataset
- BLUEd - Building-Level fUlly labeled Electricity Disaggregation dataset
- COMBED
- DEL - Domestic Electrical Load study datasets for South Africa (1994 - 2014)
- ECO - The ECO data set is a comprehensive data set for non-intrusive load
- EIA
- Global Power Plant Database - The Global Power Plant Database is a
- HES - Household Electricity Study, UK
- HFED
- PEM1 - Proton Exchange Membrane (PEM) Fuel Cell Dataset
- PLAID - The Plug Load Appliance Identification Dataset
- The Public Utility Data Liberation Project (PUDL)
- REDD
- Smart Meter Data Portal
- Tracebase
- Ukraine Energy Centre Datasets
- UK-DALE - UK Domestic Appliance-Level Electricity
- WHITED
- iAWE
- NOPIMS – Open petroleum geoscience data from Western Australia made available by the Australian Government
- UK National Data Repository – Open petroleum geoscience data from the UK Government (free registration required)
- Athabasca Oil Sands Well Dataset McMurray/Wabiskaw – Well logs and stratigraphic picks for 2193 wells, including 750 with lithofacies, from Alberta, Canada
- Blockmodo Coin Registry - A registry of JSON formatted information files
- CBOE Futures Exchange
- Google Finance
- Google Trends
- NASDAQ
- NYSE Market Data ftp://ftp.nyxdata.com/
- OANDA
- OSU Financial data
- Quandl
- St Louis Federal
- Yahoo Finance
- ArcGIS Open Data portal
- Cambridge, MA, US, GIS data on GitHub
- Database of all continents, countries, States/Subdivisions/Provinces and
- Factual Global Location Data
- IEEE Geoscience and Remote Sensing Society DASE Website
- Geo Maps - High Quality GeoJSON maps programmatically generated
- Geo Spatial Data from ASU
- Geo Wiki Project - Citizen-driven Environmental Monitoring
- GeoFabrik - OSM data extracted to a variety of formats and areas
- GeoNames Worldwide
- Global Administrative Areas Database (GADM) - Geospatial data organized
- Homeland Infrastructure Foundation-Level Data
- Landsat 8 on AWS
- List of all countries in all languages
- National Weather Service GIS Data Portal
- Natural Earth - vectors and rasters of the world
- OpenAddresses
- OpenStreetMap (OSM)
- Pleiades - Gazetteer and graph of ancient places
- Reverse Geocoder using OSM data
- Robin Wilson - Free GIS Datasets
- TIGER/Line - U.S. boundaries and roads
- TZ Timezones shapefiles
- TwoFishes - Foursquare's coarse geocoder
- UN Environmental Data
- World boundaries from the U.S. Department of State
- World countries in multiple formats
- High-resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions
- Poseidon NW Australia – Interpreted 3D seismic (32bit) including reports and well logs
- Quantarctica – User-configurable QGIS basemap for Antarctica with high-quality, peer-reviewed, free and open Antarctic scientific data
- Awesome Open City Data - A curated list of open data sources to analyze and compare cities in a holistic way à la data science and empower citizens
- Global Open Data Index - provides the most comprehensive snapshot available of the state of open government data publication
- Alberta, Province of Canada
- Antwerp, Belgium
- Argentina (non official)
- Datos Argentina - Open data portal of the Argentine Republic.
- Austin, TX, US
- Australia (abs.gov.au)
- Australia (data.gov.au)
- Austria (data.gv.at)
- Baton Rouge, LA, US
- Beersheba, Israel - Open Data Portal (Smart7 OpenData)
- Belgium
- Boston, US
- Brazil
- Buenos Aires, Argentina
- Calgary, AB, Canada
- Cambridge, MA, US
- Canada
- Chicago
- Chile
- China
- Colombia
- Dallas Open Data
- DataBC - data from the Province of British Columbia
- District of Columbia, US
- Denver Open Data
- Durham, NC Open Data
- Edmonton, AB, Canada
- England LGInform
- European Union Open Data Portal
- EuroStat
- EveryPolitician - Ongoing project collating and sharing data on every
- Federal Committee on Statistical Methodology (FCSM) (formerly FedStats)
- Finland
- France
- Fredericton, NB, Canada
- Gatineau, QC, Canada
- Germany
- Ghent, Belgium
- Glasgow, Scotland, UK
- Greece
- Guardian world governments
- Halifax, NS, Canada
- Helsinki Region, Finland
- Hong Kong, China
- Houston, TX, US
- Indian Government Data
- Indonesian Data Portal
- Ireland's Open Data Portal
- Israel's Open Data Portal
- Istanbul Municipality Open Data Portal
- Italy - The dati.gov.it portal is the national catalog of metadata
- Italy - Awesome Italian Public Datasets
- Japan
- Awesome Open Data Latam - A list of information about datasets, organizations, events, and tools related to Open Data in Latin America
- Laval, QC, Canada
- Lexington, KY
- London Datastore, UK
- London, ON, Canada
- Los Angeles Open Data
- Luxembourg - Luxembourgish Open Data Portal
- Malaysia
- MassGIS, Massachusetts, U.S.
- Melbourne, Australia - City of Melbourne Open Data
- Metropolitan Transportation Commission (MTC), California, US
- Mexico
- Mississauga, ON, Canada
- Moldova
- Moncton, NB, Canada
- Montreal, QC, Canada
- Mountain View, California, US (GIS)
- NYC Open Data
- NYC betanyc
- Netherlands
- New Zealand Stats
- New Zealand Data
- OECD
- Oakland, California, US
- Oklahoma
- Open Data for Africa
- Open Government Data (OGD) Platform India
- OpenDataSoft's list of 1,600 open data
- Oregon
- Ottawa, ON, Canada
- Palo Alto, California, US
- Philadelphia, US - OpenDataPhilly is a catalog of open data in the
- Portland, Oregon
- Portugal - Pordata organization
- Puerto Rico Government
- Quebec City, QC, Canada
- Quebec Province of Canada
- Queensland, Australia
- Regina SK, Canada
- Rio de Janeiro, Brazil
- Romania
- Russia
- San Diego, CA
- San Antonio, TX - Community Information Now - CI:Now is a nonprofit
- San Francisco Data sets
- San Jose, California, US
- San Mateo County, California, US
- Saskatchewan, Province of Canada
- Seattle
- Singapore Government Data
- Scotland - The National Library of Scotland
- South Africa Trade Statistics
- South Africa
- Spain
- State of Utah, US
- Switzerland
- Taiwan gov
- Taiwan - awesome-opendata-taiwan-gov
- Tel-Aviv Open Data
- Texas Open Data
- The World Bank
- Toronto, ON, Canada
- Tunisia
- United Arab Emirates
- U.K. Government Data
- U.S. American Community Survey
- U.S. CDC Public Health datasets
- U.S. Census Bureau
- U.S. Department of Housing and Urban Development (HUD)
- U.S. Federal Government Agencies
- U.S. Federal Government Data Catalog
- U.S. Food and Drug Administration (FDA)
- U.S. National Center for Education Statistics (NCES)
- U.S. Open Government
- U.S. National Archives and Records Administration - The National Archives and Records Administration (NARA) is the nation's record keeper. Of all documents and materials created in the course of business conducted by the United States Federal government, only 1%-3% are so important for legal or historical reasons that they are kept by us forever.
- UK 2011 Census Open Atlas Project
- U.S. Patent and Trademark Office (USPTO) Bulk Data Products
- Uganda Bureau of Statistics
- Ukraine
- United Nations
- Uruguay
- Valley Transportation Authority (VTA), California, US
- Vancouver, BC Open Data Catalog
- Victoria, BC, Canada
- Vienna, Austria
- U.S. Congressional Research Service (CRS) Reports
- Waterloo, CA - Datasets powering the Open Data API
- AWS COVID-19 Datasets - We're working with organizations who make
- 2019 Novel Coronavirus COVID-19 Data Repository by Johns Hopkins CSSE -
- Coronavirus (Covid-19) Data in the United States - The New York Times is
- Composition of Foods Raw, Processed, Prepared USDA National Nutrient Database for Standard
- EHDP Large Health Data Sets
- GDC - GDC supports several cancer genome programs for CCG, TCGA, TARGET etc.
- Gapminder World demographic databases
- MeSH, the vocabulary thesaurus used for indexing articles for PubMed
- Medicare Coverage Database (MCD), U.S.
- Medicare Data Engine of medicare.gov Data
- Medicare Data File
- Number of Ebola Cases and Deaths in Affected Countries (2014)
- Open-ODS (structure of the UK NHS)
- OpenPaymentsData, Healthcare financial relationship data
- PhysioBank Databases - A large and growing archive of physiological data.
- The Cancer Imaging Archive (TCIA)
- The Cancer Genome Atlas project (TCGA)
- World Health Organization Global Health Observatory
- Informatics for Integrating Biology & the Bedside
- World Stress Map – A global compilation of information on the crustal present-day stress field
- openFDA - a research project to provide open APIs, raw data downloads, documentation and examples, and a developer community for an important collection of FDA public datasets
- Kushy Cannabis Dataset - a collection of tabular data from different sectors of the industry, from strains to products to lab results
- Substance Abuse and Mental Health Services Administration
- HealthData.gov
- WHO - Global Health Observatory data repository
- UNICEF
- Internet Archive - Internet Archive is a non-profit library of millions of free books, movies, software, music, websites, and more.
- Censys - Censys is the proven leader in Attack Surface Management by relentlessly searching and proactively monitoring your digital footprint far more broadly and deeply than ever thought possible.
- Stanford Internet Research Data Repository - The Stanford Internet Research Data Repository is a public archive of research datasets that describe the hosts, services, and websites on the Internet.
- 10k US Adult Faces Database
- 2GB of Photos of Cats
- Adience Unfiltered faces for gender and age classification
- Affective Image Classification
- Animals with attributes
- CADDY Underwater Stereo-Vision Dataset of divers' hand gestures -
- Caltech Pedestrian Detection Benchmark
- Chars74K dataset - Character Recognition in Natural Images (both English
- Danbooru Tagged Anime Illustration Dataset - A large-scale anime image
- DukeMTMC Data Set - DukeMTMC aims to accelerate advances in multi-target
- Face Recognition Benchmark
- Flickr: 32 Class Brand Logos
- GDXray - X-ray images for X-ray testing and Computer Vision
- HumanEva Dataset - The HumanEva-I dataset contains 7 calibrated video
- ImageNet (in WordNet hierarchy)
- Indoor Scene Recognition
- International Affective Picture System, UFL
- KITTI Vision Benchmark Suite
- Labeled Information Library of Alexandria - Biology and Conservation -
- MNIST database of handwritten digits, 70,000 examples
- Massive Visual Memory Stimuli, MIT
- Open Images From Google - Pictures with segmentation masks for 2.8
- SUN database, MIT
- SVIRO Synthetic Vehicle Interior Rear Seat Occupancy - 25,000 synthetic
- Several Shape-from-Silhouette Datasets
- Stanford Dogs Dataset
- The Action Similarity Labeling (ASLAN) Challenge
- The Oxford-IIIT Pet Dataset
- Violent-Flows - Crowd Violence / Non-violence Database and benchmark
- Visual genome
- YouTube Faces Database
- Open Images Dataset - a dataset of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories
- ocr-open-dataset - a list of open datasets for OCR
- The European Monitoring Centre for Drugs and Drug Addiction
- FACE RECOGNITION HOMEPAGE
- Elephind - The goal of Elephind.com is to make it possible to search all the world’s online historic newspapers from one place. We aren’t there yet, but we are adding more newspapers every day.
- Google Newspapers
- The British Newspaper Archive - Access hundreds of historic newspapers from all over Britain and Ireland
- Wall Street Journal Archive
- Irish Newspaper Archives - The world’s largest and oldest online database of Irish newspapers.
- Wikipedia List of online newspaper archives - This is a list of online newspaper archives and some magazines and journals, including both free and paywalled digital archives.
- ABC News Archive
- Guardian News & Media archive - The GNM archive collects and preserves original and unique documents and objects that tell the story of the Guardian and Observer, and can be visited by appointment
- AP Archive - Relive iconic headlines that have shaped our world
- All-Age-Faces Dataset - Contains 13,322 Asian face images distributed
- Context-aware data sets from five domains
- Delve Datasets for classification and regression
- Discogs Monthly Data
- Free Music Archive
- IMDb Database
- Keel Repository for classification, regression and time series
- Labeled Faces in the Wild (LFW)
- Lending Club Loan Data
- Machine Learning Data Set Repository
- Million Song Dataset
- More Song Datasets
- MovieLens Data Sets
- New Yorker caption contest ratings
- RDataMining - "R and Data Mining" ebook data
- Registered Meteorites on Earth
- Restaurants Health Score Data in San Francisco
- UCI Machine Learning Repository
- Yahoo! Ratings and Classification Data
- YouTube-BoundingBoxes
- YouTube-8M
- eBay Online Auctions (2012)
- Datasets opened by Lithium Technologies | Klout
- Canada Science and Technology Museums Corporation's Open Data
- Cooper-Hewitt's Collection Database
- Minneapolis Institute of Arts metadata
- Natural History Museum (London) Data Portal
- Rijksmuseum Historical Art Collection
- Tate Collection metadata
- The Getty vocabularies
- Automatic Keyphrase Extraction
- The Big Bad NLP Database
- Blizzard Challenge Speech - The speech + text data comes from
- Blogger Corpus
- CLiPS Stylometry Investigation Corpus
- ClueWeb09 FACC
- ClueWeb12 FACC
- DBpedia - 4.58M things with 583M facts
- Flickr Personal Taxonomies
- Freebase of people, places, and things
- Google Books Ngrams (2.2TB)
- Google MC-AFP - Generated based on the public available Gigaword dataset
- Google Web 5gram (1TB, 2006)
- Gutenberg eBooks List
- Hansards text chunks of Canadian Parliament
- LJ Speech - Speech dataset consisting of 13,100 short audio clips of a
- M-AILabs Speech - The M-AILABS Speech Dataset is the first large dataset
- Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)
- Machine Comprehension Test (MCTest) of text from Microsoft Research
- Machine Translation of European languages
- Making Sense of Microposts 2013 - Concept Extraction
- Making Sense of Microposts 2016 - Named Entity rEcognition and Linking
- Multi-Domain Sentiment Dataset (version 2.0)
- Noisy speech database for training speech enhancement algorithms and TTS
- Open Multilingual Wordnet
- POS/NER/Chunk annotated data
- Personae Corpus
- SMS Spam Collection in English
- SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)
- Stanford Question Answering Dataset (SQuAD)
- USENET postings corpus of 2005~2011
- Universal Dependencies
- Webhose - News/Blogs in multiple languages
- Wikidata - Wikipedia databases
- Wikipedia Links data - 40 Million Entities in Context
- WordNet databases and tools
- WorldTree Corpus of Explanation Graphs for Elementary Science Questions -
- Allen Institute Datasets
- Brain Catalogue
- Brainomics
- CodeNeuro Datasets
- Collaborative Research in Computational Neuroscience (CRCNS)
- FCP-INDI
- Human Connectome Project
- NDAR
- NIMH Data Archive
- NeuroData
- NeuroMorpho - NeuroMorpho.Org is a centrally curated inventory of
- Neuroelectro
- OASIS
- OpenNEURO - A free and open platform for sharing MRI, MEG, EEG, iEEG, and ECoG data
- OpenfMRI
- Study Forrest
- CERN Open Data Portal
- Crystallography Open Database
- IceCube - South Pole Neutrino Observatory
- LIGO Open Science Center (LOSC) - Gravitational wave data from the LIGO
- NASA Exoplanet Archive
- NSSDC (NASA) data of 550 spacecraft
- Sloan Digital Sky Survey (SDSS) - Mapping the Universe
- ATLAS Open Data provides open access to proton-proton collision data at the LHC
- EOPC-DE-Early-Onset-Prostate-Cancer-Germany - Early Onset Prostate Cancer
- GENIE - Data from the Genomics Evidence Neoplasia Information Exchange
- Genomic-Hallmarks-Prostate-Adenocarcinoma-CPC-GENE - Comprehensive
- MSK-IMPACT-Clinical-Sequencing-Cohort-MSKCC-Prostate-Cancer - Targeted
- Metastatic-Prostate-Adenocarcinoma-MCTP - Comprehensive profiling of 61
- Metastatic-Prostate-Cancer-SU2CPCF-Dream-Team - Comprehensive analysis of
- NPCR-2001-2015 - Database from CDC's National Program of Cancer
- NPCR-2005-2015 - Database from CDC's National Program of Cancer
- NaF-Prostate - NaF Prostate is a collection of F-18 NaF positron emission
- Neuroendocrine-Prostate-Cancer - Whole exome and RNA Seq data of
- PLCO-Prostate-Diagnostic-Procedures - The Prostate Diagnostic Procedures
- PLCO-Prostate-Medical-Complications - The Prostate Medical Complications
- PLCO-Prostate-Screening-Abnormalities - The Prostate Screening
- PLCO-Prostate-Screening - The Prostate Screening dataset (177,315
- PLCO-Prostate-Treatments - The Prostate Treatments dataset (13,409
- PLCO-Prostate - The Prostate dataset is a comprehensive dataset that
- PRAD-CA-Prostate-Adenocarcinoma-Canada - Prostate Adenocarcinoma -
- PRAD-FR-Prostate-Adenocarcinoma-France - Prostate Adenocarcinoma -
- PRAD-UK-Prostate-Adenocarcinoma-United-Kingdom - Prostate Adenocarcinoma
- PROSTATEx-Challenge - Retrospective set of prostate MR studies. All
- Prostate-3T - The Prostate-3T project provided imaging data to TCIA
- Prostate-Adenocarcinoma-Broad-Cornell-2012
- Prostate-Adenocarcinoma-Broad-Cornell-2013
- Prostate-Adenocarcinoma-CNA-study-MSKCC - Copy-number profiling of 103
- Prostate-Adenocarcinoma-Fred-Hutchinson-CRC
- Prostate Adenocarcinoma (MSKCC/DFCI) - Whole Exome Sequencing of 1013
- Prostate-Adenocarcinoma-MSKCC - MSKCC Prostate Oncogenome Project. 181
- Prostate-Adenocarcinoma-Organoids-MSKCC - Exome profiling of prostate
- Prostate-Adenocarcinoma-Sun-Lab - Whole-genome and Transcriptome
- Prostate-Adenocarcinoma-TCGA-PanCancer-Atlas - Comprehensive TCGA
- Prostate-Adenocarcinoma-TCGA - Integrated profiling of 333 primary
- Prostate-Diagnosis - PCa T1- and T2-weighted magnetic resonance images
- Prostate-Fused-MRI-Pathology - The Prostate Fused-MRI-Pathology
- Prostate-MRI - The Prostate-MRI collection of prostate Magnetic Resonance
- Prostate-R - The popular statistical package R contains a prostate cancer
- QIN-PROSTATE-Repeatability - The QIN-PROSTATE-Repeatability dataset
- QIN-PROSTATE - The QIN PROSTATE collection of the Quantitative Imaging
- SEER-YR1973_2015.SEER9 - The SEER November 2017 Research Data files
- SEER-YR1992_2015.SJ_LA_RG_AK - The SEER November 2017 Research Data files
- SEER-YR2000_2015.CA_KY_LO_NJ_GA - The SEER November 2017 Research Data
- SEER-YR2000_2015.CA_KY_LO_NJ_GA - The July - December 2005 diagnoses
- TCGA-PRAD-US - TCGA Prostate Adenocarcinoma (499 samples).
- Amazon
- Archive.org Datasets
- Archive-it from Internet Archive
- CMU JASA data archive
- CMU StatLab collections
- Data.World
- Data360
- Enigma Public
- Grand Comics Database - The Grand Comics Database (GCD) is a nonprofit,
- Infochimps
- KDNuggets Data Collections
- Microsoft Azure Data Market Free DataSets
- Microsoft Data Science for Research
- Microsoft Research Open Data
- Numbrary
- Open Library Data Dumps
- Reddit Datasets
- RevolutionAnalytics Collection
- Sample R data sets
- StatSci.org
- Stats4Stem R data sets (archived)
- The Washington Post List
- UCLA SOCR data collection
- UFO Reports
- Wikileaks 911 pager intercepts
- Yahoo Webscope
- Shodan - the world's first search engine for Internet-connected devices.
- Censys - the most reputable, exhaustive, and up-to-date source of Internet scan data in the world, so you see everything.
- Academic Torrents of data sharing from UMB
- DataMarket (Qlik)
- Datahub.io
- Domains Project - Sorted list of Internet domains
- Harvard Dataverse Network of scientific data
- ICPSR (UMICH)
- Institute of Education Sciences
- National Technical Reports Library
- Open Data Certificates (beta)
- OpenDataNetwork - A search engine of all Socrata powered data portals
- Statista.com - Statistics and studies
- Zenodo - An open dependable home for the long-tail of science
- Dryad - a curated resource that makes research data discoverable, freely reusable, and citable. Dryad provides a general-purpose home for a wide diversity of data types
- OSF - OSF is a free, open platform to support your research and enable collaboration
- Statcrunch - Access tens of thousands of datasets, perform complex analyses, and generate compelling reports
- TensorFlow Datasets - a collection of ready-to-use datasets
- Mendeley
- Google Dataset Search
- Google public data
- data.world is the modern catalog for data and analysis
- Harvard Dataverse - a repository for research data
- UC Irvine Machine Learning Repository
- appen - Datasets curated on the Appen platform
- Kaggle Datasets
- Registry of Open Data on AWS
- scale.com - Index of Open Datasets for Computer Vision and Natural Language Processing
- Dataverse - Open source research data repository software
- Open Data Kit
- CKAN - the world’s leading Open Source data portal platform
- Open Data Monitor
- Plenar.io
- OSINT framework - OSINT framework focused on gathering information from free tools or resources.
- 72 hours #gamergate Twitter Scrape
- Ancestry.com Forum Dataset over 10 years
- CMU Enron Email of 150 users
- Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape
- A Twitter Dataset of 40+ million tweets related to COVID-19 - Due to the
- EDRM Enron EMail of 151 users, hosted on S3
- Facebook Data Scrape (2005)
- Facebook Social Networks from LAW (since 2007)
- Foursquare from UMN/Sarwat (2013)
- GitHub Collaboration Archive
- Google Scholar citation relations
- High-Resolution Contact Networks from Wearable Sensors
- Indie Map: social graph and crawl of top IndieWeb sites
- Mobile Social Networks from UMASS
- Network Twitter Data
- Reddit Comments
- Skytrax' Air Travel Reviews Dataset
- Social Twitter Data
- SourceForge.net Research Data
- Twitter Data for Online Reputation Management
- Twitter Data for Sentiment Analysis
- Twitter Graph of entire Twitter site
- Twitter Scrape Calufa May 2011
- UNIMI/LAW Social Network Datasets
- United States Congress Twitter Data - Daily datasets with tweets of 1100+
- Yahoo! Graph and Social Data
- YouTube Video Social Graph in 2007, 2008
- Dataset of top posts from reddit.
- fivethirtyeight.com
- ACLED (Armed Conflict Location & Event Data Project)
- Canadian Legal Information Institute
- Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc
- Correlates of War Project
- Cryptome Conspiracy Theory Items
- Datacards
- European Social Survey
- FBI Hate Crime 2013 - aggregated data
- Fragile States Index
- GDELT Global Events Database
- General Social Survey (GSS) since 1972
- German Social Survey
- Global Religious Futures Project
- Gun Violence Data - A comprehensive, accessible database that contains
- Humanitarian Data Exchange
- INFORM Index for Risk Management
- Institute for Demographic Studies
- International Networks Archive
- International Social Survey Program ISSP
- International Studies Compendium Project
- James McGuire Cross National Data
- MIT Reality Mining Dataset
- MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste
- Microsoft Academic Knowledge Graph - The Microsoft Academic Knowledge
- Minnesota Population Center
- Notre Dame Global Adaptation Index (ND-GAIN)
- Open Crime and Policing Data in England, Wales and Northern Ireland
- OpenSanctions - A global database of persons and companies of political,
- Paul Hensel General International Data Page
- PewResearch Internet Survey Project
- PewResearch Society Data Collection
- Political Polarity Data
- StackExchange Data Explorer
- Terrorism Research and Analysis Consortium
- Texas Inmates Executed Since 1984
- Titanic Survival Data Set
- UCB's Archive of Social Science Data (D-Lab)
- UCLA Social Sciences Data Archive
- UN Civil Society Database
- UPJOHN for Labor Employment Research
- Universities Worldwide
- Uppsala Conflict Data Program
- World Bank Open Data
- WorldPop project - Worldwide human population distributions
- FLOSSmole data about free, libre, and open source software development
- GHTorrent - Scalable, queryable, offline mirror of data offered through
- Libraries.io Open Source Repository and Dependency Metadata
- Public Git Archive - a Big Code dataset for all – dataset of 182,014 top-
- Code duplicates - 2k Java file and 600 Java function pairs labeled as
- Commit messages - 1.3 billion GitHub commit messages till March 2019
- Pull Request review comments - 25.3 million GitHub PR review comments
- Source Code Identifiers - 41.7 million distinct splittable identifiers
- American Ninja Warrior Obstacles - Contains every obstacle in the history
- Betfair Historical Exchange Data
- Cricsheet Matches (cricket)
- Ergast Formula 1, from 1950 up to date (API)
- Football/Soccer resources (data and APIs)
- Lahman's Baseball Database
- Pinhooker: Thoroughbred Bloodstock Sale Data
- Pro Kabaddi season 1 to 7 - Pro Kabaddi League is a professional-level
- Retrosheet Baseball Statistics
- Tennis database of rankings, results, and stats for ATP
- Tennis database of rankings, results, and stats for WTA
- StatsBomb Open Data - Free football data
- 3W dataset - To the best of its authors' knowledge, this is the first
- Databanks International Cross National Time Series Data Archive
- Hard Drive Failure Rates
- Heart Rate Time Series from MIT
- Time Series Data Library (TSDL) from MU
- Turing Change Point Dataset - Contains 42 annotated time series collected
- UC Riverside Time Series Dataset
- Airlines OD Data 1987-2008
- Ford GoBike Data (formerly Bay Area Bike Share Data)
- Bike Share Systems (BSS) collection
- Dutch Traffic Information
- GeoLife GPS Trajectory from Microsoft Research
- German train system by Deutsche Bahn
- Hubway Million Rides in MA
- Montreal BIXI Bike Share
- NYC Taxi Trip Data 2009-
- NYC Taxi Trip Data 2013 (FOIA/FOILed)
- NYC Uber trip data April 2014 to September 2014
- Open Traffic collection
- OpenFlights - airport, airline and route data
- Philadelphia Bike Share Stations (JSON)
- Plane Crash Database, since 1920
- RITA Airline On-Time Performance data
- RITA/BTS transport data collection (TranStat)
- Renfe (Spanish National Railway Network) dataset (data.renfe.com)
- Toronto Bike Share Stations (JSON and GBFS files)
- Transport for London (TFL)
- Travel Tracker Survey (TTS) for Chicago
- U.S. Bureau of Transportation Statistics (BTS)
- U.S. Domestic Flights 1990 to 2009
- U.S. Freight Analysis Framework since 2007
- U.S. National Highway Traffic Safety Administration - Fatalities since
- Data Packaged Core Datasets
- Database of Scientific Code Contributions
- DataWrangling: Some Datasets Available on the Web
- Inside-r: Finding Data on the Internet
- OpenDataMonitor: An overview of available open data resources in Europe
- Quora: Where can I find large datasets open to the public?
- RS.io: 100+ Interesting Data Sets for Statistics
- StaTrek: Leveraging open data to understand urban lives
- Opendata resources in Russian
- Links to awesome open datasets
- awesome-opendata - A curated list of awesome Open Data resources, tools and other awesomeness
- 25 Open Datasets for Deep Learning Every Data Scientist Must Work With
- Datasets for Data Science and Machine Learning
- The Open Data Barometer - A global measure of how governments are publishing and using open data for accountability, innovation and social impact
- 11 websites to find free, interesting datasets
- DeepDive Open Datasets
- Big Data: 33 Brilliant And Free Data Sources Anyone Can Use
- 13 Open Source Datasets for Machine Learning
- The Best Public Datasets for Machine Learning and Data Science
- Open Datasets
- Top 10 Open Dataset Resources on Github
If you wish to contribute to this list, fork the repository, make your changes, and send me a pull request. I'll be happy to review all of your suggestions :)