Data Science

What are Datasets?

Datasets are collections of raw data gathered during the research process, usually in the form of numerical data. Many organizations, such as government agencies, universities, and research institutions, make the data they have collected freely available on the web for other researchers to use.

Note: Data is the raw information from which statistics are created. Statistics give an interpretation of the data.

How to Identify Relevant Datasets?

To identify relevant datasets for use in your research, you can:

  • Search for articles in CityU LibraryFind using your topic keywords and include the terms dataset OR "data set" in the search.
  • Search the website or publications of an organization or government department that collects the type of data that you need.
  • Try searching through a large data archive. 
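
As a rough illustration of the first tip, the short Python sketch below assembles a search string that combines topic keywords with the terms dataset OR "data set". The function name build_dataset_query is hypothetical, and the sketch assumes that the search tool you use accepts AND, OR, parentheses, and quoted phrases; check the tool's own search help before relying on this syntax.

```python
def build_dataset_query(topic_keywords):
    """Combine topic keywords with the terms dataset OR "data set".

    Hypothetical helper for illustration only. It assumes the search
    tool understands AND, OR, parentheses, and quoted phrases.
    """
    dataset_terms = '(dataset OR "data set")'
    # Quote multi-word keywords so they are searched as exact phrases.
    quoted = [f'"{kw}"' if " " in kw else kw for kw in topic_keywords]
    if not quoted:
        return dataset_terms
    return " AND ".join(quoted) + " AND " + dataset_terms


print(build_dataset_query(["air pollution", "Hong Kong"]))
# prints: "air pollution" AND "Hong Kong" AND (dataset OR "data set")
```

The resulting string can be pasted into a search box that supports Boolean operators.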

Finding Datasets & Data Sources

  • Google Dataset Search
    Google Dataset Search, first launched in September 2018, allows you to search millions of public datasets across the Web. Enter your keywords to retrieve a list of published datasets, each shown with the name of its provider.
  • Google Public Data Explorer
    Google Public Data Explorer includes high-quality datasets from providers such as the World Bank, Eurostat, OECD, etc. It is also a visualization tool that makes large, public-interest datasets easy to explore, visualize, and communicate. You can navigate between different views to make customized comparisons.
  • DATA.GOV.HK
    This is a public sector information portal that allows you to find various data about Hong Kong. You can freely download geospatial data for commercial app development, personal analysis, or academic study.
  • Dryad
    Dryad is an open-source, community-driven project that takes a unique approach to data publication and digital preservation. It focuses on search, presentation, and discovery and delegates the responsibility for the data preservation function to the underlying repository with which it is integrated.
  • dataZoa
    You can access over 3 billion datasets from public sites and publishers via dataZoa, which includes data on economics, demographics, energy, finance, health, etc.
  • re3data.org  
    Re3data is a global registry of research data repositories from different academic disciplines. It includes repositories that provide permanent storage of and access to datasets for researchers, funding bodies, publishers, and scholarly institutions.
  • Harvard Dataverse 
    Harvard Dataverse is a free data repository open to all researchers from any discipline. Researchers can share, archive, cite, access, and explore research data.
  • DataCite 
    DataCite provides persistent identifiers (DOIs) for research data and other research outputs to make them more discoverable.
  • UCI Machine Learning Repository
    UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
  • Data.gov (U.S.) 
    Managed and hosted by the U.S. General Services Administration, Technology Transformation Service, Data.gov is primarily a federal open government data site. It does not host data directly, but rather aggregates metadata about open data resources in one centralized location.
  • Eurostat
    Eurostat provides direct, free-of-charge online access to all of its statistical databases and electronic publications. It is updated twice a day and covers data for the European Union (EU), the EU Member States, the euro area, candidate countries, and EFTA countries.
  • data.world
    data.world is a catalog for data and analysis which is free and open to the public.
  • Kaggle
    Kaggle hosts over 19,000 public datasets and 200,000 public notebooks to support a wide range of analyses. You can search for and publish datasets, and explore and build models in a web-based data-science environment.
  • Wharton Research Data Services (WRDS) 
    WRDS is a Library-subscribed database that provides a business intelligence, data analytics, and research platform to global institutions. The following datasets are available:
    • Audit Analytics
    • Compustat Executive Compensation
    • Compustat Global
    • Compustat North America
    • CRSP/Compustat Merged Database
    • Institutional Brokers Estimates System (IBES)
    • RiskMetrics (formerly IRRC) Governance and Directors
    • Thomson Reuters Institutional (13f) Holdings