

Best Places to Find Data Sets for Data Science Projects


Getting information off the Internet is like taking a drink from a firehose.

If you are working on a Data Science project, or have worked on one, chances are you have spent plenty of time hunting for interesting and useful datasets to analyze. It is fun to sit and explore different datasets until you realize that the data is not really useful after all. Thankfully, there are online resources that curate datasets, and some of them even weed out the ones that are not useful.

Do you need an enormous amount of data for your project? Do you just want to make an infographic around an interesting fact and are looking for data? Do you simply want to know what kind of data is available on the internet? If you have any of these questions, you’re in just the right place. In this post, we’ll briefly talk about various Data Science projects, including machine learning, data visualization, data mining and data cleaning projects. Mainly, we will walk through where you can find datasets related to your project.

What You Can Make from It, and How

The saying “the sky is the limit” holds true here. How much you get out of these datasets depends on your creativity.

The simplest way to use data is to create stories (data stories, in this case) and publish them on the web. Tableau is a sturdy platform to start with. Or, if you are working on your own product, these datasets can power it by providing new insights from new input data. These are just two ways to look at it; there are hundreds of other applications you can put this data to.

Let’s start with Machine Learning

Prediction and classification are core tasks in machine learning, and you want to perform them efficiently. To do that, make sure that:

  • The dataset is clean and not too messy; otherwise you will spend more time cleaning than actually predicting.
  • The columns are related to one another, and there is one interesting column to make predictions for.
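The two checks above can be roughed out before you commit to a dataset. The following is a minimal sketch using only Python's standard library. The function name `summarize_columns` and the tiny sample table are hypothetical, made up for illustration and not taken from any of the repositories listed in this post. It counts missing cells per column and the number of distinct values, which gives a quick feel for how messy the data is and which column could serve as a prediction target.

```python
import csv
import io


def summarize_columns(csv_text):
    """Return {column: {"missing": n, "distinct": m}} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    missing = {name: 0 for name in fieldnames}
    values = {name: set() for name in fieldnames}
    for row in reader:
        for name in fieldnames:
            # Short rows yield None for absent cells; treat them like empty strings.
            cell = (row[name] or "").strip()
            if cell == "":
                missing[name] += 1
            else:
                values[name].add(cell)
    return {
        name: {"missing": missing[name], "distinct": len(values[name])}
        for name in fieldnames
    }


# Invented sample data, for illustration only.
SAMPLE = """age,fare,survived
22,7.25,0
38,,1
,8.05,1
35,53.1,1
"""

report = summarize_columns(SAMPLE)
for column, stats in report.items():
    print(column, stats)
```

A column with few distinct values and no missing cells (here, `survived`) is a plausible candidate for a classification target, while columns with many missing cells signal cleaning work ahead.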

There are online repositories of datasets that are specifically curated for machine learning. Those datasets are generally cleaned up in advance so that you can test algorithms swiftly.

  1. Kaggle is an online Data Science community that hosts competitions. Datasets are available on the website for both live and historical competitions, and you can download either kind after signing up, which is free. Apart from Kaggle’s competition datasets, there is an entire section where you can find datasets uploaded by users. Example: the classic Titanic challenge on Kaggle is where many beginners start.
  2. UCI Machine Learning Repository
    UCI Machine Learning Repository is always a great first stop when looking for interesting datasets. It is also one of the oldest and most important sources of datasets. Even though all the datasets are user contributed, they have impressive levels of documentation and cleanliness. The vast majority of the datasets are ready for machine learning to be applied. Examples:
      Email Spam – http://mlr.cs.umass.edu/ml/datasets/Spambase
      Solar flares – http://mlr.cs.umass.edu/ml/datasets/Solar+Flare
  3. Quandl 

Quandl is a repository of economic and financial data; some datasets are free, while others must be purchased. Because a large amount of data is available on stock prices, it is possible to build complex machine learning models that predict a price, or at least a price trend.

    Examples:

      1. Chinese economic health – https://www.quandl.com/search?query=NBSC
      2. US economic indicators – https://www.quandl.com/search?query=FRED 


    Data Set for Recommendation Engine

    Recommendation systems are an important part of our day-to-day lives and of machine learning. They fall into several categories. Let’s talk about where to find data for each one.

    1. Amazon Product Data
      1. Amazon Product data link – http://jmcauley.ucsd.edu/data/amazon/links.html
      2. SNAP – https://snap.stanford.edu/data/web-Amazon.html
    2. Movies Recommendation
      1. MovieLens – http://www.grouplens.org/node/73
      2. Yahoo! – Movie, Music and Image Ratings datasets – http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
      3. Netflix Prize Dataset – http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a
      4. Cornell University – Movie-review dataset for data mining and sentiment analysis experiments – http://www.cs.cornell.edu/people/pabo/movie-review-data/
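Rating datasets like the ones above usually boil down to (user, item, rating) records. As a hedged sketch of a first step with such data, the snippet below parses lines in a `user::item::rating::timestamp` layout into a user-to-item dictionary and computes the mean rating per item. The `::` separator is an assumption (it matches some MovieLens releases, but check the README of the file you download), the helper names are hypothetical, and the sample lines are invented.

```python
from collections import defaultdict


def parse_ratings(lines, sep="::"):
    """Build {user: {item: rating}} from delimited rating lines."""
    ratings = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, item, rating = line.split(sep)[:3]  # ignore timestamp and extras
        ratings[user][item] = float(rating)
    return dict(ratings)


def mean_rating_per_item(ratings):
    """Average every item's rating across all users who rated it."""
    totals = defaultdict(lambda: [0.0, 0])  # item -> [sum, count]
    for items in ratings.values():
        for item, rating in items.items():
            totals[item][0] += rating
            totals[item][1] += 1
    return {item: total / count for item, (total, count) in totals.items()}


# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    "1::10::5::978300760",
    "1::20::3::978302109",
    "2::10::4::978301968",
    "3::20::4::978300275",
]

ratings = parse_ratings(SAMPLE_LINES)
means = mean_rating_per_item(ratings)
print(means)
```

Per-item averages are only a popularity baseline, but the same user-to-item dictionary is a reasonable starting structure for collaborative filtering experiments.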

    Data for Data Processing Projects

    1. Yahoo Sandbox Dataset – Data for Advertising and Market, Computing Systems, Competition, Graph and Image, Social, Rating and Classification.
    2. Amazon Web Services – It provides a centralized repository of public datasets that can be seamlessly integrated into AWS cloud-based applications.
    3. YouTube-8M – A large and diverse labeled video dataset
    4. Google BigQuery Public Datasets – Not so useful for beginners, but Google provides datasets that are stored in BigQuery and made available to the general public.
    5. Microsoft Research

    Data for Data Visualization Projects

    Visualization is by far the most efficient way to present your insights on data. The goal can be “I want to make an infographic about monthly consumption of chocolate in the US” or “I want to show how income varies across the different states in the US”. There are a few things you should consider when looking for a good dataset for a visualization project.

    1. It should be interesting enough to make charts about.
    2. It should be well documented, so that the visualization is accurate.
    3. It should not be too complex.
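To make the "income by state" idea concrete, here is a minimal sketch of the aggregation step that typically comes before any chart. It uses only the standard library; the function names, the group labels "State A" and "State B", and all the numbers are invented for illustration and are not real income figures. It assumes the aggregated values are positive.

```python
from statistics import median


def median_by_group(rows, group_key, value_key):
    """Group rows (dicts) by group_key and take the median of value_key."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: median(values) for group, values in groups.items()}


def text_bar_chart(values, width=30):
    """Render {label: value} as text bars, largest first (assumes positive values)."""
    top = max(values.values())
    lines = []
    for label, value in sorted(values.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(width * value / top)
        lines.append(f"{label:<10} {bar} {value}")
    return lines


# Invented rows, for illustration only (income in thousands).
ROWS = [
    {"state": "State A", "income": 40},
    {"state": "State A", "income": 50},
    {"state": "State A", "income": 60},
    {"state": "State B", "income": 20},
    {"state": "State B", "income": 30},
    {"state": "State B", "income": 100},
]

medians = median_by_group(ROWS, "state", "income")
chart = text_bar_chart(medians)
print("\n".join(chart))
```

The median is used here rather than the mean because a single very large value (the 100 in "State B") would otherwise distort the bar; which summary to plot is a design choice that depends on the story you want the chart to tell.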

    News websites are always a good place to go for visualization datasets, as their data is cleaned and ready to use. There are also other sites where you can find data along with charts that have already been made, which you can then improve.

    1. FiveThirtyEight – A widely popular interactive news and sports website. They write interesting data-driven articles and make their datasets available online on GitHub.
    2. Socrata OpenData – An online portal of clean datasets that you can explore in the browser or download to visualize. No registration is needed to download. Most of the data is from US government sources, and many datasets are outdated.

    Apart from these, there are many other datasets available online that may help in data science projects.

    1. Jerry Smith Data Collection – A list of data from different domains, including universities, Social Sciences, Science, Finance, Machine Learning, Economics, Public domains, Government and many more.
    2. StatLib – The Carnegie Mellon University’s dataset Archive which contains data about employment, cars, bank research, baseball, colleges, healthcare, and a lot more.
    3. gov – Explore Education datasets, applications, and resources for classroom.
    4. net – It is owned and operated by OSTG, INC. and includes historic and status statistics on approximately 140,000 projects and over 1.5 million registered users’ activities at the project management web site.
    5. RBI – The Reserve Bank of India releases data daily, weekly, fortnightly, monthly, quarterly, annually and occasionally.
    6. Robert Shiller’s Data collection of Stock Market – Stock market data used in the book Irrational Exuberance [Princeton University Press 2000, Broadway Books 2001, 2nd, 2005] is available for download from this site.
    7. Open Data Census – Census data from around the globe
    8. Open Source Sports – Data about baseball, football, hockey, basketball and many more.
    9. National Space Science Data Center – Data for space exploration, planetary and astrophysics exploration and many more.

    Bonus

    data.world – It is a platform where data scientists can find and use a vast array of high-quality open data, meet other like-minded data nerds, and collaborate on data projects.

    End Notes

    The applications you can create from these datasets are not limited. I hope the datasets listed here help you turn your imagination into applications. Let us know in the comments if you’ve found these datasets useful or if you need any other kind of data. I will do my best to help.

