If you are working on a data science project, or have worked on one, chances are you have spent plenty of time hunting for interesting and useful datasets to analyze. It is fun to sit and explore different datasets until you realize that the data is not really useful after all. Thankfully, there are online resources that curate datasets, and some of them even weed out the ones that are not useful. Let’s get started with the best places to find datasets for data science projects.
Getting information off the Internet is like taking a drink from a firehose.
Do you need an enormous amount of data for your project? Do you just want to make an infographic with interesting facts and need data for it? Do you simply want to know what kind of data is available out there on the internet? If you have any of these questions, you’re in just the right place. In this post, we’ll briefly talk about various data science projects, including machine learning, data visualization, data mining, and data cleaning projects. Mainly, we will walk through where you can find datasets related to your project.
What you can make from it and How!
The sky is the limit here: how much you get out of these datasets depends on your creativity.
The most basic way to use data is to create stories (data stories, in this case) and publish them on the web. Tableau is a sturdy platform to start with. Or, if you are working on your own product, these datasets can power it by providing new insights from new input data. These are just two ways to look at it; there are hundreds of other applications for this data.
Let’s start with Machine Learning
Prediction and classification are core steps in machine learning, and you want to perform them efficiently. To do that, make sure that:
- The dataset is clean and not too messy. Otherwise, you will spend more time cleaning the data than making predictions.
- The columns are meaningfully related, and there is at least one interesting column to predict.
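One quick way to judge how messy a dataset is, before committing to it, is to count the empty values in each column. The following is a minimal sketch using only the Python standard library. The sample data and its column names (`age`, `fare`, `survived`) are made up for illustration; in practice you would read the text of a CSV file you downloaded.

```python
import csv
import io

# Made-up sample standing in for a downloaded CSV file.
# The column names are hypothetical.
SAMPLE_CSV = """age,fare,survived
22,7.25,0
,71.28,1
26,,1
35,53.10,
"""


def missing_ratio_by_column(csv_text):
    """Return {column: fraction of rows where the value is empty}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    total_rows = 0
    for row in reader:
        total_rows += 1
        for name in columns:
            value = row.get(name)
            # Treat absent fields and whitespace-only fields as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    if total_rows == 0:
        return {name: 0.0 for name in columns}
    return {name: count / total_rows for name, count in missing.items()}


if __name__ == "__main__":
    for column, ratio in missing_ratio_by_column(SAMPLE_CSV).items():
        print(f"{column}: {ratio:.0%} missing")
```

A column with a high share of missing values, especially the one you want to predict, is a sign that cleaning may take longer than modeling.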
There are online repositories of datasets that are specifically curated for machine learning. Those datasets are generally cleaned up in advance so that you can test algorithms swiftly.
- Kaggle is an online data science community that hosts competitions. Datasets are available on the site for both live and historical competitions, and you can download them after signing up, which is free. Apart from the competition datasets, there is an entire section of datasets uploaded by users. Example: the classic Titanic challenge on Kaggle is where many beginners start.
- UCI Machine Learning Repository
The UCI Machine Learning Repository is always a great first stop when looking for interesting datasets. It is also one of the oldest and most important sources of datasets. Even though all the datasets are user-contributed, they have impressive levels of documentation and cleanliness. The vast majority of the datasets are ready for machine learning to be applied. Examples:
  - Email spam – http://mlr.cs.umass.edu/ml/datasets/Spambase
  - Solar flares – http://mlr.cs.umass.edu/ml/datasets/Solar+Flare
- Quandl is a repository of economic and financial data. Some datasets are free, while others require purchase. Because a large amount of stock-price data is available, it is possible to build complex machine learning models that predict prices, or at least price trends. Examples:
  - Chinese economic health – https://www.quandl.com/search?query=NBSC
  - US economic indicators – https://www.quandl.com/search?query=FRED
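As a much simpler stand-in for the models mentioned above, the direction of a price trend can be estimated by fitting a straight line to a price series with ordinary least squares. This is only a sketch: the function name and the prices are made up for illustration, and a straight-line fit is not a real forecasting model.

```python
def trend_slope(prices):
    """Least-squares slope of prices against their index (price change per step)."""
    n = len(prices)
    if n < 2:
        raise ValueError("need at least two prices to estimate a trend")
    mean_x = (n - 1) / 2  # mean of the indices 0..n-1
    mean_y = sum(prices) / n
    covariance = sum((i - mean_x) * (p - mean_y) for i, p in enumerate(prices))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


if __name__ == "__main__":
    # Made-up closing prices, for illustration only.
    closing_prices = [100.0, 101.5, 101.0, 103.0, 104.5]
    slope = trend_slope(closing_prices)
    direction = "upward" if slope > 0 else "downward or flat"
    print(f"trend is {direction} ({slope:.2f} per step)")
```

A positive slope suggests an upward trend over the window and a negative slope a downward one. Real price data is noisy, so treat this as a first look, not a prediction.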
Data Set for Recommendation Engine
Recommendation systems are an important part of our day-to-day lives and of machine learning. They fall into several categories. Let’s talk about where to find data for each category.
- Amazon Product Data
  - Amazon Product data link – http://jmcauley.ucsd.edu/data/amazon/links.html
  - SNAP – https://snap.stanford.edu/data/web-Amazon.html
- Movies Recommendation
  - MovieLens – http://www.grouplens.org/node/73
  - Yahoo! – Movie, Music and Image Ratings datasets – http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
  - Netflix Prize Dataset – http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a
  - Cornell University – Movie-review dataset for data mining and sentiment analysis experiments – http://www.cs.cornell.edu/people/pabo/movie-review-data/
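Ratings datasets like the ones above usually come down to (user, item, rating) records. The sketch below shows one common approach, user-based collaborative filtering with cosine similarity, on a handful of made-up ratings. The user names, titles, and function names are hypothetical, and real datasets are far larger and need more careful handling.

```python
import math

# Made-up ratings in a (user, item, rating) shape, for illustration only.
RATINGS = [
    ("ann", "Alien", 5), ("ann", "Heat", 4), ("ann", "Up", 1),
    ("bob", "Alien", 4), ("bob", "Heat", 5), ("bob", "Jaws", 5),
    ("cat", "Up", 5), ("cat", "Jaws", 2),
]


def build_profiles(ratings):
    """Group ratings into {user: {item: rating}}."""
    profiles = {}
    for user, item, rating in ratings:
        profiles.setdefault(user, {})[item] = rating
    return profiles


def cosine_similarity(a, b):
    """Cosine similarity of two rating dicts; unrated items count as zero."""
    shared = set(a) & set(b)
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user, profiles):
    """Rank items the user has not rated by similarity-weighted ratings of others."""
    own = profiles[user]
    scores = {}
    for other, other_ratings in profiles.items():
        if other == user:
            continue
        weight = cosine_similarity(own, other_ratings)
        for item, rating in other_ratings.items():
            if item not in own:
                scores[item] = scores.get(item, 0.0) + weight * rating
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    profiles = build_profiles(RATINGS)
    print(recommend("cat", profiles))
```

Users whose ratings resemble yours get more say in what is recommended to you. Item-based and content-based recommenders follow the same pattern with a different notion of similarity.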
Data for Data processing projects
- Yahoo Sandbox Dataset – Data for Advertising and Market, Computing Systems, Competition, Graph and Image, Social, Rating, and Classification.
- Amazon Web Services – It provides a centralized repository of public datasets that can be seamlessly integrated into AWS cloud-based applications.
- YouTube-8M – A large and diverse labeled video dataset.
- Google BigQuery Public Datasets – Not so useful to beginners, but Google provides datasets that are stored in BigQuery and made available to the general public.
- Microsoft Research
Data for Data Visualization Projects
Visualization is by far the most efficient way to present your insights on data. The goal might be “I want to make an infographic about monthly consumption of chocolates in the US” or “how income varies across the different states in the US”. There are some things you should consider when looking for a good dataset for a visualization project.
- It should be interesting enough to make charts about
- It should be well documented so that the visualization is accurate
- It should not be complex
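Before opening a charting tool, it can help to check whether a dataset has a story worth charting. The sketch below averages a value per group and prints a rough text bar chart, using only the Python standard library. The state codes and income figures are made up for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

# Made-up (state, income) records, for illustration only.
RECORDS = [("CA", 82000), ("CA", 78000), ("TX", 64000), ("TX", 70000), ("NY", 75000)]


def average_by_group(records):
    """Return {group: mean value} for (group, value) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        totals[group] += value
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


def text_bar_chart(values, width=40):
    """Render one line per group, largest first; assumes positive values."""
    if not values:
        return []
    largest = max(values.values())
    lines = []
    for group, value in sorted(values.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(width * value / largest)
        lines.append(f"{group:<4}{bar} {value:,.0f}")
    return lines


if __name__ == "__main__":
    for line in text_bar_chart(average_by_group(RECORDS)):
        print(line)
```

If the bars look nearly identical, the dataset may not be interesting enough to chart. If they differ clearly, it is worth moving to a proper visualization tool.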
News websites are always a good place to go for visualization datasets, as their data is cleaned and ready to use. There are also other sites where you can find data along with charts that have already been made, which you can improve on.
- FiveThirtyEight – A widely popular interactive news and sports site that publishes interesting data-driven articles. FiveThirtyEight makes its datasets available online on GitHub.
- Socrata OpenData – An online portal of clean datasets that you can explore in the browser or download to visualize. No registration is needed to download. Most of the data comes from US government sources, and much of it is outdated.
Apart from these, there are many other datasets available online that may help in data science projects.
- Jerry Smith Data Collection – A list of data from different domains, including universities, social sciences, science, finance, machine learning, economics, the public domain, government, and many more.
- StatLib – Carnegie Mellon University’s dataset archive, which contains data about employment, cars, bank research, baseball, colleges, healthcare, and a lot more.
- gov – Explore Education datasets, applications, and resources for the classroom.
- net – It is owned and operated by OSTG, INC. and includes historic and status statistics on approximately 140,000 projects and over 1.5 million registered users’ activities at the project management web site.
- RBI – The Reserve Bank of India releases data on a daily, weekly, fortnightly, monthly, quarterly, and annual basis, as well as occasionally.
- Robert Shiller’s Data collection of Stock Market – Stock Market data used in the book, Irrational Exuberance [Princeton University Press 2000, Broadway Books 2001, 2nd, 2005] is available for download from this site.
- Open Data Census – Census data from around the globe
- Open Source Sports – Data about baseball, football, hockey, basketball and many more.
- National Space Science Data Center – Data for space exploration, planetary and astrophysics exploration and many more.
- data.world – It is a platform where data scientists can find and use a vast array of high-quality open data, meet other like-minded data nerds, and collaborate on data projects.
The applications you can create from these datasets are limitless. I hope the datasets listed here help you turn your ideas into applications. Let us know in the comments if you’ve found these datasets useful or if you need any other kind of data; I will do my best to help. Take the Data Science course at Digital Vidya to give your data science projects and learning your full attention.