Useful Links

As I work through my self-directed education in data science, I'll post useful links here, with thoughts on the content and notes on my own progress.

Basics

Math
  • Linear Algebra - intro/refresher (PDF)
  • Linear Algebra - CS Applications
  • Forecasting
  • Bayes Theorem
  • Statistics
  • Probability and Stats for Programmers

CS Fundamentals
Maybe you have a Mac. Lucky you. I have a PC and decided to dual-boot Ubuntu. It's another learning curve, but it has been worth it. Develop in Windows at your own peril (jk... kind of).
  • Introduction to the Command Line
  • CL Shortcuts
  • Linux / Ubuntu
  • Introduction to GitHub

Python
R and Python are both mature and thriving enough to support your data science ambitions. Ideally you'll be fluent in both, as each has its advantages. I'm ignoring that advice, however, and focusing on Python right now.
  • Introduction
  • Anaconda
  • IPython Notebook - A browser-based interactive environment. Comes with Anaconda
  • Pandas - For R users, this is where you'll find dataframes

Databases and SQL
  • SQL
  • Databases - MySQL
  • Databases - PostgreSQL

Website Interaction
  • Retrieving data from URLs
  • Retrieving files from URLs
  • APIs - GET, POST, PUT, DELETE
  • Writing a scraper
  • XML/JSON
  • Reading text data from PDFs

Machine Learning

Introductions/Opinion
  • Open source data science masters
  • A few useful things to know about Machine Learning (PDF)
  • Machine Learning on Coursera (loved this course)
  • Machine Learning by Tom Mitchell (PDF, book)

Data Assessment and Cleansing
  • Data wrangling and exploration
  • Working with dates
  • Removing records based on missing/incomplete data
  • Working with subsamples of datasets
  • Converting dataframe into array
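
The four tasks listed above can be sketched in a few lines of pandas. This is only a minimal illustration on made-up data: the column names (signup, age, spend) and the values are hypothetical, and dropping every row with a missing value is just one of several reasonable ways to handle incomplete records.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are invented for illustration.
raw = pd.DataFrame({
    "signup": ["2014-01-05", "2014-02-17", "2014-03-02", "2014-03-20"],
    "age": [34, np.nan, 29, 41],
    "spend": [120.0, 80.5, np.nan, 15.25],
})

# Working with dates: parse the strings, then pull out a date component.
raw["signup"] = pd.to_datetime(raw["signup"])
raw["signup_month"] = raw["signup"].dt.month

# Removing records with missing data: drop any row that contains a NaN.
clean = raw.dropna()

# Working with a subsample: draw rows at random, reproducibly.
subsample = clean.sample(n=1, random_state=0)

# Converting a dataframe into an array (e.g. to feed a learning algorithm).
features = clean[["age", "spend"]].to_numpy()
```

Here only the first and last rows survive dropna(), so features is a 2 x 2 NumPy array. On real data it is usually worth checking how many rows are lost before dropping them.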

Data Mining and Text Analysis
  • Text Analysis - Regular Expressions
  • Natural Language Processing
  • Text Analysis - Latent Dirichlet Allocation
  • Python NL Toolkit (NLTK)
  • Principal Component Analysis

Visualization - Basic
In Python, pandas has some basic plotting functionality and is a good place to start when working with a dataset. Matplotlib goes beyond this, allowing for some R-worthy segmentation.
  • pandas
  • Histograms
  • Scatter Plots
  • Scatter Matrix
  • Trellis Plots
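
As a rough sketch of the pandas plotting mentioned above, the snippet below draws a histogram, a scatter plot, and a scatter matrix from a small randomly generated dataset. The column names (height, weight, age) and the data are assumptions made up for the example, and the Agg backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this runs without a display

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical dataset: 200 random rows with made-up column names.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 200),
    "weight": rng.normal(70, 12, 200),
    "age": rng.integers(18, 65, 200),
})

# Histogram of a single column.
hist_ax = df["height"].hist(bins=20)

# Scatter plot of two columns against each other.
scatter_ax = df.plot.scatter(x="height", y="weight")

# Scatter matrix: every numeric column plotted against every other one.
matrix_axes = scatter_matrix(df, figsize=(6, 6))
```

Each call returns matplotlib axes, so the plots can be customized further with ordinary matplotlib methods. To actually see them, save a figure (for example scatter_ax.figure.savefig("scatter.png")) or use an interactive backend or notebook instead of Agg.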

Visualization - Maps & Interactive
  • GIS
  • GeoJSON
  • Choropleth maps
  • D3.js
  • Interactive maps with D3 and Leaflet

Visualization - Advanced
  • Django Development

Machine Learning - Supervised
  • Linear Regression - Gradient Descent
  • Linear Regression - Normal equation (linear algebra solution)
  • Logistic Regression
  • Time Series Analysis
  • Decision Trees and Random Forests
  • Naive Bayes Classification
  • TF-IDF

Machine Learning - Unsupervised
  • K Means
  • K Nearest Neighbors
  • Collaborative Filtering
  • Restricted Boltzmann Machines
  • Hidden Markov Models
  • Neural Networks
  • Image recognition and analysis

Big Data

Hadoop
  • HDFS
  • MapReduce
  • Hive
  • Pig
  • Impala
  • Spark (Apache)

NoSQL databases
  • Cassandra (Apache)
  • MongoDB

AWS (Amazon Web Services)
  • EC2 (Elastic Compute Cloud)
  • Amazon Redshift


  James Beveridge