Open Source Data Science Master – The Plan

Free education platforms have put some of the world’s most prestigious courses online in the last few years. This is our plan to use them to build our own custom open source data science Master. Before we begin, though, in the spirit of openness we should explain where we are starting from:

We both have Physics degrees and are comfortable with maths, logic, algorithms and manipulating data. But perhaps most importantly, we enjoy this type of work. The program we have designed does not require any prior knowledge of the topics below; however, we do feel it is an advantage to come from a numerical background.

So what does it take to become a data scientist?

Statistics

Statistics is perhaps the starting point of data science. For most questions in the world we have neither measured every phenomenon nor asked every person what they think; instead we have a small recorded subset of conversations and measurements. Statistics helps us understand what we can and, just as importantly, cannot reasonably learn from that smaller group.
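To make the sample-versus-population idea concrete, here is a minimal sketch in Python. The population is entirely made up (random numbers standing in for measurements we could never collect in full), and the helper name `mean_with_interval` is our own invention for illustration. It uses the usual normal approximation for a 95% interval, which is only reasonable for fairly large samples.

```python
import math
import random
import statistics


def mean_with_interval(sample, z=1.96):
    """Return the sample mean and an approximate 95% confidence interval."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    stderr = statistics.stdev(sample) / math.sqrt(n)
    return mean, (mean - z * stderr, mean + z * stderr)


# Hypothetical population: 100,000 made-up measurements.
rng = random.Random(42)
population = [rng.gauss(170.0, 10.0) for _ in range(100_000)]

# In real life we only get to measure a small subset, here 200 values.
sample = rng.sample(population, 200)
mean, (low, high) = mean_with_interval(sample)

print(f"sample mean = {mean:.1f}, 95% interval = ({low:.1f}, {high:.1f})")
print(f"true mean   = {statistics.fmean(population):.1f}")
```

The interval is the "what we can reasonably learn" part: it gets narrower as the sample grows, and it says nothing about questions the sample was never designed to answer.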

Visualisation

“There is no such thing as information overload. There is only bad design.” Edward Tufte

There is no point having a story to tell but being unable to tell it. With the number of new visualisation tools springing up each year there is no excuse not to make your story beautiful and compelling. This involves elements of design and artistic principles, not things you pick up in your average Physics warehouse.

Programming

As far as I understand, programming is making the computer do what you can’t be bothered, or do not have a long enough life span, to do yourself. Most analysis is crunched by programs, and now most beautiful data visuals are drawn by them. Although we had done basic programming before (simple loops and stringing together conditional statements), we needed programming as the glue that tied everything else together. As a data scientist, you could probably get away with a certain level of R or Python, but you’d be reliant on your back-end developers to retrieve and manipulate data and on front-end developers to showcase it for you! Limiting, huh?

Data Manipulation

Having both worked as data scientists before leaving for Thailand, we quickly understood that the majority of Data Science is actually finding, cleaning and reformatting data. Although it doesn’t sound exciting, a thorough understanding of current data formats, querying databases and building interfaces for your data models will allow you to work well in a team and more importantly actually leave the office on time!
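As a small illustration of what "cleaning and reformatting" tends to mean in practice, here is a sketch using only the Python standard library. The messy input, the column names and the `clean_row` helper are all hypothetical, made up for this example. It trims whitespace, normalises capitalisation, turns the various "missing" markers into proper nulls and converts a CSV export to JSON.

```python
import csv
import io
import json

# Hypothetical messy export: stray whitespace, mixed case, missing values.
RAW = """name, city ,signup_date,visits
 Alice ,bangkok,2014-01-03,12
BOB,Chiang Mai,,n/a
carol , BANGKOK ,2014-02-11, 7
"""

# Strings that should be treated as "no value" (an assumption for this data).
MISSING = {"", "n/a", "na", "null"}


def clean_row(row):
    """Trim whitespace, normalise case and turn missing markers into None."""
    cleaned = {}
    for key, value in row.items():
        value = value.strip()
        cleaned[key.strip()] = None if value.lower() in MISSING else value
    for field in ("name", "city"):
        if cleaned[field] is not None:
            cleaned[field] = cleaned[field].title()
    if cleaned["visits"] is not None:
        cleaned["visits"] = int(cleaned["visits"])
    return cleaned


rows = [clean_row(r) for r in csv.DictReader(io.StringIO(RAW))]
print(json.dumps(rows, indent=2))
```

Real datasets need more care than this (dates, encodings, duplicates), but the shape of the work is the same: read, normalise, validate, write out in the format the next person needs.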

Machine Learning & Algorithms

This topic is broad, with a master’s worth of active research in numerous fields. However, it is also where our motivation to become data scientists came from: building self-driving cars, identifying people based on their ear lobes :). The common thread is teaching a computer to recognise patterns as the human brain does, whether in oncoming traffic or YouTube cats. We will be studying some of the most common current tools for uncovering patterns, but it is an active field and a matter of lifelong learning.

The Plan

This is our study plan: it is divided into two terms (circa 2 months each) and contains the courses we have found most suitable for learning each of the above topics. NB: it also contains time to work on projects that apply what we are learning, to mimic the structure of a Masters that would include a final project.

Pre-Requisites & Pre-Read

We completed the courses below whilst still in our corporate jobs, in the evenings and weekends, before fully embarking on this journey and hence will consider these courses as pre-requisites because some of the new courses build upon them.

– Machine Learning
– Computing for Data Analysis
– Data Analysis
– Programming Methodology: CS106A

Also, as we didn’t have previous JavaScript experience and needed some concepts for the visualisation course, we set reading Eloquent JavaScript as a pre-read.

Term 1

Time      Mon         Tue         Wed         Thu         Fri         Sat
9 – 10    NLP         CS106B      NLP         CS106B      CS106B
10 – 11   NLP         CS106B      NLP         CS106B      CS106B      Spanish¹
11 – 12   NLP         CS106B      NLP         CS106B      Stats Work  Spanish
12 – 13   Lunch       Lunch       Lunch       Lunch       Lunch       Lunch
13 – 14   NLP         Stats       NLP         Stats       Vis
14 – 15   DB          Stats       DB          Stats       Vis         Catch-up²
15 – 16   DB          Stats Work  DB          Stats Work  DB          Catch-up
16 – 17   CS106B      Vis         CS106B      Vis         DB
17 – 18   CS106B      Vis         CS106B      Vis

Term 2

Time      Mon         Tue         Wed         Thu         Fri         Sat         Sun
9 – 10    CS169³      Ruby        CS169       Project⁴    CS169       Project
10 – 11   CS169       Ruby        CS169       Project     CS169       Project     Spanish
11 – 12   CS169       RoR         CS169       Project     CS169       Project     Spanish
12 – 13   Lunch       RoR         Lunch       Project     Lunch       Project     Lunch
13 – 14   CS169       Lunch       RoR         Project     Lunch       Project
14 – 15   API         Web Dev⁵    API         Project     API         Project
15 – 16   Choice⁶     CS169       Choice      Project     Choice      Project
16 – 17   Choice      CS169       Choice      Project     Choice      Project
17 – 18   Choice      CS169       Choice      Project     Choice      Project

¹ Yep, we thought it would be fun and a useful skill to learn Spanish!
² Whatever work needs catching up on, as and if we fall behind…
³ Part 2 of CS169 is also running on edX.
⁴ Project work hasn’t yet been defined, as we hope ideas will finalise as we work through the programme.
⁵ This hour will be a mix of web development skills, such as Twitter Bootstrap.
⁶ This course will be a choice of the following, or another that we decide on nearer the time as our skills and interests become more defined:
– Programming Paradigms CS106C
– Probability and Random Variables
– Developing iOS 7 Apps for iPhone and iPad
selection from


15 thoughts on “Open Source Data Science Master – The Plan”

  1. Pingback: 4 data science examples after 1 month studying | 6 months to becoming a data scientist

  2. Pingback: “So, what does a Data Scientist do?” | Random Walks

  3. Good point, hadn’t thought that iOS isn’t exactly open source either. But that’s a long way away at the moment; we’re working now on Harvard’s Visualisation course, so will be happy if we can get browser-based visualisations soon!! Good luck, glad you liked the plan!

    • Thanks for sharing that, it’s a useful diagram; we’ll try to visit as many stations as we can 🙂 The main missing bit is probably around Big Data, hoping CS169 next term will cover some of the basics…

  4. Pingback: Recommended online class | Xuan Tuan Trinh's Blog

  5. Very interesting post!! I am in a similar boat, and your studying schedule was immensely helpful!
    It would be great to find out how you are making out, and what you would do differently if you were to start over.

    • Hi Syed, thanks for the comment. Really glad you found the schedule helpful. We are currently working on a project that explains how it has all worked out so far, which we’ll hopefully share on the blog soon. Also, great idea about sharing what we would do differently; the open source world is moving so fast that we have already adapted part of our program!

  6. Hey guys. This is truly inspiring. As someone who lived in Asia for 9 months and travelled around South East Asia, I can for sure see why you two picked Thailand as your destination. I completely agree with your point – “the majority of Data Science is actually finding, cleaning and reformatting data”.

    I am currently working on a visual data scraping tool. I was hoping to connect with you two and get to know more about this “hassle” of collecting data. Any insights from your learning journey would be amazing. We are really trying to make a tool that makes this problem go away.

    Thanks so much 🙂 🙂

    • Hey Angelina,

      Thanks very much for the comment, and well done on your web-scraping tool; we’ve had a look at the tutorial and it looks like you already have a good prototype. Anyway, in case it is helpful, here are some thoughts on what we need from the web scraping and data munging process:

      Data Output Options:
      – Any data we scrape we would probably want to output in csv, tsv or JSON format

      Application:
      – If the dataset is small then we would probably just copy/paste it from the website ourselves and format it in a text editor
      – If the dataset has the same format but spans several webpages then it would be useful to have a tool which opens each page and extracts the data; currently we would probably use a Python script, or jQuery if the data is being scraped from the DOM
      – The formats which cause particular problems are likely to be geo-data or network data (i.e. node/links) and it would be useful to have a tool that understood these data types and again allowed you to output them as JSON or csv
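The multi-page case above can be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not a finished scraper: the pages are hard-coded hypothetical HTML strings (in practice each would come from an HTTP request), every page is assumed to hold one table with a header row, and the `TableRows` and `scrape` names are made up for this example.

```python
import csv
import io
import json
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape(pages):
    """Parse pages that share one table layout into a list of dicts."""
    records = []
    for html in pages:
        parser = TableRows()
        parser.feed(html)
        if not parser.rows:
            continue  # no table found on this page
        header, *body = parser.rows
        records.extend(dict(zip(header, row)) for row in body)
    return records


# Hypothetical pages with the same format.
pages = [
    "<table><tr><th>item</th><th>price</th></tr>"
    "<tr><td>apple</td><td>1.20</td></tr></table>",
    "<table><tr><th>item</th><th>price</th></tr>"
    "<tr><td>pear</td><td>0.90</td></tr></table>",
]
records = scrape(pages)

# JSON output.
as_json = json.dumps(records, indent=2)

# CSV output (pass delimiter="\t" to DictWriter for tsv instead).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

Geo-data and network data would need their own record shapes (for example separate node and link lists) before being written out, which is exactly where a tool that understood those types would save time.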

  7. Pingback: Revised Plan – 6 months to becoming a data scientist | Open Source Data Science Masters

  8. Great. This is so structured. Thanks a lot for sharing. Do you guys have a GitHub repository where you’ve shared your work?

  9. Pingback: Embarking on the journey. | Going the Data Science way.
