Friday, May 23, 2014

IPython Notebooks for StatLearning Exercises


Earlier this year, I took StatLearning: Statistical Learning, a free online course taught by Stanford University professors Trevor Hastie and Rob Tibshirani. They are also co-authors of The Elements of Statistical Learning (ESL) and of its less math-heavy sibling, An Introduction to Statistical Learning (ISL). The course was based on the ISL book, and each week's videos were accompanied by hands-on exercises in R.

I personally find it easier to work with Python than R. R seems to have grown organically with very little central oversight, so function and package names are often non-intuitive, and functionality is frequently duplicated or overlapping. In general, an educated guess about an R function has about the same likelihood of being right as a completely random one: unless you know the function or package, your chances are 50-50. With Python, on the other hand, an educated guess has a 40-90 percent chance of being right, depending on the library and how educated your guess was. So while the good profs were patiently explaining the R code, I was mostly busy fantasizing about writing all of it in Python some day.

At the time, I had worked a bit with scikit-learn and NumPy. I had heard about Pandas and knew it was the Python implementation of DataFrames, but hadn't actually worked with it. Over the past couple of months, I have had the opportunity to work with Pandas and IPython Notebooks for a project I did with my kids, and as a result I now quite enjoy the power and expressivity that these libraries provide.

So I decided to apply my newly acquired skills to do this rewrite. One of my incentives was the chance to get a fairly comprehensive guided tour of scikit-learn algorithms that I wouldn't normally use. Of course, the tour depends a lot on the guide, and the course is taught from the point of view of a statistician rather than a machine learning person. Since my toolchain (scikit-learn, NumPy, SciPy, Pandas, Matplotlib and a bit of statsmodels) is geared more towards machine learning, there were times when I wasn't able to replicate the functionality completely and accurately.
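To make the statistician-versus-machine-learning difference concrete, here is a minimal sketch (an illustration, not code taken from the notebooks) of the kind of output R's `summary(lm(...))` reports as a matter of course: coefficient standard errors and t-statistics. scikit-learn's `LinearRegression` exposes the fitted coefficients but not these inferential quantities, so the sketch computes them with plain NumPy. The function name `ols_summary` and the synthetic data are hypothetical, chosen only for this example.

```python
import numpy as np


def ols_summary(X, y):
    """Fit ordinary least squares with an intercept and return
    (coefficients, standard errors, t-statistics) as NumPy arrays."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(n), X])
    p = design.shape[1]
    # Least-squares solution of design @ beta ~= y.
    beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    # Unbiased estimate of the noise variance: RSS / (n - p).
    sigma2 = residuals @ residuals / (n - p)
    # Covariance of the coefficient estimates: sigma^2 * (X'X)^-1.
    cov = sigma2 * np.linalg.inv(design.T @ design)
    std_err = np.sqrt(np.diag(cov))
    t_stats = beta / std_err
    return beta, std_err, t_stats


# Synthetic data (hypothetical): y = 1 + 2*x1 + noise; x2 is irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

beta, std_err, t_stats = ols_summary(X, y)
for name, b, se, t in zip(["intercept", "x1", "x2"], beta, std_err, t_stats):
    print(f"{name:>9}: coef={b: .3f}  se={se:.3f}  t={t: .2f}")
```

In practice statsmodels provides this kind of summary directly through its OLS model, which is one reason it earns a place in the toolchain alongside scikit-learn.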

There are 9 notebooks, corresponding to the exercises for Chapters 2-10 of the course. The notebooks and data can be found on my GitHub in the project statlearning-notebooks. You can also read the notebooks directly on nbviewer.ipython.org via the links in the README.md file.


This exercise introduced me to a lot of scikit-learn algorithms that I had not used before. Since there are quite a few functionality mismatches between R and scikit-learn, trying to match R's behavior often led me to novel ideas described on sites like Stack Overflow and Cross Validated, some of which I implemented (and others I have linked to). I also learned quite a bit about plotting with Matplotlib, since the original exercises use R's rich plotting features as a matter of course, some of which require additional work in Python.
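As one example of the kind of mismatch described above: the ISL lab on best subset selection uses R's `regsubsets` from the `leaps` package, and as far as I know scikit-learn has no exhaustive equivalent. The sketch below is not taken from the notebooks; it is a brute-force illustration using only NumPy and `itertools`, and the function name `best_subset` and the synthetic data are hypothetical.

```python
from itertools import combinations

import numpy as np


def best_subset(X, y, max_size=None):
    """For each subset size k, find the feature indices with the lowest
    residual sum of squares (RSS). Returns a dict {k: (indices, rss)}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    max_size = p if max_size is None else min(max_size, p)
    best = {}
    for k in range(1, max_size + 1):
        for subset in combinations(range(p), k):
            # Design matrix: intercept plus the chosen columns.
            design = np.column_stack([np.ones(n), X[:, list(subset)]])
            beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
            resid = y - design @ beta
            rss = float(resid @ resid)
            # Keep the subset with the smallest RSS seen so far for this size.
            if k not in best or rss < best[k][1]:
                best[k] = (subset, rss)
    return best


# Synthetic data (hypothetical): only columns 0 and 3 drive the response.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.3, size=150)

result = best_subset(X, y)
for k, (subset, rss) in result.items():
    print(f"size {k}: features {subset}, RSS = {rss:.2f}")
```

Training RSS can only go down as the subset grows, so picking among the sizes still needs cross-validation or a criterion such as Cp or BIC. The search fits 2^p - 1 models, which is only practical when the number of features is small.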

Overall, I found that this group of Python libraries was more than adequate for most tasks in the exercises, and (at least in my eyes) resulted in cleaner, more readable code. Take a look at these pages to get an overview of what scikit-learn and Pandas, my two top level libraries, can do. However, R also offers lots of functionality: there is a lot of overlap, but in some cases R provides algorithms that scikit-learn doesn't. Then again, scikit-learn has many more algorithms than R in other areas. So it makes sense to learn and use both as needed.

If you are considering using my group of Python libraries for data analysis, then the notebooks should be useful as examples. For more advanced programmers, if you think there are better ways to do something than what I have done, I would appreciate hearing from you (or since it's on GitHub, a pull request would be good too!).

10 comments (moderated to prevent spam):

Unknown said...

I'd love to contribute, but this project deserves its own Github repository!

Sujit Pal said...

Thanks Ted, this was just slightly more than a week's worth of spare time hacking, so I didn't think it deserved another little repository. But if enough people want it, I can move this into its own...

Pronojit said...

Kudos to you! Was thinking of doing it myself. Great to see your implementation.

Robert said...

Great post! I had the same feeling when going through the course-- that I wished the examples were in Python rather than R.
Thanks for sharing your work.
Like Ted, I'd love if you created a git repo for these notebooks.
Cheers-

Sujit Pal said...

Thanks for the kind words. Based on popular demand (more popular than my decision anyway :-)), I have created a new project statlearning-notebooks. I have updated the links in the post (and in the README.md file on the project) as well.

Unknown said...

Thankfully the authors relegated all the examples to the end of the chapter which makes it easier to know which parts to translate from R to Python and does not get in the way of learning the theory.

fastzhong said...

Thanks, useful work.

Sujit Pal said...

You are welcome fastzhong.

Unknown said...

Great work!

Sujit Pal said...

Thanks Yakattack, glad you liked it :-).