User:Halfak (WMF)/WMF research libraries
I'd like to substantially upgrade and consolidate our (WMF's) Python code for research, in preparation for some dramatic improvements to my analysis/development environment. I'll use this page to document some of those ideas.
Python 3
Transitioning from Python 2.7 to 3 is annoying, so I plan to bundle it with a larger transition. I'm also hoping to move away from R (love the community, hate the language) for statistical work. That move will rely heavily on support from numpy and scipy.
Python as an analysis environment
IPython Notebook
The environment is relatively straightforward. I found myself picking up markdown in a matter of minutes. It's fun to run code and then complain about what happened. I do have a few complaints. For example, I have to reach for my mouse to switch from code mode to markdown mode. However, the system mostly just works, and it's much smarter and cleaner than an R document. 22:24, 4 November 2013 (UTC)
Pandas for data tables
I just finished a quick run through the Pandas documentation and checked for some of the functionality that I regularly use in R. Most of it is there, but quite a lot of the transformations and filtering I'd like to do are a little quirky and convoluted. I find myself missing R's data.table a lot, but I can do what I need to do. 22:24, 4 November 2013 (UTC)
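As a rough illustration of the kind of filter-then-aggregate step in question, here is a minimal sketch. The table, its column names (`user`, `namespace`, `bytes_changed`) and its values are invented for illustration, and the named-aggregation form of `agg` assumes a reasonably recent pandas release.

```python
import pandas as pd

# Hypothetical revision table; the columns and values are made up for illustration.
revisions = pd.DataFrame({
    "user": ["Alice", "Bob", "Alice", "Carol", "Bob"],
    "namespace": [0, 0, 1, 0, 0],
    "bytes_changed": [120, -30, 45, 10, 300],
})

# Roughly what data.table expresses as
#   DT[namespace == 0, .(edits = .N, bytes = sum(bytes_changed)), by = user]
article_edits = revisions[revisions["namespace"] == 0]  # filter rows
per_user = (
    article_edits
    .groupby("user")                                     # group by user
    .agg(edits=("bytes_changed", "count"),               # rows per user
         bytes=("bytes_changed", "sum"))                 # total bytes per user
    .reset_index()                                       # make "user" a column again
)
print(per_user)
```

The filter, the grouping and the aggregation are three separate calls here, where data.table packs them into a single bracket expression. That is probably a fair picture of the extra ceremony.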
Plotting with Bokeh
I ran through a few of the examples for plotting in IPython notebook. The library seems quite capable, but it's not ready. For example, geom_errorbar, one of my favorite functions in R, is missing, and that's just one example. I think I'll try out bokeh another time, but I'm worried that moving away from my awesome R plotting environment will make me less productive. 22:24, 4 November 2013 (UTC)
Map reduce
Python streaming
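This section is still a placeholder. As a rough sketch only, and assuming "Python streaming" means streaming-style map reduce jobs that exchange tab-separated lines, a mapper/reducer pair might look like the following. The input format (`rev_id<TAB>user`) and the function names are hypothetical.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'user<TAB>1' for every input row of 'rev_id<TAB>user'."""
    for line in lines:
        rev_id, user = line.rstrip("\n").split("\t")
        yield f"{user}\t1"


def reducer(lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for user, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{user}\t{total}"


# Hypothetical input rows; sorted() stands in for the framework's shuffle/sort step.
rows = ["1\tAlice\n", "2\tBob\n", "3\tAlice\n"]
mapped = sorted(mapper(rows))
reduced = list(reducer(mapped))
print(reduced)
```

In a real streaming job the two functions would read from stdin and write to stdout as separate scripts. Keeping them as generators over lines makes them easy to test locally first.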
Utilities
There are two sets of problems that I'd like to solve with a set of utilities.
- Common scripts (see clize)
  - A set of utility scripts for extracting information from the database/dumps/etc. and performing operations and transformations or gathering stats.
- Common utilities
  - A set of Python modules that support extension of these common actions (e.g. XML dump processing scripts & Persistence) or data transformations (e.g.
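To make the "XML dump processing" case concrete, here is a minimal sketch of a reusable module function that streams revisions out of a dump-like XML file using only the standard library. The function names (`iter_revisions`, `local_name`) and the tiny sample document are hypothetical, and the sketch assumes a simplified `page`/`title`/`revision`/`id`/`text` layout rather than the full export schema.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(source):
    """Yield (page_title, rev_id, text_length) for each revision.

    Uses iterparse so the whole dump never has to be held in memory.
    """
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id, text = None, ""
            for child in elem:  # direct children only, so nested ids are ignored
                child_name = local_name(child.tag)
                if child_name == "id":
                    rev_id = int(child.text)
                elif child_name == "text":
                    text = child.text or ""
            yield title, rev_id, len(text)
            elem.clear()  # drop the revision's children once it has been processed


# Tiny made-up document standing in for a real dump file.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision><id>1</id><text>Hello</text></revision>
    <revision><id>2</id><text>Hello world</text></revision>
  </page>
</mediawiki>"""

revisions = list(iter_revisions(io.BytesIO(SAMPLE)))
print(revisions)
```

A common script could then be a thin command-line wrapper around a generator like this one, which keeps the parsing logic importable by other tools.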
User stats ~ Wikimetrics
It's important that any statistics generation/extraction is closely tied to m:Wikimetrics or we'll end up duplicating a bunch of work.