I am currently enrolled in an applied statistics program. For my course I really want to use the IPython Notebook. It allows me to do the calculations and show formulas all in one document. I have been trying figure out how to use Python instead of a statistics program. The functionality is all there, but is spread across different libraries, and there are many choices when it comes to graphing. In this post I will show to do some simple statistics (mostly related to linear regression) using Python.
Requirements
- numpy
- scipy
- pandas
- statsmodels
- matplotlib
- seaborn
Getting data into Python
Data is typically in some column-based format (either in Excel, or a tab/comma delimeted file). CSV is the easiest to import, so use Excel to export whatever you have into CSV. Then Pandas has a method to quickly get data from a CSV file into Python (into a Pandas DataFrame object). Pandas provides a lot of functionality, but at its core is the DataFrame and Series object.
from pandas import DataFrame
data = DataFrame.from_csv("filename.csv", index_col=None))
The DataFrame is made up of Series objects (one for each column of your data). You can easily get to a Series by indexing into the DataFrame.
data["Column1"]
The columns are basically Numpy arrays- you can easily apply an operation to every value in the column.
Simple scatter plot
One of the frustrating things about getting up to speed with performing statistics in Python is that there is often more than one way to accomplish something. For example, Pandas allows you to easily create a scatter plot of two variables, but you can also do this directly through Matplotlib. In general, in my limited experience, if you can avoid dropping down to matplotlib, then do so. People have spent a lot of time worrying about the details so you don't need to. Unfortunately, it is likely you will need to learn it at some point to get exactly what you want.
With pandas
data.plot(x="Column1", y="Columns2",kind="scatter")
Matplotlib
column1 = data["Column1"]
column2 = data["Column2"]
fig = plt.figure()
axes = fig.add_axes([0, 0,.8, .8])
axes.scatter(column1, column2, marker = "o")
There are quite a few ways to accomplish the same thing in matplotlib. This post is a decent place to get started, although it uses the Matlab style interface which I find gets confusing very fast. The alternative (and the one I used above) is the object oriented mode where you instantiate individual figure objects and operate on them instead of using a implicit global figure object. This link does a good job showing examples of both styles: .
A tip- seaborn, which is a library for statistics visualization that uses matplotlib, will set some visual defaults for matplotlib as soon as you import it. Try importing and regenerating one of the graphs above to see the difference.
import seaborn as sns
Simple Linear Regression
Another example, you can do simple linear regression using the statsmodels library, but Pandas wraps that functionality up making it a little easier to perform when you are working with a DataFrame object.
Directly through Pandas DataFrame
from pandas.stats.api import ols
ols(x=data["Column1"], y=data["Column2]))
Statsmodels
import statsmodels.api as sm
X = sm.add_constant(data["Column1"])
model = sm.OLS(data["Column2"], X)
model.fit().summary()
You can also use the statsmodel formula api to do the same thing
model = ols('Column2 ~ Column1', data=data)
Residuals vs Fits
Generating a graph of residuals versus the fitted y values has come up a few times. I searched for a library that does this out of the box, but could not find it. Seaborn comes close with:
import seaborn as sns
sns.residplot("Column1", "Column2", data=data)
But unfortunately this creates a plot of residuals vs the x values. I think in many situations, this graph will tell you the same thing as residuals versus fitted values, but it is not what I was looking for. Here is how to create it with vanilla matplotlib:
fitted = 10 + 0.7*data["Column1"] # This just needs to be whatever the linear regrsssion equation is
fig = plt.figure()
axes = fig.add_axes([0, 0, 1.0, 1,0])
axes.axhline(color="black", ls="--") # This creates a horizon line (like abline in R)
axes.set_xlabel('Fitted Value')
axes.set_ylabel('Residuals')
axes.set_title('Residuals Versus Fits')
axes.scatter(fitted, data["Column2"] - fitted, marker = "o")
ANOVA for linear regression
Not much to say here.
from statsmodels.stats.anova import anova_lm
from statsmodels.formula.api import ols
model = ols('Column2 ~ Column1', data=data)
anova_lm(model.fit())
Thanks for reading, please leave any comments or corrections.