Reading data from Ken French's website using Python
This is a quick tutorial on using pandas.read_csv and pandas_datareader, specifically for downloading data from Ken French's website. We will extract the following datasets:
- Average value-weighted monthly returns for 10 US industry portfolios.
- Monthly returns for the 5 Fama-French risk factors.
Import the necessary modules for file management and change the working directory accordingly.
from pathlib import Path
import sys
import os

home = str(Path.home())
if sys.platform == 'linux':
    inputDir = '/datasets/indices/'
elif sys.platform == 'win32':
    inputDir = '\\datasets\\indices\\'  # double the backslashes so they are not treated as escapes
fullDir = home + inputDir
os.chdir(fullDir)
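As an aside, the platform-specific branches above can be avoided entirely with pathlib, which joins path components with the correct separator for the current OS. A minimal sketch, assuming the same `datasets/indices` folder under the home directory:

```python
from pathlib import Path

# pathlib builds the path with the right separator on Linux, macOS and Windows,
# so no sys.platform check is needed.
fullDir = Path.home() / 'datasets' / 'indices'
print(fullDir)
```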
Reading from CSVs downloaded from Ken French's website¶
import pandas as pd
As the dates in the Ken French CSVs are in the format 201008, we need to write a function to parse them. We check that an input string like 192607 is parsed correctly.
import datetime
# pd.datetime was removed in pandas 2.0; use the datetime module directly
dateparse = lambda x: datetime.datetime.strptime(x, '%Y%m')
dateparse('192607')
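On recent pandas versions, pd.to_datetime with an explicit format string is a simpler alternative to a hand-written parser. A minimal sketch:

```python
import pandas as pd

# pd.to_datetime parses the YYYYMM stamps directly, returning a pandas
# Timestamp rather than a plain datetime object.
ts = pd.to_datetime('192607', format='%Y%m')
print(ts)  # 1926-07-01 00:00:00
```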
Use read_csv from pandas:
- Skip the first 11 rows as a header.
- Parse dates when reading in the CSV.
- Use the dateparse lambda function created above.
Read the full time series of industry and risk factor data.
readDir = fullDir
readFile = readDir + 'FF_10_Industry_Portfolios.CSV'
df_10indus_m = pd.read_csv(readFile,skiprows=11,nrows=1107,index_col=0,parse_dates=True,date_parser=dateparse)
readFile = readDir + 'FF_Research_Data_5_Factors_2x3.CSV'
df_5fac_m = pd.read_csv(readFile,skiprows=3,nrows=663,index_col=0,parse_dates=True,date_parser=dateparse)
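The skiprows/index_col/date-parsing logic can be seen on a small in-memory CSV that mimics the Ken French layout (descriptive header lines followed by YYYYMM-indexed rows). The column names and values here are made up for illustration:

```python
import io
import pandas as pd

# Two header lines stand in for the descriptive preamble in the real files.
csv_text = """Header line 1
Header line 2
,NoDur,Durbl
192607,1.45,15.55
192608,3.97,3.68
"""

# Skip the preamble, use the first column as the index, then convert the
# YYYYMM integers to proper timestamps.
df = pd.read_csv(io.StringIO(csv_text), skiprows=2, index_col=0)
df.index = pd.to_datetime(df.index.astype(str), format='%Y%m')
print(df)
```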
Select the dates relevant for DeMiguel et al. (2009).
dfAsset = df_10indus_m.loc['1963-07-01':'2004-11-01']
dfFac = df_5fac_m.loc['1963-07-01':'2004-11-01']
df_10indus_m.head()
dfFac.head()
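The date selection above relies on label-based slicing with .loc, which on a DatetimeIndex is inclusive of both endpoints. A small sketch with synthetic monthly data:

```python
import numpy as np
import pandas as pd

# Twelve months of dummy data on a month-start DatetimeIndex.
idx = pd.date_range('1963-01-01', periods=12, freq='MS')
s = pd.Series(np.arange(12), index=idx)

# .loc slicing on dates includes both the start and end labels.
print(s.loc['1963-07-01':'1963-09-01'])  # rows for Jul, Aug, Sep 1963
```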
Reading directly from Ken French's website¶
import pandas_datareader.data as web # module for reading datasets directly from the web
from pandas_datareader.famafrench import get_available_datasets
import pickleshare
We extract all the datasets available from Ken French's website and find that there are 286 of them. We can optionally list them all.
datasets = get_available_datasets()
print('No. of datasets: {0}'.format(len(datasets)))
#datasets # uncomment to see all the datasets
US Industry dataset¶
We are looking for the US 10-industry dataset, so we use the keywords '10' and 'Industry' to find the names of the relevant datasets.
df_10_industry = [dataset for dataset in datasets if '10' in dataset and 'Industry' in dataset]
print(df_10_industry)
We select 10_Industry_Portfolios from July 1963 to November 2004 (as per DeMiguel et al., 2009). If you do not provide start or end dates, the default extracts portfolios from 2010 up to the latest datapoints available.
ds_industry = web.DataReader(df_10_industry[0],'famafrench',start='1963-07-01',end='2004-11-01') # Taking [0] as extracting '10_Industry_Portfolios'
Obtaining data from the datareader returns a dict, so we inspect what is inside it.
print(type(ds_industry))
ds_industry.keys()
We find keys 0-7 plus a DESCR key. Reading the contents of DESCR explains which dataset each key from 0-7 corresponds to.
print(ds_industry['DESCR'])
DeMiguel et al. (2009) use average value-weighted returns, so we will use dataset 0.
ds_industry[0].head()
Risk factor dataset¶
We perform the same process as above for the Fama-French risk factor dataset.
df_5_factor = [dataset for dataset in datasets if '5' in dataset and 'Factor' in dataset]
print(df_5_factor)
ds_factors = web.DataReader(df_5_factor[0],'famafrench',start='1963-07-01',end='2004-11-01') # Taking [0] as extracting 'F-F_Research_Data_5_Factors_2x3'
print('\nKEYS\n{0}'.format(ds_factors.keys()))
print('DATASET DESCRIPTION \n {0}'.format(ds_factors['DESCR']))
ds_factors[0].head()
We copy the industry and risk factor returns read from Ken French's website into dfAsset and dfFactor respectively, dividing by 100 to convert the returns from percentages to decimals.
dfAsset = ds_industry[0].copy()/100
dfFactor = ds_factors[0].copy()/100
We create excess returns by subtracting the risk-free rate from the asset returns.
dfXsAsset = dfAsset.sub(dfFactor['RF'],axis=0)
dfXsAsset.head()
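The .sub(..., axis=0) call subtracts the risk-free column from every asset column, aligning the two frames on their date index. A minimal sketch with made-up returns:

```python
import pandas as pd

# Hypothetical monthly returns for two industries and the risk-free rate.
assets = pd.DataFrame({'IndA': [0.02, 0.01], 'IndB': [0.03, -0.01]},
                      index=pd.to_datetime(['1963-07-01', '1963-08-01']))
factors = pd.DataFrame({'RF': [0.003, 0.003]}, index=assets.index)

# axis=0 aligns the RF series with the assets frame along the row index,
# so each asset column has the same month's risk-free rate subtracted.
excess = assets.sub(factors['RF'], axis=0)
print(excess)
```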
Pickling data¶
We pickle our dataframes so that they can be retrieved from other notebooks.
storeDir = fullDir + 'pickleshare'  # fullDir already ends with a path separator
db = pickleshare.PickleShareDB(storeDir)
db['dfXsAsset'] = dfXsAsset
db['dfFactor'] = dfFactor
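If pickleshare is not available, pandas' built-in pickling offers the same round trip. A minimal sketch, using a temporary directory in place of the storeDir above and a dummy dataframe:

```python
import tempfile
import pandas as pd

# A dummy frame standing in for dfFactor.
df = pd.DataFrame({'RF': [0.003, 0.004]})

with tempfile.TemporaryDirectory() as tmp:
    path = f'{tmp}/dfFactor.pkl'
    df.to_pickle(path)               # write to disk
    restored = pd.read_pickle(path)  # read back, e.g. from another notebook

print(restored.equals(df))  # True
```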