# Reading data from Ken French's website using Python

This quick tutorial shows how to use pandas.read_csv and pandas_datareader to download data from Ken French's website. We will extract the following datasets:

• Average value-weighted monthly returns for 10 US industry portfolios.
• Monthly returns for the 5 Fama-French risk factors.

Import the necessary modules for file management and change the working directory accordingly.

In [52]:
from pathlib import Path
import sys
import os

home = str(Path.home())

if sys.platform == 'linux':
    inputDir = '/datasets/indices/'
elif sys.platform == 'win32':
    inputDir = '\\datasets\\indices\\'

fullDir = home+inputDir
os.chdir(fullDir)
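
The platform check above covers only Linux and Windows. As a minimal sketch of an alternative, assuming the same datasets/indices folder under the home directory, pathlib can build the path with the correct separator on any platform. The name `data_dir` is introduced here for illustration only.

```python
from pathlib import Path

# Build <home>/datasets/indices; the "/" operator picks the
# correct separator for the current operating system.
data_dir = Path.home() / "datasets" / "indices"

# os.chdir(data_dir) would then work unchanged, provided the folder exists.
print(data_dir)
```

This avoids hand-written separators and an undefined `inputDir` on platforms the check does not list.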


In [53]:
import pandas as pd


Dates in the CSVs from Ken French's website are in the format 201008 (YYYYMM), so we need a function to parse them. We check that an input string like 192607 is parsed correctly.

In [54]:
from datetime import datetime  # pd.datetime is deprecated and absent from recent pandas releases
dateparse = lambda x: datetime.strptime(x, '%Y%m')
dateparse('192607')

Out[54]:
datetime.datetime(1926, 7, 1, 0, 0)
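
When a whole column of such strings needs converting at once, pd.to_datetime with an explicit format is one option. The sketch below uses the two sample strings mentioned in the text; the variable names are illustrative.

```python
import pandas as pd

# Two YYYYMM strings in the format used by the CSVs.
raw_dates = pd.Series(["192607", "201008"])

# Parse the whole series in one call; each result is the first day of the month.
parsed = pd.to_datetime(raw_dates, format="%Y%m")
print(parsed)
```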

Use read_csv from pandas:

• Skip the first 11 rows, which form a header
• Parse dates when reading the CSV
• Use the dateparse lambda function created above

Read the full timeseries of industry and risk factor data.

In [55]:
readDir = fullDir
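
The steps listed above can be sketched on a small in-memory sample. This is an illustration under stated assumptions, not the original cell: the sample text is made up (only the four return values are copied from the output further down), it has 2 header lines instead of 11, and it assumes the file holds a single table. The dates are converted after reading, which avoids relying on a date-parsing argument of read_csv.

```python
import io
import pandas as pd

# Hypothetical miniature of a Ken French CSV: descriptive header
# lines followed by a table whose first column holds YYYYMM dates.
raw = """This is a made-up sample for illustration
It mimics the layout of the downloaded CSVs
,NoDur,Durbl
192607,1.45,15.55
192608,3.97,3.68
"""

# Skip the descriptive lines and use the first column as the index.
df = pd.read_csv(io.StringIO(raw), skiprows=2, index_col=0)

# Convert the integer YYYYMM index into month-start timestamps.
df.index = pd.to_datetime(df.index.astype(str), format="%Y%m")
print(df)
```

For a real file, `io.StringIO(raw)` would be replaced by the path to the CSV and `skiprows` by the actual number of header lines.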



Select the dates relevant for DeMiguel et al. (2009)

In [56]:
dfAsset = df_10indus_m.loc['1963-07-01':'2004-11-01']
dfFac = df_5fac_m.loc['1963-07-01':'2004-11-01']
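
Label-based slicing on a date index includes both endpoints, which is why November 2004 is part of the selection. A minimal sketch on a hypothetical one-year frame (the column name `ret` and its values are placeholders):

```python
import pandas as pd

# Hypothetical monthly frame covering 1963, indexed by month-start dates.
idx = pd.date_range("1963-01-01", "1963-12-01", freq="MS")
df = pd.DataFrame({"ret": range(len(idx))}, index=idx)

# Both the start and the end label are included in the result.
subset = df.loc["1963-07-01":"1963-09-01"]
print(subset.index.min(), subset.index.max(), len(subset))
```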

In [57]:
df_10indus_m.head()

Out[57]:
NoDur Durbl Manuf Enrgy HiTec Telcm Shops Hlth Utils Other
1926-07-01 1.45 15.55 4.69 -1.18 2.90 0.83 0.11 1.77 7.04 2.16
1926-08-01 3.97 3.68 2.81 3.47 2.66 2.17 -0.71 4.25 -1.69 4.38
1926-09-01 1.14 4.80 1.15 -3.39 -0.38 2.41 0.21 0.69 2.04 0.29
1926-10-01 -1.24 -8.23 -3.63 -0.78 -4.58 -0.11 -2.29 -0.57 -2.63 -2.85
1926-11-01 5.21 -0.19 4.10 0.01 4.71 1.63 6.43 5.42 3.71 2.11

In [58]:
dfFac.head()

Out[58]:
Mkt-RF SMB HML RMW CMA RF
1963-07-01 -0.39 -0.47 -0.83 0.64 -1.15 0.27
1963-08-01 5.07 -0.78 1.67 0.34 -0.40 0.25
1963-09-01 -1.57 -0.48 0.18 -0.75 0.24 0.27
1963-10-01 2.53 -1.29 -0.10 2.74 -2.24 0.29
1963-11-01 -0.85 -0.84 1.71 -0.44 2.22 0.27

## Reading directly from Ken French's website

In [59]:
import pandas_datareader.data as web  # module for reading datasets directly from the web
from pandas_datareader.famafrench import get_available_datasets
import pickleshare


We extract all the available datasets from Ken French's website and find that there are 286 of them. We can opt to see all the datasets available.

In [60]:
datasets = get_available_datasets()
print('No. of datasets:{0}'.format(len(datasets)))
#datasets # uncomment to see all the datasets

No. of datasets:286


### US Industry dataset

We are looking for a dataset of 10 US industries, so we use the keywords '10' and 'Industry' to find the names of the relevant datasets.

In [61]:
df_10_industry = [dataset for dataset in datasets if '10' in dataset and 'Industry' in dataset]
print(df_10_industry)

['10_Industry_Portfolios', '10_Industry_Portfolios_Wout_Div', '10_Industry_Portfolios_daily']
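
The substring test above is case-sensitive, so 'industry' in lower case would match nothing. As a small sketch, a helper can make the search case-insensitive; `find_datasets` is a hypothetical function written for this example, not part of pandas_datareader.

```python
def find_datasets(names, *keywords):
    """Return the names containing every keyword, ignoring case."""
    return [n for n in names
            if all(k.lower() in n.lower() for k in keywords)]

# A few names copied from the outputs in this tutorial.
sample = [
    "10_Industry_Portfolios",
    "10_Industry_Portfolios_Wout_Div",
    "10_Industry_Portfolios_daily",
    "F-F_Research_Data_5_Factors_2x3",
]

print(find_datasets(sample, "10", "industry"))
```

In the notebook, `sample` would be replaced by the `datasets` list returned by get_available_datasets().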


We select 10_Industry_Portfolios from July 1963 to November 2004 (as per DeMiguel et al., 2009). If you omit the start or end dates, the reader falls back to its defaults, which in the version used here extract portfolios from 2010 to the latest data point available.

In [62]:
ds_industry = web.DataReader(df_10_industry[0],'famafrench',start='1963-07-01',end='2004-11-01') # Taking [0] as extracting '10_Industry_Portfolios'


The datareader returns a dict, so we inspect its type and keys.

In [63]:
print(type(ds_industry))
ds_industry.keys()

<class 'dict'>

Out[63]:
dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 'DESCR'])

We find that there are keys from 0-7, and also a DESCR. We read the contents of DESCR and find that it explains what dataset each key from 0-7 corresponds to.

In [64]:
print(ds_industry['DESCR'])

10 Industry Portfolios
----------------------

This file was created by CMPT_IND_RETS using the 201812 CRSP database. It contains value- and equal-weighted returns for 10 industry portfolios. The portfolios are constructed at the end of June. The annual returns are from January to December. Missing data are indicated by -99.99 or -999. Copyright 2018 Kenneth R. French

0 : Average Value Weighted Returns -- Monthly (497 rows x 10 cols)
1 : Average Equal Weighted Returns -- Monthly (497 rows x 10 cols)
2 : Average Value Weighted Returns -- Annual (42 rows x 10 cols)
3 : Average Equal Weighted Returns -- Annual (42 rows x 10 cols)
4 : Number of Firms in Portfolios (497 rows x 10 cols)
5 : Average Firm Size (497 rows x 10 cols)
6 : Sum of BE / Sum of ME (42 rows x 10 cols)
7 : Value-Weighted Average of BE/ME (42 rows x 10 cols)
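
As a quick sanity check, the 497 monthly rows reported above should equal the number of months from July 1963 to November 2004, counting both endpoints. The arithmetic can be written out as:

```python
start_year, start_month = 1963, 7
end_year, end_month = 2004, 11

# Count months inclusively from July 1963 to November 2004.
n_months = (end_year - start_year) * 12 + (end_month - start_month) + 1
print(n_months)  # 497
```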


DeMiguel et al. (2009) use average value-weighted returns, so we use dataset 0.

In [65]:
ds_industry[0].head()

Out[65]:
NoDur Durbl Manuf Enrgy HiTec Telcm Shops Hlth Utils Other
Date
1963-07 -0.49 -0.22 -1.41 2.29 -0.69 -0.23 -1.03 0.56 0.80 -1.61
1963-08 4.89 6.55 6.20 3.93 5.14 4.29 6.43 9.56 4.20 5.49
1963-09 -1.69 -0.24 -0.76 -3.64 0.13 2.36 0.97 -4.06 -2.50 -3.16
1963-10 2.65 9.72 2.58 -0.32 8.29 3.40 0.52 3.38 -0.67 1.38
1963-11 -1.13 -4.84 0.30 -1.15 -0.29 4.16 -1.23 -1.65 -1.02 0.23
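
The index printed above shows months such as 1963-07 rather than full dates like the 1963-07-01 labels of the CSV-based frames, which suggests a monthly period index (an assumption based on the printed output). If the two sources need to share the same index type, a sketch of the conversion on a stand-in frame looks like this:

```python
import pandas as pd

# Stand-in for ds_industry[0]: two months of NoDur returns from the output above.
monthly = pd.DataFrame(
    {"NoDur": [-0.49, 4.89]},
    index=pd.period_range("1963-07", periods=2, freq="M"),
)

# Convert each monthly period to the timestamp of its first day.
stamped = monthly.copy()
stamped.index = monthly.index.to_timestamp()
print(stamped)
```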

### Risk factor dataset

We repeat the same process for the Fama-French risk factor dataset.

In [66]:
df_5_factor = [dataset for dataset in datasets if '5' in dataset and 'Factor' in dataset]
print(df_5_factor)
ds_factors = web.DataReader(df_5_factor[0],'famafrench',start='1963-07-01',end='2004-11-01') # Taking [0] to extract 'F-F_Research_Data_5_Factors_2x3'
print('\nKEYS\n{0}'.format(ds_factors.keys()))
print('DATASET DESCRIPTION \n {0}'.format(ds_factors['DESCR']))
ds_factors[0].head()

['F-F_Research_Data_5_Factors_2x3', 'F-F_Research_Data_5_Factors_2x3_daily', 'Global_5_Factors', 'Global_5_Factors_Daily', 'Global_ex_US_5_Factors', 'Global_ex_US_5_Factors_Daily', 'Europe_5_Factors', 'Europe_5_Factors_Daily', 'Japan_5_Factors', 'Japan_5_Factors_Daily', 'Asia_Pacific_ex_Japan_5_Factors', 'Asia_Pacific_ex_Japan_5_Factors_Daily', 'North_America_5_Factors', 'North_America_5_Factors_Daily']

KEYS
dict_keys([0, 1, 'DESCR'])
DATASET DESCRIPTION
F-F Research Data 5 Factors 2x3
-------------------------------

This file was created by CMPT_ME_BEME_OP_INV_RETS using the 201812 CRSP database. The 1-month TBill return is from Ibbotson and Associates Inc.

0 : (497 rows x 6 cols)
1 : Annual Factors: January-December (41 rows x 6 cols)

Out[66]:
Mkt-RF SMB HML RMW CMA RF
Date
1963-07 -0.39 -0.47 -0.83 0.66 -1.15 0.27
1963-08 5.07 -0.79 1.67 0.39 -0.40 0.25
1963-09 -1.57 -0.48 0.18 -0.76 0.24 0.27
1963-10 2.53 -1.29 -0.10 2.75 -2.24 0.29
1963-11 -0.85 -0.84 1.71 -0.45 2.22 0.27

We copy the industry and risk factor returns read from Ken French's website into dfAsset and dfFactor respectively, dividing by 100 to convert percentages to decimal returns.

In [67]:
dfAsset = ds_industry[0].copy()/100
dfFactor = ds_factors[0].copy()/100
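
The dataset description notes that missing data are flagged as -99.99 or -999. Dividing such flags by 100 would turn them into plausible-looking returns, so it may be worth replacing them with NaN first. A minimal sketch on a hypothetical frame (the -99.99 entry is invented to show the effect; the other values come from the outputs above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame in percent, with one missing value flagged as -99.99.
pct = pd.DataFrame({"NoDur": [-0.49, -99.99], "Durbl": [-0.22, 6.55]})

# Replace the sentinels with NaN first, then convert percent to decimals.
clean = pct.replace([-99.99, -999], np.nan) / 100
print(clean)
```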


We create excess returns by subtracting the risk-free rate from the asset returns

In [68]:
dfXsAsset = dfAsset.sub(dfFactor['RF'],axis=0)
dfXsAsset.head()

Out[68]:
NoDur Durbl Manuf Enrgy HiTec Telcm Shops Hlth Utils Other
Date
1963-07 -0.0076 -0.0049 -0.0168 0.0202 -0.0096 -0.0050 -0.0130 0.0029 0.0053 -0.0188
1963-08 0.0464 0.0630 0.0595 0.0368 0.0489 0.0404 0.0618 0.0931 0.0395 0.0524
1963-09 -0.0196 -0.0051 -0.0103 -0.0391 -0.0014 0.0209 0.0070 -0.0433 -0.0277 -0.0343
1963-10 0.0236 0.0943 0.0229 -0.0061 0.0800 0.0311 0.0023 0.0309 -0.0096 0.0109
1963-11 -0.0140 -0.0511 0.0003 -0.0142 -0.0056 0.0389 -0.0150 -0.0192 -0.0129 -0.0004
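
To see what the subtraction does, the first two rows can be reproduced by hand. The sketch below rebuilds small stand-ins for the asset and risk-free data from the percentages printed earlier; the names `assets`, `rf` and `excess` are illustrative.

```python
import pandas as pd

idx = pd.period_range("1963-07", periods=2, freq="M")

# Values in percent, copied from the outputs above, converted to decimals.
assets = pd.DataFrame({"NoDur": [-0.49, 4.89], "Durbl": [-0.22, 6.55]}, index=idx) / 100
rf = pd.Series([0.27, 0.25], index=idx, name="RF") / 100

# axis=0 aligns the series with the row index, so each month's
# risk-free rate is subtracted from every column in that row.
excess = assets.sub(rf, axis=0)
print(excess.round(4))
```

For July 1963, NoDur gives -0.0049 - 0.0027 = -0.0076, matching the first row of the output.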

## Pickling data

We pickle the data so that it can be retrieved from other notebooks.

In [69]:
storeDir = fullDir+'/pickleshare'

db = pickleshare.PickleShareDB(storeDir)
db['dfXsAsset'] = dfXsAsset
db['dfFactor'] = dfFactor
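
With pickleshare, reading is expected to mirror writing: open a PickleShareDB on the same directory and index it by key. As a hedged alternative that needs only pandas (this is not the approach used above), a frame can also be written and read back with to_pickle and read_pickle. The file name and the temporary directory below are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for dfFactor.
frame = pd.DataFrame({"RF": [0.0027, 0.0025]})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "dfFactor.pkl"   # hypothetical file name
    frame.to_pickle(target)               # write in the first notebook
    restored = pd.read_pickle(target)     # read back in another notebook

print(restored.equals(frame))
```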