We go through a quick tutorial on using pandas.read_csv and the pandas_datareader specifically for downloading data from Ken French's website. We will extract the following datasets

10 US industry data of average value-weighted monthly returns.
5 Fama-French risk factor monthly returns.

Import the necessary modules for file management and change the working directories accordingly

In [52]:

from pathlib import Path
import sys
import os

home = str(Path.home())

if sys.platform == 'linux':
    inputDir = '/datasets/indices/' 
elif sys.platform == 'win32':
    inputDir = '\\datasets\indices\\' 

fullDir = home+inputDir
os.chdir(fullDir)

Reading from CSVs downloaded from Ken French's website¶

In [53]:

import pandas as pd

As the dates from the Ken French website CSVs are in the format 201008, we need to write a function to parse this data. We show that parsing a input string like 192607 is parsed correctly.

In [54]:

dateparse = lambda x: pd.datetime.strptime(x,'%Y%m') 
dateparse('192607')

Out[54]:

datetime.datetime(1926, 7, 1, 0, 0)

Use read_csv from pandas

Skip the first 11 rows as a header
When reading in the csv, parse dates
Use the dataparse lambda function that was created

Read the full timeseries of industry and risk factor data.

In [55]:

readDir = fullDir
readFile = readDir + 'FF_10_Industry_Portfolios.CSV' 
df_10indus_m = pd.read_csv(readFile,skiprows=11,nrows=1107,index_col=0,parse_dates=True,date_parser=dateparse) 


readFile = readDir + 'FF_Research_Data_5_Factors_2x3.CSV' 
df_5fac_m = pd.read_csv(readFile,skiprows=3,nrows=663,index_col=0,parse_dates=True,date_parser=dateparse)

Select the dates relevant for DeMiguel et al. (2009)

In [56]:

dfAsset = df_10indus_m.loc['1963-07-01':'2004-11-01']
dfFac = df_5fac_m.loc['1963-07-01':'2004-11-01']

In [57]:

df_10indus_m.head()

Out[57]:

	NoDur	Durbl	Manuf	Enrgy	HiTec	Telcm	Shops	Hlth	Utils	Other
1926-07-01	1.45	15.55	4.69	-1.18	2.90	0.83	0.11	1.77	7.04	2.16
1926-08-01	3.97	3.68	2.81	3.47	2.66	2.17	-0.71	4.25	-1.69	4.38
1926-09-01	1.14	4.80	1.15	-3.39	-0.38	2.41	0.21	0.69	2.04	0.29
1926-10-01	-1.24	-8.23	-3.63	-0.78	-4.58	-0.11	-2.29	-0.57	-2.63	-2.85
1926-11-01	5.21	-0.19	4.10	0.01	4.71	1.63	6.43	5.42	3.71	2.11

In [58]:

dfFac.head()

Out[58]:

	Mkt-RF	SMB	HML	RMW	CMA	RF
1963-07-01	-0.39	-0.47	-0.83	0.64	-1.15	0.27
1963-08-01	5.07	-0.78	1.67	0.34	-0.40	0.25
1963-09-01	-1.57	-0.48	0.18	-0.75	0.24	0.27
1963-10-01	2.53	-1.29	-0.10	2.74	-2.24	0.29
1963-11-01	-0.85	-0.84	1.71	-0.44	2.22	0.27

Reading directly from Ken French's website¶

In [59]:

import pandas_datareader.data as web  # module for reading datasets directly from the web
from pandas_datareader.famafrench import get_available_datasets
import pickleshare

We extract all the available datasets from Ken French's website and find that there are 286 of them. We can opt to see all the datasets available.

In [60]:

datasets = get_available_datasets()
print('No. of datasets:{0}'.format(len(datasets)))
#datasets # comment out if you want to see all the datasets

No. of datasets:286

US Industry dataset¶

We are looking for a dataset of US 10 industries, thus use the keywords '10' and 'industry' to find out what the names of the relevant datasets.

In [61]:

df_10_industry = [dataset for dataset in datasets if '10' in dataset and 'Industry' in dataset]
print(df_10_industry)

['10_Industry_Portfolios', '10_Industry_Portfolios_Wout_Div', '10_Industry_Portfolios_daily']

We select 10_Industry_Portfolios from July 1963 to November 2004 (as per DeMiguel et al., 2009) If you do not have start or end dates, the default will extract portfolios from 2010 to the latest datapoints available

In [62]:

ds_industry = web.DataReader(df_10_industry[0],'famafrench',start='1963-07-01',end='2004-11-01') # Taking [0] as extracting '10_Industry_Portfolios'

Obtaining data from the datareader returns a dict. Thus we want to see what is inside the dict.

In [63]:

print(type(ds_industry))
ds_industry.keys()

<class 'dict'>

Out[63]:

dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 'DESCR'])

We find that there are keys from 0-7, and also a DESCR. We read the contents of DESCR and find that it explains what dataset each key from 0-7 corresponds to.

In [64]:

print(ds_industry['DESCR'])

10 Industry Portfolios
----------------------

This file was created by CMPT_IND_RETS using the 201812 CRSP database. It contains value- and equal-weighted returns for 10 industry portfolios. The portfolios are constructed at the end of June. The annual returns are from January to December. Missing data are indicated by -99.99 or -999. Copyright 2018 Kenneth R. French

  0 : Average Value Weighted Returns -- Monthly (497 rows x 10 cols)
  1 : Average Equal Weighted Returns -- Monthly (497 rows x 10 cols)
  2 : Average Value Weighted Returns -- Annual (42 rows x 10 cols)
  3 : Average Equal Weighted Returns -- Annual (42 rows x 10 cols)
  4 : Number of Firms in Portfolios (497 rows x 10 cols)
  5 : Average Firm Size (497 rows x 10 cols)
  6 : Sum of BE / Sum of ME (42 rows x 10 cols)
  7 : Value-Weighted Average of BE/ME (42 rows x 10 cols)

DeMiguel et al. (2009), use average value-weighted returns, thus we will use dataset 0.

In [65]:

ds_industry[0].head()

Out[65]:

	NoDur	Durbl	Manuf	Enrgy	HiTec	Telcm	Shops	Hlth	Utils	Other
Date
1963-07	-0.49	-0.22	-1.41	2.29	-0.69	-0.23	-1.03	0.56	0.80	-1.61
1963-08	4.89	6.55	6.20	3.93	5.14	4.29	6.43	9.56	4.20	5.49
1963-09	-1.69	-0.24	-0.76	-3.64	0.13	2.36	0.97	-4.06	-2.50	-3.16
1963-10	2.65	9.72	2.58	-0.32	8.29	3.40	0.52	3.38	-0.67	1.38
1963-11	-1.13	-4.84	0.30	-1.15	-0.29	4.16	-1.23	-1.65	-1.02	0.23

Risk factor dataset¶

We perform the same process above Fama-French risk factor portfolio dataset

In [66]:

df_5_factor = [dataset for dataset in datasets if '5' in dataset and 'Factor' in dataset]
print(df_5_factor)
ds_factors = web.DataReader(df_5_factor[0],'famafrench',start='1963-07-01',end='2004-11-01') # Taking [0] as extracting 1F-F-Research_Data_Factors_2x3')
print('\nKEYS\n{0}'.format(ds_factors.keys()))
print('DATASET DESCRIPTION \n {0}'.format(ds_factors['DESCR']))
ds_factors[0].head()

['F-F_Research_Data_5_Factors_2x3', 'F-F_Research_Data_5_Factors_2x3_daily', 'Global_5_Factors', 'Global_5_Factors_Daily', 'Global_ex_US_5_Factors', 'Global_ex_US_5_Factors_Daily', 'Europe_5_Factors', 'Europe_5_Factors_Daily', 'Japan_5_Factors', 'Japan_5_Factors_Daily', 'Asia_Pacific_ex_Japan_5_Factors', 'Asia_Pacific_ex_Japan_5_Factors_Daily', 'North_America_5_Factors', 'North_America_5_Factors_Daily']

KEYS
dict_keys([0, 1, 'DESCR'])
DATASET DESCRIPTION 
 F-F Research Data 5 Factors 2x3
-------------------------------

This file was created by CMPT_ME_BEME_OP_INV_RETS using the 201812 CRSP database. The 1-month TBill return is from Ibbotson and Associates Inc.

  0 : (497 rows x 6 cols)
  1 : Annual Factors: January-December (41 rows x 6 cols)

Out[66]:

	Mkt-RF	SMB	HML	RMW	CMA	RF
Date
1963-07	-0.39	-0.47	-0.83	0.66	-1.15	0.27
1963-08	5.07	-0.79	1.67	0.39	-0.40	0.25
1963-09	-1.57	-0.48	0.18	-0.76	0.24	0.27
1963-10	2.53	-1.29	-0.10	2.75	-2.24	0.29
1963-11	-0.85	-0.84	1.71	-0.45	2.22	0.27

We create copies of the industry and risk factor returns that we read from Ken French's website into dfAsset and dfFactor respectively.

In [67]:

dfAsset = ds_industry[0].copy()/100
dfFactor = ds_factors[0].copy()/100

We create excess returns by subtracting the risk-free rate from the asset returns

In [68]:

dfXsAsset = dfAsset.sub(dfFactor['RF'],axis=0)
dfXsAsset.head()

Out[68]:

	NoDur	Durbl	Manuf	Enrgy	HiTec	Telcm	Shops	Hlth	Utils	Other
Date
1963-07	-0.0076	-0.0049	-0.0168	0.0202	-0.0096	-0.0050	-0.0130	0.0029	0.0053	-0.0188
1963-08	0.0464	0.0630	0.0595	0.0368	0.0489	0.0404	0.0618	0.0931	0.0395	0.0524
1963-09	-0.0196	-0.0051	-0.0103	-0.0391	-0.0014	0.0209	0.0070	-0.0433	-0.0277	-0.0343
1963-10	0.0236	0.0943	0.0229	-0.0061	0.0800	0.0311	0.0023	0.0309	-0.0096	0.0109
1963-11	-0.0140	-0.0511	0.0003	-0.0142	-0.0056	0.0389	-0.0150	-0.0192	-0.0129	-0.0004

Pickling data¶

We pickle our files that now can be retrieved from other notebooks.

In [69]:

storeDir = fullDir+'/pickleshare'

db = pickleshare.PickleShareDB(storeDir)
db['dfXsAsset'] = dfXsAsset
db['dfFactor'] = dfFactor

Reading from CSVs downloaded from Ken French's website¶

Reading directly from Ken French's website¶

US Industry dataset¶

Risk factor dataset¶

Pickling data¶

Comments