Working with Kaggle datasets

Exploring the kaggle command API

The Kaggle CLI groups its commands as follows; the letter in brackets is a shorthand alias.

  • competitions (c) : explore the available competitions.
  • datasets (d) : explore the available datasets.
  • kernels (k) : explore the available kernels.
  • config : view or change configuration settings.
In [1]:
!kaggle -h
usage: kaggle [-h] [-v] {competitions,c,datasets,d,kernels,k,config} ...

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

commands:
  {competitions,c,datasets,d,kernels,k,config}
                        Use one of:
                        competitions {list, files, download, submit, submissions, leaderboard}
                        datasets {list, files, download, create, version, init, metadata, status}
                        config {view, set, unset}
    competitions (c)    Commands related to Kaggle competitions
    datasets (d)        Commands related to Kaggle datasets
    kernels (k)         Commands related to Kaggle kernels
    config              Configuration settings
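
The command structure in the help output (a command group, then an action, then arguments) can also be assembled from Python. The following is a minimal sketch, not part of the Kaggle API: the helper name `kaggle_argv` is hypothetical, it only knows the actions printed in the help text above (so `kernels` is left out), and it only builds the argument list without running anything.

```python
import shlex

# Command groups and actions copied from the `kaggle -h` output above.
KAGGLE_ACTIONS = {
    "competitions": {"list", "files", "download", "submit", "submissions", "leaderboard"},
    "datasets": {"list", "files", "download", "create", "version", "init", "metadata", "status"},
    "config": {"view", "set", "unset"},
}
# Short aliases shown in the help output.
ALIASES = {"c": "competitions", "d": "datasets"}


def kaggle_argv(group, action, *args):
    """Build the argument list for a kaggle CLI call (hypothetical helper)."""
    group = ALIASES.get(group, group)
    if group not in KAGGLE_ACTIONS:
        raise ValueError(f"unknown command group: {group}")
    if action not in KAGGLE_ACTIONS[group]:
        raise ValueError(f"unknown action for {group}: {action}")
    return ["kaggle", group, action, *args]


argv = kaggle_argv("c", "list", "--category", "featured")
print(shlex.join(argv))  # the command line as it would be typed in a shell
```

If the `kaggle` CLI is installed and configured, the resulting list could be passed to `subprocess.run(argv, check=True)`.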

Exploring kaggle competitions

Listing the competitions returns a table with the following columns.

  • ref: Name of the competition.
  • deadline: Submission deadline for the competition.
  • category: Type of competition, one of the following
    • featured: Competitions that usually carry prize money (e.g., $25k-$100k), typically split between the top 3 teams.
    • playground: Competitions intended for play and exploration.
    • gettingStarted: Competitions for beginners who want to learn new skills.
    • masters: Competitions aimed at machine learning experts, with larger prizes (e.g., $70k-$125k in the listing below).
    • research: Competitions run for research purposes.
    • recruitment: Competitions that firms use to identify new hires, so entering them can build up your profile.
  • reward: What the competition offers, either the total prize money or a non-cash reward such as Knowledge, Kudos, or Swag.
  • teamCount: Number of teams that have entered the competition.
  • userHasEntered: Whether you have entered the competition.
In [12]:
!kaggle c list
ref                                            deadline             category            reward  teamCount  userHasEntered  
---------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
digit-recognizer                               2030-01-01 00:00:00  Getting Started  Knowledge       2689           False  
titanic                                        2030-01-01 00:00:00  Getting Started  Knowledge      10416            True  
house-prices-advanced-regression-techniques    2030-01-01 00:00:00  Getting Started  Knowledge       4480            True  
imagenet-object-localization-challenge         2029-12-31 07:00:00  Research         Knowledge         31           False  
competitive-data-science-predict-future-sales  2019-12-31 23:59:00  Playground           Kudos       2125           False  
two-sigma-financial-news                       2019-07-15 23:59:00  Featured          $100,000       2901           False  
LANL-Earthquake-Prediction                     2019-06-03 23:59:00  Research           $50,000        282           False  
histopathologic-cancer-detection               2019-03-30 23:59:00  Playground       Knowledge        406           False  
petfinder-adoption-prediction                  2019-03-28 23:59:00  Featured           $25,000        588           False  
vsb-power-line-fault-detection                 2019-03-21 23:59:00  Featured           $25,000        396           False  
microsoft-malware-prediction                   2019-03-13 23:59:00  Research           $25,000       1006           False  
humpback-whale-identification                  2019-02-28 23:59:00  Featured           $25,000       1224           False  
elo-merchant-category-recommendation           2019-02-26 23:59:00  Featured           $50,000       2613           False  
ga-customer-revenue-prediction                 2019-02-15 23:59:00  Featured           $45,000       1104           False  
reducing-commercial-aviation-fatalities        2019-02-12 23:59:00  Playground            Swag         70           False  
quora-insincere-questions-classification       2019-02-05 23:59:00  Featured           $25,000       3358           False  
pubg-finish-placement-prediction               2019-01-30 23:59:00  Playground            Swag       1378           False  
20-newsgroups-ciphertext-challenge             2019-01-16 23:59:00  Playground            Swag        136           False  
human-protein-atlas-image-classification       2019-01-10 23:59:00  Featured           $37,000       2172           False  
traveling-santa-2018-prime-paths               2019-01-10 23:59:00  Featured           $25,000       1874           False  
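
To work with a listing like this in Python rather than by eye, the printed table can be split into records. The sketch below is one possible approach, not something the Kaggle CLI provides: `parse_table` is a hypothetical helper, and it assumes that columns are separated by at least two spaces and that no cell contains two consecutive spaces. The sample rows are copied from the output above.

```python
import re

# A few rows copied from the `kaggle c list` output above.
SAMPLE = """\
ref                       deadline             category            reward  teamCount  userHasEntered
------------------------  -------------------  ---------------  ---------  ---------  --------------
digit-recognizer          2030-01-01 00:00:00  Getting Started  Knowledge       2689           False
titanic                   2030-01-01 00:00:00  Getting Started  Knowledge      10416            True
two-sigma-financial-news  2019-07-15 23:59:00  Featured          $100,000       2901           False
"""


def parse_table(text):
    """Turn a printed table into a list of dicts keyed by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for line in lines[2:]:  # lines[1] is the dashed separator row
        cells = re.split(r"\s{2,}", line.strip())
        rows.append(dict(zip(header, cells)))
    return rows


rows = parse_table(SAMPLE)
# Competitions that the current user has already entered.
entered = [row["ref"] for row in rows if row["userHasEntered"] == "True"]
print(entered)
```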

gettingStarted competitions are shown below

In [9]:
!kaggle c list --category gettingStarted
ref                                          deadline             category            reward  teamCount  userHasEntered  
-------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
digit-recognizer                             2030-01-01 00:00:00  Getting Started  Knowledge       2689           False  
titanic                                      2030-01-01 00:00:00  Getting Started  Knowledge      10416            True  
house-prices-advanced-regression-techniques  2030-01-01 00:00:00  Getting Started  Knowledge       4480            True  
facial-keypoints-detection                   2017-01-07 00:00:00  Getting Started  Knowledge        175           False  
street-view-getting-started-with-julia       2017-01-07 00:00:00  Getting Started  Knowledge         56           False  
word2vec-nlp-tutorial                        2015-06-30 23:59:00  Getting Started  Knowledge        578           False  
data-science-london-scikit-learn             2014-12-31 23:59:00  Getting Started  Knowledge        191           False  
just-the-basics-the-after-party              2013-03-01 01:00:00  Getting Started  Knowledge         48           False  
just-the-basics-strata-2013                  2013-02-26 20:30:00  Getting Started  Knowledge         49           False  

masters competitions are shown below

In [10]:
!kaggle c list --category masters
ref                                           deadline             category    reward  teamCount  userHasEntered  
--------------------------------------------  -------------------  --------  --------  ---------  --------------  
risky-business                                2014-06-04 23:59:00  Masters   $100,000         44           False  
genentech-flu-forecasting                     2014-03-03 23:59:00  Masters   $125,000         50           False  
deloitte-churn-prediction                     2013-12-21 23:59:00  Masters    $70,000         37           False  
mastercard-data-cleansing-competition-finals  2013-08-07 23:59:00  Masters   $100,000          6           False  
RxVolumePrediction                            2013-02-04 00:00:00  Masters        USD         12           False  
customer-retention                            2012-12-12 00:00:00  Masters        USD         12           False  

playground competitions are shown below

In [11]:
!kaggle c list --category playground
ref                                            deadline             category       reward  teamCount  userHasEntered  
---------------------------------------------  -------------------  ----------  ---------  ---------  --------------  
competitive-data-science-predict-future-sales  2019-12-31 23:59:00  Playground      Kudos       2125           False  
histopathologic-cancer-detection               2019-03-30 23:59:00  Playground  Knowledge        406           False  
reducing-commercial-aviation-fatalities        2019-02-12 23:59:00  Playground       Swag         70           False  
pubg-finish-placement-prediction               2019-01-30 23:59:00  Playground       Swag       1378           False  
20-newsgroups-ciphertext-challenge             2019-01-16 23:59:00  Playground       Swag        136           False  
dont-call-me-turkey                            2018-11-26 23:59:00  Playground       Swag        267           False  
new-york-city-taxi-fare-prediction             2018-09-25 23:59:00  Playground  Knowledge       1488           False  
forest-cover-type-kernels-only                 2018-09-24 23:59:00  Playground  Knowledge        359           False  
demand-forecasting-kernels-only                2018-09-24 23:59:00  Playground  Knowledge        462           False  
whats-cooking-kernels-only                     2018-09-24 23:59:00  Playground  Knowledge        523           False  
flavours-of-physics-kernels-only               2018-09-24 23:59:00  Playground  Knowledge         64           False  
movie-review-sentiment-analysis-kernels-only   2018-09-24 23:59:00  Playground  Knowledge        410           False  
costa-rican-household-poverty-prediction       2018-09-19 23:59:00  Playground       Swag        619           False  
whale-categorization-playground                2018-07-09 23:59:00  Playground      Kudos        528           False  
donorschoose-application-screening             2018-04-25 23:59:00  Playground       Swag        581           False  
plant-seedlings-classification                 2018-03-12 23:59:00  Playground      Kudos        836           False  
dog-breed-identification                       2018-02-26 23:59:00  Playground      Kudos       1286           False  
spooky-author-identification                   2017-12-15 23:59:00  Playground    $25,000       1244           False  
nyc-taxi-trip-duration                         2017-09-15 23:59:00  Playground    $30,000       1257           False  
invasive-species-monitoring                    2017-08-15 23:59:00  Playground  Knowledge        513           False  

featured competitions are shown below

In [3]:
!kaggle competitions list --category featured
ref                                              deadline             category    reward  teamCount  userHasEntered  
-----------------------------------------------  -------------------  --------  --------  ---------  --------------  
two-sigma-financial-news                         2019-07-15 23:59:00  Featured  $100,000       2901           False  
petfinder-adoption-prediction                    2019-03-28 23:59:00  Featured   $25,000        588           False  
vsb-power-line-fault-detection                   2019-03-21 23:59:00  Featured   $25,000        396           False  
humpback-whale-identification                    2019-02-28 23:59:00  Featured   $25,000       1224           False  
elo-merchant-category-recommendation             2019-02-26 23:59:00  Featured   $50,000       2611           False  
ga-customer-revenue-prediction                   2019-02-15 23:59:00  Featured   $45,000       1104           False  
quora-insincere-questions-classification         2019-02-05 23:59:00  Featured   $25,000       3356           False  
human-protein-atlas-image-classification         2019-01-10 23:59:00  Featured   $37,000       2172           False  
traveling-santa-2018-prime-paths                 2019-01-10 23:59:00  Featured   $25,000       1874           False  
PLAsTiCC-2018                                    2018-12-17 23:59:00  Featured   $25,000       1094           False  
quickdraw-doodle-recognition                     2018-12-04 23:59:00  Featured   $25,000       1316           False  
airbus-ship-detection                            2018-11-14 23:59:00  Featured   $60,000        884           False  
rsna-pneumonia-detection-challenge               2018-10-31 23:59:00  Featured   $30,000       1499           False  
tgs-salt-identification-challenge                2018-10-19 23:59:00  Featured  $100,000       3234           False  
google-ai-open-images-object-detection-track     2018-08-30 23:59:00  Featured   $30,000        454           False  
google-ai-open-images-visual-relationship-track  2018-08-30 23:59:00  Featured   $20,000        232           False  
home-credit-default-risk                         2018-08-29 23:59:00  Featured   $70,000       7198            True  
santander-value-prediction-challenge             2018-08-20 23:59:00  Featured   $60,000       4484           False  
trackml-particle-identification                  2018-08-13 23:59:00  Featured   $25,000        656           False  
youtube8m-2018                                   2018-08-06 23:59:00  Featured   $25,000        312           False  
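
The reward column mixes cash prizes (such as $25,000) with non-cash rewards (such as Knowledge or Swag), so ranking competitions by prize needs a small conversion step. The following sketch assumes cash rewards always look like a dollar sign followed by digits and commas; `reward_to_dollars` is a hypothetical helper, and the four entries are copied from the featured listing above.

```python
def reward_to_dollars(reward):
    """Return a cash prize as an int, or None for non-cash rewards."""
    if not reward.startswith("$"):
        return None  # e.g. Knowledge, Kudos, Swag
    return int(reward[1:].replace(",", ""))


# (ref, reward) pairs copied from the featured listing above.
featured = [
    ("two-sigma-financial-news", "$100,000"),
    ("petfinder-adoption-prediction", "$25,000"),
    ("airbus-ship-detection", "$60,000"),
    ("home-credit-default-risk", "$70,000"),
]

# Sort from the largest prize to the smallest.
ranked = sorted(featured, key=lambda item: reward_to_dollars(item[1]), reverse=True)
for ref, reward in ranked:
    print(f"{ref:32s} {reward}")
```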

Example: View the titanic dataset leaderboard

We can view the best score of each team or user in a specific competition. Many teams have a top score of 1.0, which means their predictions are perfect!

In [13]:
!kaggle competitions leaderboard titanic --show
 teamId  teamName                                                         submissionDate       score    
-------  ---------------------------------------------------------------  -------------------  -------  
2393228  Huanhuan                                                         2018-11-23 04:50:19  1.00000  
2390269  vis pe                                                           2018-11-24 03:24:56  1.00000  
2409855  peterdog                                                         2018-11-24 19:06:10  1.00000  
2376928  Dhruva Adike                                                     2018-12-01 14:54:20  1.00000  
2479377  xxxhhhmmm                                                        2018-12-03 23:11:14  1.00000  
2482519  chikooo                                                          2018-12-04 07:59:19  1.00000  
 619228  Ben Zhai                                                         2018-12-04 12:15:13  1.00000  
2515040  being lost2                                                      2018-12-11 15:57:43  1.00000  
1975214  being lost                                                       2018-12-12 03:12:43  1.00000  
1939275  Mitsuyama Gauss                                                  2018-12-14 06:22:43  1.00000  
 657945  TingHsi                                                          2018-12-22 14:40:25  1.00000  
2570532  Belkhiri                                                         2018-12-24 18:03:59  1.00000  
2573467  saathvik                                                         2018-12-25 17:22:08  1.00000  
1730345  wanghaoran                                                       2018-12-27 12:40:09  1.00000  
 777168  YouHan Lee                                                       2018-12-30 17:47:22  1.00000  
2603099  tu                                                               2019-01-06 16:19:51  1.00000  
 634867  VLMVLM                                                           2019-01-13 13:44:11  1.00000  
2315844  signinhere                                                       2019-01-13 01:04:09  1.00000  
2614517  Oh my dog                                                        2019-01-14 01:39:07  1.00000  
2644265  tfgz tf gzder tfgz                                               2019-01-14 13:47:54  1.00000  
2424863  OM Bharatiya                                                     2018-11-23 17:41:22  0.99521  
2515893  longxingchen                                                     2018-12-11 17:07:55  0.99521  
2262642  NUDT丁兆云DM课程2018ZDY                                               2018-12-02 12:37:07  0.99043  
2581969  Luanpeii                                                         2019-01-06 00:50:14  0.99043  
2399327  testing01                                                        2018-11-22 06:18:45  0.97607  
 357660  sorrowise                                                        2018-11-24 11:09:39  0.97607  
2597178  yinxuanyu                                                        2019-01-02 09:45:11  0.97607  
2482814  zsnwazi                                                          2019-01-03 12:30:08  0.97607  
2353616  aries4011                                                        2018-11-30 23:56:43  0.97129  
 444510  Sulaimon  A. Afolabi                                             2019-01-01 08:21:50  0.96172  
 830908  Seray Beser                                                      2018-11-21 12:19:28  0.95215  
 846998  nacun                                                            2018-11-27 08:55:02  0.94258  
2567135  Shobana Athiappan                                                2018-12-24 14:52:36  0.91866  
2055988  Hsu德明                                                            2018-11-15 02:37:35  0.91387  
1277179  LDPy                                                             2018-12-03 10:42:05  0.91387  
1753299  akshay00                                                         2018-11-14 19:18:02  0.90909  
1121807  dmy1995                                                          2018-11-15 02:50:26  0.90909  
2065360  Noushad Ali                                                      2018-11-15 05:28:02  0.90909  
2358700  WeTrainOnTestData                                                2018-11-16 04:45:29  0.90909  
2394826  PONG                                                             2018-11-18 14:26:44  0.90909  
2380062  Khalil Lazraq                                                    2018-11-20 00:10:34  0.90909  
1794546  rp1611                                                           2018-11-22 23:20:32  0.90909  
 221819  Mathurin Aché                                                    2018-11-20 20:07:21  0.90909  
1113997  YasinSancaktutan                                                 2018-11-22 05:49:31  0.90909  
2311178  thexxy                                                           2018-11-24 06:58:52  0.90909  
 599132  Arunkumar V Ramanan                                              2018-11-25 09:38:21  0.90909  
 241671  Niranjan Kumar Nakkala                                           2018-11-28 15:51:43  0.90909  
  43178  Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo  2018-11-30 09:24:05  0.90909  
2325183  soowhanpark                                                      2018-12-01 04:47:19  0.90909  
2087444  Vansh Jatana                                                     2018-12-01 13:21:35  0.90909  
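
The titanic competition is scored by accuracy, the fraction of passengers whose survival is predicted correctly, so a score of 1.00000 means every prediction matched. As a rough illustration, here is a minimal sketch of that calculation; the labels are made up for the example and are not competition data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 1 = survived, 0 = did not survive.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0]  # the last prediction is wrong

score = accuracy(y_true, y_pred)
print(f"{score:.5f}")
```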

Example: Downloading the titanic dataset

We will explore one of the most well-known datasets, the titanic dataset. Always list the files associated with the competition of interest before downloading, as some of the required files can be larger than 100MB. For the titanic competition the files are small: each is under 1MB.

In [10]:
!kaggle competitions files titanic
name                   size  creationDate         
---------------------  ----  -------------------  
train.csv              60KB  2013-06-28 13:40:25  
test.csv               28KB  2013-06-28 13:40:24  
gender_submission.csv   3KB  2017-02-01 01:49:18  
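
The size check described above can be automated once the listing is available as text. The sketch below is an illustration only: `size_to_bytes` is a hypothetical helper, and it assumes that 1KB means 1024 bytes, which may not match exactly how the CLI rounds the sizes it prints. The file names and sizes are copied from the listing above.

```python
import re

# Assumption: binary units (1KB = 1024 bytes).
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def size_to_bytes(size):
    """Convert a size string such as '60KB' into a number of bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([KMG]?B)", size.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {size!r}")
    number, unit = match.groups()
    return int(float(number) * UNITS[unit])


# File names and sizes copied from the listing above.
files = {"train.csv": "60KB", "test.csv": "28KB", "gender_submission.csv": "3KB"}

total = sum(size_to_bytes(size) for size in files.values())
limit = 100 * 1024**2  # the 100MB threshold mentioned in the text
large = [name for name, size in files.items() if size_to_bytes(size) > limit]

print(f"total download: {total} bytes, files over 100MB: {large}")
```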

We can easily download the files; here they are saved into the current working directory.

In [15]:
!kaggle competitions download titanic
Downloading train.csv to /home/randlow/github/blog2/posts/machine-learning
  0%|                                               | 0.00/59.8k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 59.8k/59.8k [00:00<00:00, 13.8MB/s]
Downloading test.csv to /home/randlow/github/blog2/posts/machine-learning
  0%|                                               | 0.00/28.0k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 28.0k/28.0k [00:00<00:00, 5.97MB/s]
Downloading gender_submission.csv to /home/randlow/github/blog2/posts/machine-learning
  0%|                                               | 0.00/3.18k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 3.18k/3.18k [00:00<00:00, 1.12MB/s]
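
After a download it can be worth confirming that every expected file actually arrived. The following is a small sketch of such a check; `missing_files` is a hypothetical helper, and the temporary directory with a placeholder file only stands in for a real download folder.

```python
import tempfile
from pathlib import Path

# File names taken from the `kaggle competitions files titanic` listing.
EXPECTED = ["train.csv", "test.csv", "gender_submission.csv"]


def missing_files(directory, expected=EXPECTED):
    """Return the expected file names that are not present in directory."""
    directory = Path(directory)
    return [name for name in expected if not (directory / name).is_file()]


# Demonstration with a temporary folder that contains only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.csv").write_text("placeholder\n")
    still_missing = missing_files(tmp)

print(still_missing)
```

In practice the function would be pointed at the directory that the download wrote to.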

Evaluating your submission scores

We can also review the scores of our submissions to a given competition. The example below uses the home-credit-default-risk competition.

In [2]:
!kaggle competitions submissions -c home-credit-default-risk
fileName                                 date                 description  status    publicScore  privateScore  
---------------------------------------  -------------------  -----------  --------  -----------  ------------  
random-forest-home-loan-credit-risk.csv  2019-02-11 05:24:55  submitted    complete  0.68694      0.68886       
random-forest-home-loan-credit-risk.csv  2019-02-11 05:10:40  submitted    complete  0.68694      0.68886       
logit-home-loan-credit-risk.csv          2019-02-11 04:52:51  submitted    complete  0.66223      0.67583       
random-forest-home-loan-credit-risk.csv  2019-02-11 04:44:50  submitted    complete  0.68694      0.68886       
logit-home-loan-credit-risk.csv          2019-02-08 04:08:33  submitted    complete  0.66223      0.67583       
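
With several submissions of the same file, it helps to reduce the table to the best score per file. The sketch below does this for the public score, assuming that a higher score is better for this competition; the records are copied from the output above and the variable names are illustrative.

```python
# Records copied from the submissions table above (scores kept as printed).
submissions = [
    {"fileName": "random-forest-home-loan-credit-risk.csv", "publicScore": "0.68694"},
    {"fileName": "random-forest-home-loan-credit-risk.csv", "publicScore": "0.68694"},
    {"fileName": "logit-home-loan-credit-risk.csv", "publicScore": "0.66223"},
    {"fileName": "random-forest-home-loan-credit-risk.csv", "publicScore": "0.68694"},
    {"fileName": "logit-home-loan-credit-risk.csv", "publicScore": "0.66223"},
]

# Keep the best public score seen for each submitted file.
best_by_file = {}
for submission in submissions:
    name = submission["fileName"]
    score = float(submission["publicScore"])
    if name not in best_by_file or score > best_by_file[name]:
        best_by_file[name] = score

# Assumption: a higher score is better.
best_file = max(best_by_file, key=best_by_file.get)
print(best_file, best_by_file[best_file])
```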

Summary

We have learnt how to use the Kaggle API to explore Kaggle competitions and download datasets. We also learnt how to retrieve the scores of the models we submitted to a competition. For more details, see the Kaggle API GitHub repository or the documentation on the Kaggle website.

My Python workflow for data science and financial research

This summarizes my Python setup for data science/finance/economics research. It includes some of the useful Python packages, JupyterLab extensions, and programs I find useful in my development.


Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on extracting and understanding important information from text. Some applications are as follows:

  1. Reading a long essay (>5,000 words) and extracting its major themes (e.g., bag-of-words analysis).
  2. Finding the major themes in a whole set of documents (i.e., a corpus) that we know is about cooking, such as whether the corpus mentions Asian or Western cooking more (e.g., term frequency-inverse document frequency).
  3. Building a chatbot that needs to understand what question a person is asking and then direct them to the right resources to answer it.
  4. Translating a set of documents from English to French, or vice versa.
  5. Analyzing the general sentiment on a specific topic (e.g., immigration) using a large set of documents written on that topic.
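
The first two applications can be illustrated with the standard library alone. The sketch below counts words per document (bag-of-words) and then weights them with a plain, unsmoothed TF-IDF; the two example sentences are made up, and real libraries such as gensim or sklearn use their own tokenization and weighting variants.

```python
import math
import re
from collections import Counter

# Two made-up documents standing in for a cooking corpus.
docs = {
    "asian": "stir fry the rice and add soy sauce to the rice",
    "western": "roast the potatoes and add butter to the potatoes",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Bag-of-words: one word-count table per document.
bags = {name: Counter(tokenize(text)) for name, text in docs.items()}


def tf_idf(term, name):
    """Term frequency in one document times inverse document frequency."""
    tf = bags[name][term] / sum(bags[name].values())
    df = sum(1 for bag in bags.values() if term in bag)
    if df == 0:
        return 0.0  # the term does not occur anywhere in the corpus
    return tf * math.log(len(bags) / df)


print(bags["asian"]["rice"])    # raw count of "rice" in the first document
print(tf_idf("the", "asian"))   # appears in every document, so the weight is zero
print(tf_idf("rice", "asian"))  # specific to one document, so the weight is positive
```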

Specifically for finance, NLP can be used in the following applications:

  • Extracting the buy/sell sentiment from a corpus of sell-side analyst reports on a specific stock.
  • Understanding economic sentiment from conference calls, online job postings, inflation chatter, e-invoicing, etc.
  • Measuring the amount of media attention given to political events.
  • Extracting a long-term view of global economic sentiment from Warren Buffett's annual reports.

There are several popular NLP libraries, and I found the nice summary below from ActiveWizards:

[Image: nlp library]

Important packages to install using conda for Natural Language Processing are:

  1. gensim
  2. stop_words

In this blog series, we will go through the following packages where [x] is available and [ ] is in the pipeline:

  1. [x] Regular expressions with re
  2. [x] Tokenization with nltk
  3. [x] Bag-of-Words (BoW) with nltk
  4. [x] Bag-of-Words (BoW) with gensim
  5. [x] Named-Entity-Recognition (NER) with nltk
  6. [ ] Production-grade NLP with spacy
  7. [ ] Translation with polyglot
  8. [x] Text classification with sklearn
  9. [ ] Text scraping with BeautifulSoup
  10. [ ] [CNNs with t]

Timeseries forecasting

Timeseries forecasting can generally be split into two categories:

1) Signal processing. Signal processing is typically used in engineering and econometrics. ARIMA/GARCH models attempt to filter the 'signals' out from the noise and extrapolate them into the future. Famous models for interest rate pricing are short-rate models (e.g., the Vasicek and Cox-Ingersoll-Ross (CIR) models). CIR models allow for mean-reversion,
