Using spaCy¶

spaCy is an open-source Python library for NLP, similar in spirit to gensim. It ships with useful modules such as displaCy, its built-in visualizer. spaCy is useful for NER because it uses a different set of entity types from nltk and can therefore label the same data differently. It also has informal-language corpora, which is useful for chat messages and Tweets. spaCy is built for speed and is designed for production work rather than research.

Import modules¶

There are two ways to load a spaCy model:

Using `spacy.load`¶

import spacy
nlp = spacy.load('en')  # 'en' is the shortcut link created by the download command
doc = nlp(u"This is a sentence.")
print([(w.text, w.pos_) for w in doc])

Importing as a module¶

!python -m spacy download en  
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp(u"This is a sentence.")
print([(w.text, w.pos_) for w in doc])

spaCy also provides models for other languages, such as German, Spanish, Portuguese and French, as well as a multi-language model. The main difference is changing en to de, es, pt, fr, xx, etc., and importing the matching model package.

Importing the french module¶

!python -m spacy download fr  
import fr_core_news_sm
nlp = fr_core_news_sm.load()
doc = nlp(u"C'est une phrase.")
print([(w.text, w.pos_) for w in doc])

Importing the multi-language module¶

!python -m spacy download xx
import xx_ent_wiki_sm
nlp = xx_ent_wiki_sm.load()
doc = nlp(u"This is a line about Python")
print([(ent.text, ent.label_) for ent in doc.ents])  # label_ is the readable string; label is an integer ID
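
The examples above differ only in the package name, so the pattern can be wrapped in one helper. The sketch below is an illustration, not part of spaCy: `MODEL_PACKAGES` and `load_model` are hypothetical names, and the `de`, `es` and `pt` entries assume the small spaCy 2.0 packages that follow the same naming scheme as the French one. Each package must already have been downloaded with `python -m spacy download`.

```python
import importlib

# Language code -> model package name. The en, fr and xx entries come from
# the examples above; de, es and pt are assumed to follow the same scheme.
MODEL_PACKAGES = {
    'en': 'en_core_web_sm',
    'de': 'de_core_news_sm',
    'es': 'es_core_news_sm',
    'fr': 'fr_core_news_sm',
    'pt': 'pt_core_news_sm',
    'xx': 'xx_ent_wiki_sm',
}


def load_model(lang):
    """Import the model package for `lang` and return its loaded pipeline."""
    try:
        package = MODEL_PACKAGES[lang]
    except KeyError:
        raise ValueError('No model package known for language code %r' % lang)
    # Equivalent to `import fr_core_news_sm; fr_core_news_sm.load()`
    module = importlib.import_module(package)
    return module.load()
```

With the French package installed, `nlp = load_model('fr')` would give the same pipeline as the French example above.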

In [30]:

# The '!' runs a command in the terminal. On Windows, creating the symlink that spacy.load('en')
# relies on failed here (see the output below), so we import the model as a module instead.
!python -m spacy download en  
import spacy
import os
import en_core_web_sm

Requirement already satisfied: en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0 in c:\users\xxklow\anaconda3\envs\pelican\lib\site-packages (2.0.0)

    Error: Couldn't link model to 'en'
    Creating a symlink in spacy/data failed. Make sure you have the required
    permissions and try re-running the command as admin, or use a
    virtualenv. You can still import the model as a module and call its
    load() method, or create the symlink manually.

    C:\Users\xxklow\Anaconda3\envs\pelican\lib\site-packages\en_core_web_sm
    -->
    C:\Users\xxklow\Anaconda3\envs\pelican\lib\site-packages\spacy\data\en


    Creating a shortcut link for 'en' didn't work (maybe you don't have
    admin permissions?), but you can still load the model via its full
    package name: nlp = spacy.load('{name}')
    Download successful but linking failed
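
The message above suggests a fallback: try the shortcut first and import the package when the link is missing. The sketch below assumes that `spacy.load` raises `OSError` when it cannot resolve the shortcut; that is an assumption about spaCy 2.0 behaviour, not something the output confirms. `load_english` is a hypothetical helper name.

```python
def load_english(shortcut='en', package='en_core_web_sm'):
    """Load the English model via its shortcut link, else via its package."""
    # Imported here so the helper only needs spaCy when it is called
    import importlib
    import spacy
    try:
        # Works when the shortcut link was created successfully
        return spacy.load(shortcut)
    except OSError:
        # Assumed failure mode when the link is missing (e.g. on Windows
        # without admin rights): import the package and call load() directly
        return importlib.import_module(package).load()
```

Calling `nlp = load_english()` would then work the same way whether or not the link could be created.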

Read file¶

In [31]:

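currDir = os.getcwd()
inputDir = 'inputs'
fileName = 'aeon.txt'
# os.path.join builds the path with the right separator on any OS
readFile = os.path.join(currDir, inputDir, fileName)

# The context manager closes the file even if reading fails
with open(readFile, 'r') as f:
    doc_input = f.read()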

In [40]:

# Like gensim's corpora, the model comes pre-trained, so it can perform NER out of the box.
nlp = en_core_web_sm.load()

doc = nlp(doc_input)

for ent in doc.ents[0:10]:
    print(ent.label_,ent.text)

NORP Jew
PERSON home’
NORP Jews
DATE the 19th century
ORG Jewishness
NORP Jews
NORP Christian
NORP Jewish
ORG Shylock
PERSON Rothschild

Creating a list of (entity, label) tuples

In [49]:

[(ent, ent.label_) for ent in doc.ents[0:10]]

Out[49]:

[(Jew, 'NORP'),
 (home’, 'PERSON'),
 (Jews, 'NORP'),
 (the 19th century, 'DATE'),
 (Jewishness, 'ORG'),
 (Jews, 'NORP'),
 (Christian, 'NORP'),
 (Jewish, 'NORP'),
 (Shylock, 'ORG'),
 (Rothschild, 'PERSON')]
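
Once the entities are in a list, they are easy to summarize. The sketch below uses the ten pairs from the output above as plain strings, so it runs without spaCy; with a real `doc` the list would be built as `[(ent.text, ent.label_) for ent in doc.ents]`. The names `pairs`, `label_counts` and `by_label` are chosen for illustration.

```python
from collections import Counter, defaultdict

# The first ten entities from the output above, as (text, label) strings
pairs = [
    ('Jew', 'NORP'), ('home’', 'PERSON'), ('Jews', 'NORP'),
    ('the 19th century', 'DATE'), ('Jewishness', 'ORG'), ('Jews', 'NORP'),
    ('Christian', 'NORP'), ('Jewish', 'NORP'), ('Shylock', 'ORG'),
    ('Rothschild', 'PERSON'),
]

# How often each entity type occurs
label_counts = Counter(label for _, label in pairs)

# Entity texts grouped by type, in order of appearance
by_label = defaultdict(list)
for text, label in pairs:
    by_label[label].append(text)

print(label_counts.most_common())
print(by_label['PERSON'])
```

Grouping by label makes doubtful predictions easy to spot, such as `home’` being tagged as PERSON here.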

In [46]:

sent = nlp(u'This is a sentence.')
from spacy import displacy
displacy.serve(sent,style='dep')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-46-81a256e2f785> in <module>()
      1 sent = nlp(u'This is a sentence.')
      2 from spacy import displacy
----> 3 displacy.serve(sent,style='dep')

~\Anaconda3\envs\pelican\lib\site-packages\spacy\displacy\__init__.py in serve(docs, style, page, minify, options, manual, port)
     60     from wsgiref import simple_server
     61     render(docs, style=style, page=page, minify=minify, options=options,
---> 62            manual=manual)
     63     httpd = simple_server.make_server('0.0.0.0', port, app)
     64     prints("Using the '{}' visualizer".format(style),

~\Anaconda3\envs\pelican\lib\site-packages\spacy\displacy\__init__.py in render(docs, style, page, minify, jupyter, options, manual)
     37     renderer, converter = factories[style]
     38     renderer = renderer(options=options)
---> 39     parsed = [converter(doc, options) for doc in docs] if not manual else docs
     40     _html['parsed'] = renderer.render(parsed, page=page, minify=minify).strip()
     41     html = _html['parsed']

~\Anaconda3\envs\pelican\lib\site-packages\spacy\displacy\__init__.py in <listcomp>(.0)
     37     renderer, converter = factories[style]
     38     renderer = renderer(options=options)
---> 39     parsed = [converter(doc, options) for doc in docs] if not manual else docs
     40     _html['parsed'] = renderer.render(parsed, page=page, minify=minify).strip()
     41     html = _html['parsed']

~\Anaconda3\envs\pelican\lib\site-packages\spacy\displacy\__init__.py in parse_deps(orig_doc, options)
     87     RETURNS (dict): Generated dependency parse keyed by words and arcs.
     88     """
---> 89     doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes())
     90     if not doc.is_parsed:
     91         user_warning(Warnings.W005)

doc.pyx in spacy.tokens.doc.Doc.to_bytes()

~\Anaconda3\envs\pelican\lib\site-packages\spacy\util.py in to_bytes(getters, exclude)
    484         if key not in exclude:
    485             serialized[key] = getter()
--> 486     return msgpack.dumps(serialized, use_bin_type=True, encoding='utf8')
    487 
    488 

~\Anaconda3\envs\pelican\lib\site-packages\msgpack_numpy.py in packb(o, **kwargs)
    194     """
    195 
--> 196     return Packer(**kwargs).pack(o)
    197 
    198 def unpack(stream, **kwargs):

TypeError: __init__() got an unexpected keyword argument 'encoding'

The traceback shows that the `displacy.serve` call itself is not at fault: the error is raised inside msgpack_numpy when spaCy serializes the Doc and passes an `encoding` keyword that the installed packer does not accept. This points to a version mismatch between this spaCy installation and the installed msgpack packages rather than a bug in the cell above.
