[Natural Language Processing with Python] 7.2 Chunking

The basic technique for entity recognition is chunking.


Noun phrase chunking (NP-chunking)

Here is a labeled example:


The noun phrase chunks are marked with square brackets.

One of the most useful sources of information for NP-chunking is part-of-speech tags.

To create an NP chunker, we first define a chunk grammar, which consists of rules that specify how sentences should be chunked.

We define the rules with regular expressions, and we can set them ourselves: an NP chunk consists of an optional determiner (DT), followed by any number of adjectives (JJ), and then a noun (NN).

The following is sample code:


>>> import nltk
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
...             ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


Tag patterns are similar to regular expression patterns, for example <DT>?<JJ>*<NN>.
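To make the similarity concrete, here is a minimal sketch that applies the same tag pattern with Python's own re module. The helper name find_np_spans and the hand-written translation of the pattern are assumptions made for illustration; this is a simplified picture, not how NLTK is implemented.

```python
import re

def find_np_spans(tagged):
    """Return (start, end) token index pairs matching <DT>?<JJ>*<NN>.

    Hypothetical helper for illustration only, not part of NLTK.
    """
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    # Hand-translated equivalent of the tag pattern <DT>?<JJ>*<NN>.
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    spans = []
    for m in pattern.finditer(tag_string):
        # Convert character offsets to token indices by counting "<".
        start = tag_string.count("<", 0, m.start())
        end = tag_string.count("<", 0, m.end())
        spans.append((start, end))
    return spans

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]
spans = find_np_spans(sentence)
chunks = [[word for word, _ in sentence[s:e]] for s, e in spans]
print(spans)   # [(0, 4), (6, 8)]
print(chunks)  # [['the', 'little', 'yellow', 'dog'], ['the', 'cat']]
```

The angle brackets act as token boundaries, so quantifiers such as ? and * apply to whole tags rather than to single characters.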

Chunking with regular expressions

The previous example had only one rule; this one defines two. Chunking still works in the same way as above.

NNP is a proper noun; DT is a determiner; PP$ is a possessive pronoun ($ is a special character in regular expressions, so it must be escaped with \ to match); JJ is an adjective.

>>> grammar = r"""
... NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
...     {<NNP>+}                # chunk sequences of proper nouns
... """
>>> cp = nltk.RegexpParser(grammar)
>>> sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
...             ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
>>> print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))

If a tag pattern matches at overlapping positions, the leftmost match takes precedence.

For example:

>>> nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
>>> grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.parse(nouns))
(S (NP money/NN market/NN) fund/NN)

To solve this problem, we can use a more permissive rule: NP: {<NN>+}.
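As a quick check of the more permissive rule, the following sketch re-runs the three-noun example with NP: {<NN>+}. It assumes NLTK 3 is installed; no corpus download is needed.

```python
import nltk

nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]

# One or more consecutive nouns form a single NP chunk.
cp = nltk.RegexpParser("NP: {<NN>+}  # chunk one or more consecutive nouns")
result = cp.parse(nouns)
print(result)  # (S (NP money/NN market/NN fund/NN))
```

All three nouns now end up in the same chunk, instead of the third one being left outside.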

Exploring text corpora

A chunker also makes it easier to search a tagged corpus for phrases that match a particular sequence of part-of-speech tags:

>>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
>>> brown = nltk.corpus.brown
>>> for sent in brown.tagged_sents():
...     tree = cp.parse(sent)
...     for subtree in tree.subtrees():
...         # NLTK 3 uses subtree.label(); older releases used subtree.node
...         if subtree.label() == 'CHUNK': print(subtree)
...
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
(CHUNK seems/VBZ to/TO overtake/VB)
(CHUNK want/VB to/TO buy/VB)

Chinking

A chink is a sequence of tokens that is excluded from a chunk; chinking removes such a sequence from an existing chunk. If the matching sequence spans an entire chunk, the whole chunk is removed.

The following example demonstrates this in code:

>>> grammar = r"""
...   NP:
...     {<.*>+}          # Chunk everything
...     }<VBD|IN>+{      # Chink sequences of VBD and IN
...   """
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
...             ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.parse(sentence))
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
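The case where a chink covers an entire chunk can be sketched with a two-token input made up only of VBD and IN tags. The input is invented for illustration, and the sketch assumes NLTK 3 is installed.

```python
import nltk

grammar = r"""
  NP:
    {<.*>+}          # chunk everything
    }<VBD|IN>+{      # chink sequences of VBD and IN
  """
cp = nltk.RegexpParser(grammar)

# Every token here is VBD or IN, so the chink covers the whole chunk.
result = cp.parse([("barked", "VBD"), ("at", "IN")])
print(result)

# Collect any NP subtrees that remain under the root.
np_chunks = [t for t in result.subtrees() if t.label() == "NP"]
print(len(np_chunks))
```

The count of remaining NP chunks should be zero, since nothing is left of the single chunk once the chink has been taken out.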

Representing chunks: tags versus trees

Chunk structures are an intermediate state between tagging and parsing.

Chunk structures can be represented using either tags or trees. The most widely used representation is IOB tags.

Each token is tagged with one of three special chunk tags.

These are I (inside), O (outside), and B (begin). A token is tagged B if it starts a chunk, and the following tokens within the same chunk are tagged I. There is generally no need to specify a chunk type for tokens outside any chunk, so they are simply tagged O.
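The tagging scheme can be sketched in plain Python. The helper name spans_to_iob and the hard-coded chunk spans are assumptions chosen for illustration; this is not NLTK's own code.

```python
def spans_to_iob(tagged, spans, chunk_type="NP"):
    """Label each token B-<type>, I-<type>, or O from (start, end) chunk spans.

    Hypothetical helper for illustration only, not part of NLTK.
    """
    iob = ["O"] * len(tagged)              # tokens outside any chunk stay O
    for start, end in spans:
        iob[start] = "B-" + chunk_type     # first token of the chunk
        for i in range(start + 1, end):
            iob[i] = "I-" + chunk_type     # remaining tokens inside the chunk
    return [(word, tag, label) for (word, tag), label in zip(tagged, iob)]

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]

# Two NP chunks: tokens 0..3 and tokens 6..7 (end index is exclusive).
triples = spans_to_iob(sentence, [(0, 4), (6, 8)])
for triple in triples:
    print(triple)
```

Each output triple carries the word, its part-of-speech tag, and its chunk tag, so "the" is labeled B-NP, "little" I-NP, and "barked" O.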

The figure shows the tag representation of chunk structures.



Similarly, chunks can also be represented as trees, as shown in the figure.
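The two representations can be converted into each other. The sketch below assumes NLTK 3 and uses its tree2conlltags and conlltags2tree helpers from nltk.chunk; the sentence is the one used earlier in this post.

```python
import nltk
from nltk.chunk import tree2conlltags, conlltags2tree

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]
tree = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(sentence)

# Tree -> list of (word, pos_tag, iob_tag) triples.
iob_triples = tree2conlltags(tree)
print(iob_triples[:2])

# IOB triples -> tree again; the round trip should preserve the structure.
rebuilt = conlltags2tree(iob_triples)
print(rebuilt == tree)
```

The tree form makes each chunk a constituent that can be handled directly, while the tag form keeps the sentence as a flat sequence of tokens.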


Posted by Jill at November 26, 2013 - 1:44 PM