[Python] Natural language processing 6.3: Evaluation

The test set

Note that the same data cannot be used both for training and for testing: choosing the size of the test set involves a tradeoff between the amount of training data and the reliability of the evaluation.

Another consideration when selecting a test set is how similar its examples are to the examples in the training set. The more similar the two data sets are, the less confident we can be that the evaluation results will generalize to other data sets.
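A simple way to create such a split is to shuffle the annotated data and hold out a fixed fraction for testing; the 90/10 ratio and the helper name below are just illustrative choices, not part of the original text:

```python
import random

def train_test_split(data, test_fraction=0.1, seed=0):
    """Shuffle the data, then reserve test_fraction of it for testing."""
    items = list(data)
    random.Random(seed).shuffle(items)          # fixed seed for reproducibility
    cut = int(len(items) * test_fraction)
    return items[cut:], items[:cut]             # (train_set, test_set)

train_set, test_set = train_test_split(range(100))
print(len(train_set), len(test_set))            # 90 10
```

Shuffling before splitting matters when the corpus is ordered (e.g. by genre), since otherwise the test set would be systematically different from the training set.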


The simplest metric for evaluating a classifier is accuracy: the fraction of test-set items that the classifier labels correctly.

When interpreting a classifier's accuracy score, it is important to consider the frequencies of the individual class labels in the test set: a classifier that always predicts the most frequent label can score high on accuracy without being useful.
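A minimal sketch of computing accuracy by hand; the labels below are toy examples invented for illustration:

```python
def accuracy(gold, predicted):
    """Fraction of labels the classifier got right."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

gold      = ['spam', 'ham', 'ham', 'ham', 'spam', 'ham']
predicted = ['spam', 'ham', 'spam', 'ham', 'ham', 'ham']
print(accuracy(gold, predicted))   # 4 of 6 labels correct
```

Note that always predicting 'ham' would already score 4/6 here, which is why label frequencies must be kept in mind.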

Precision and recall

True positives are relevant items that we correctly identified as relevant.

True negatives are irrelevant items that we correctly identified as irrelevant.

False positives (or Type I errors) are irrelevant items that we incorrectly identified as relevant.

False negatives (or Type II errors) are relevant items that we incorrectly identified as irrelevant.

Given these four numbers, we can define the following metrics:

Precision indicates how many of the items that we identified were relevant: TP/(TP+FP).

Recall indicates how many of the relevant items we identified: TP/(TP+FN).

The F-measure (or F-score) combines precision and recall into a single score, defined as the harmonic mean of the two: (2 × Precision × Recall)/(Precision + Recall).
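These definitions translate directly into code; the counts used below are made-up numbers, purely for illustration:

```python
def precision(tp, fp):
    # Of the items we flagged as relevant, how many really were?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of the truly relevant items, how many did we find?
    return tp / (tp + fn)

def f_measure(p, r):
    # Harmonic mean of precision and recall.
    return (2 * p * r) / (p + r)

p = precision(tp=8, fp=2)    # 0.8
r = recall(tp=8, fn=8)       # 0.5
print(f_measure(p, r))
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier cannot score well on the F-measure by doing well on only one of precision or recall.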

The confusion matrix

A confusion matrix gives more fine-grained information about a model's errors: each cell indicates how often the tag in its column was predicted when the correct tag was the one in its row.

The following code makes this concrete, using a part-of-speech tagger:

>>> import nltk
>>> from nltk.corpus import brown
>>> def tag_list(tagged_sents):
...     return [tag for sent in tagged_sents for (word, tag) in sent]
>>> def apply_tagger(tagger, corpus):
...     return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]
>>> gold = tag_list(brown.tagged_sents(categories='editorial'))
>>> test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))  # t2: the tagger trained earlier
>>> cm = nltk.ConfusionMatrix(gold, test)
>>> print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))

The result is as follows (the output is wide, so the reader may need a monospaced display to keep the columns aligned):

    |                                         N                      |
    |      N      I      A      J             N             V      N |
    |      N      N      T      J      .      S      ,      B      P |
----+----------------------------------------------------------------+
 NN | <11.8%>  0.0%      .   0.2%      .   0.0%      .   0.3%   0.0% |
 IN |   0.0%  <9.0%>     .      .      .   0.0%      .      .      . |
 AT |      .      .  <8.6%>     .      .      .      .      .      . |
 JJ |   1.6%      .      .  <4.0%>     .      .      .   0.0%   0.0% |
  . |      .      .      .      .  <4.8%>     .      .      .      . |
NNS |   1.5%      .      .      .      .  <3.2%>     .      .   0.0% |
  , |      .      .      .      .      .      .  <4.4%>     .      . |
 VB |   0.9%      .      .   0.0%      .      .      .  <2.4%>     . |
 NP |   1.0%      .      .   0.0%      .      .      .      . <1.9%> |
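Underneath the pretty-printing, a confusion matrix is just a tally of (gold tag, predicted tag) pairs; a pure-Python sketch, with toy tag sequences invented for illustration:

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    # Counts (gold_tag, predicted_tag) pairs; entries where the two
    # tags agree correspond to the diagonal of the printed matrix.
    return Counter(zip(gold, predicted))

gold = ['NN', 'NN', 'JJ', 'NN', 'JJ']
pred = ['NN', 'JJ', 'JJ', 'NN', 'NN']
cm = confusion_matrix(gold, pred)
print(cm[('NN', 'JJ')])   # NN mis-tagged as JJ once
```

Reading the large cells off the diagonal, such as the 1.6% in the JJ row of the NN column above, tells us which confusions dominate and where tagger improvements would pay off most.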

Cross validation


To evaluate our models, we must reserve a portion of the annotated data for the test set. As we already mentioned, if the test set is too small, our evaluation may not be accurate. However, making the test set larger usually means making the training set smaller, which can have a significant impact on performance if only a limited quantity of annotated data is available.

The solution:

One solution to this problem is to perform multiple evaluations on different test sets, then combine the scores from those evaluations. This technique is called cross-validation. In particular, we subdivide the original corpus into N subsets called folds.

For each of these folds, we train a model using all of the data except the data in that fold, and then test the model on that fold. Even though the individual folds may be too small to give accurate evaluation scores on their own, the combined evaluation score is based on a large amount of data, and is therefore quite reliable.


Cross-validation also lets us examine how much performance varies across different training sets. If we get very similar scores from all N training sets, we can be fairly confident that the score is accurate. On the other hand, if the scores vary widely across the N training sets, we should probably be sceptical about the accuracy of the evaluation score.
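A minimal sketch of N-fold cross-validation; the majority-label "classifier" here is a toy stand-in (invented for this sketch) for a real tagger or classifier:

```python
def cross_validate(data, n_folds, train, evaluate):
    """For each of n_folds folds, train on the other folds and
    evaluate on the held-out fold; return the per-fold scores."""
    scores = []
    for i in range(n_folds):
        test_fold = data[i::n_folds]   # every n_folds-th item, offset i
        train_folds = [x for j, x in enumerate(data) if j % n_folds != i]
        model = train(train_folds)
        scores.append(evaluate(model, test_fold))
    return scores

# Toy stand-ins: the "model" just memorizes the most frequent label.
def train_majority(examples):
    labels = [label for _, label in examples]
    return max(set(labels), key=labels.count)

def eval_majority(model, examples):
    return sum(1 for _, label in examples if label == model) / len(examples)

data = [({}, 'a')] * 8 + [({}, 'b')] * 4
scores = cross_validate(data, 3, train_majority, eval_majority)
print(scores, sum(scores) / len(scores))
```

Comparing the per-fold scores before averaging them is exactly the variance check described above: widely differing fold scores are a warning sign.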

Posted by Corrine at November 24, 2013 - 4:33 AM