Better Numbered List NLP with spaCy

2023-05-07


I've been working on natural language processing (NLP) tasks recently with spaCy to help build a named entity recognition pipeline that will be used to train a machine learning model.

I'm using the en_core_web_sm trained pipeline, but its sentence segmentation doesn't handle numbered lists the way I need it to. I don't want the period after a list number to cause a sentence break.

In other words, I want this example input text to have 3 sentences:

This is a test.

1. This is the first item.
2. This is the second item.

spaCy's tokenizer has an add_special_case method that can be used to treat the number and period in a numbered list as a single token. With no separate period token after the number, there is nothing there to trigger a sentence split.

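import spacy

nlp = spacy.load('en_core_web_sm')
# Keep '1.' as a single token instead of splitting off the period.
nlp.tokenizer.add_special_case('1.', [{
    'ORTH': '1.',
    'NORM': 'Number 1:',
}])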
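
A quick way to confirm the special case took effect is to inspect the tokens. The sketch below assumes a blank English pipeline (spacy.blank('en')) in place of en_core_web_sm so that it runs without a model download. Only the tokenizer matters for this check, and the blank pipeline should apply the same default English tokenizer rules.

```python
import spacy

# Assumption: a blank English pipeline stands in for en_core_web_sm,
# because this check only exercises the tokenizer.
nlp = spacy.blank('en')

nlp.tokenizer.add_special_case('1.', [{
    'ORTH': '1.',
    'NORM': 'Number 1:',
}])

doc = nlp('1. This is the first item.')
first = doc[0]

# The number and its period should come back as one token with the custom norm.
print(first.text, '|', first.norm_)
```

If the special case is registered, the first token's text is '1.' and its norm_ is the replacement string, rather than '1' followed by a separate '.' token.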

We can use a loop to build special cases for the first one thousand numbers:

for n in range(1, 1001):
    nlp.tokenizer.add_special_case(f'{n}.', [{
        'ORTH': f'{n}.',
        'NORM': f'Number {n}:',
    }])
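
To check the sentence count on the example text, something like the following could be used. Note the swap: this sketch uses spaCy's rule-based sentencizer on a blank English pipeline, not the en_core_web_sm parser, so that it runs without a model download. The parser's sentence boundaries are statistical predictions and may differ, so the same check is worth repeating against the real pipeline.

```python
import spacy

# Assumption: blank pipeline + rule-based sentencizer, standing in for
# the parser-based segmentation of en_core_web_sm.
nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')

# Register '1.' through '1000.' as single tokens.
for n in range(1, 1001):
    nlp.tokenizer.add_special_case(f'{n}.', [{
        'ORTH': f'{n}.',
        'NORM': f'Number {n}:',
    }])

text = (
    'This is a test.\n\n'
    '1. This is the first item.\n'
    '2. This is the second item.'
)

doc = nlp(text)

# Strip the newlines that end up at the start of each list-item sentence.
sentences = [sent.text.strip() for sent in doc.sents]
for sentence in sentences:
    print(repr(sentence))
```

With the special cases in place, the sentencizer only breaks on the standalone periods at the end of each sentence, so each list item stays together with its number.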