Better Numbered List NLP with spaCy
2023-05-07
I've been working on natural language processing (NLP) tasks recently with spaCy to help build a named entity recognition pipeline that will be used to train a machine learning model.
I'm using the en_core_web_sm trained pipeline, but its sentence segmentation doesn't handle numbered lists the way I need it to. I don't want the period after each list number to cause a sentence break.
In other words, I want this example input text to have 3 sentences:
This is a test.
1. This is the first item.
2. This is the second item.
spaCy's tokenizer has an add_special_case method that can be used to treat the number and period in a numbered list as a single token, which keeps the period from being split off as a separate token and triggering a sentence break.
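import spacy
nlp = spacy.load('en_core_web_sm')

# Keep "1." together as one token instead of splitting off the period.
nlp.tokenizer.add_special_case('1.', [{
    'ORTH': '1.',
    'NORM': 'Number 1:',
}])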
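To see what the special case changes, the tokens can be inspected directly. This is a minimal sketch that assumes a blank English pipeline (spacy.blank('en')) in place of en_core_web_sm, so it runs without downloading a model. The tokenizer rules should behave the same way for this input, but treat that as an assumption.

```python
import spacy

# Assumption: a blank English pipeline stands in for en_core_web_sm here,
# since only the tokenizer is being inspected.
nlp = spacy.blank('en')

nlp.tokenizer.add_special_case('1.', [{
    'ORTH': '1.',
    'NORM': 'Number 1:',
}])

doc = nlp('1. This is the first item.')

# Collect the surface text and the normalized form of every token.
tokens = [token.text for token in doc]
norms = [token.norm_ for token in doc]

print(tokens)
print(norms[0])
```

The first token should come back as '1.' with the NORM value from the special case, while the period that ends the sentence is still split off as its own token.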
We can use a loop to build special cases for the first one thousand numbers:
for n in range(1, 1001):
    nlp.tokenizer.add_special_case(f'{n}.', [{
        'ORTH': f'{n}.',
        'NORM': f'Number {n}:',
    }])
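A quick way to check the effect on the example input is to count sentences before and after registering the special cases. The sketch below swaps in spaCy's rule-based sentencizer on a blank English pipeline rather than the en_core_web_sm parser used above, so it only illustrates how the tokenizer change affects punctuation-based splitting. Results with the trained parser may differ.

```python
import spacy

# Assumption: blank English pipeline plus the rule-based sentencizer,
# not the en_core_web_sm parser this post uses.
nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')

text = (
    'This is a test.\n'
    '1. This is the first item.\n'
    '2. This is the second item.'
)

# Sentence count with the default tokenizer rules.
before = len(list(nlp(text).sents))

# Register a special case for each list marker from "1." to "1000.".
for n in range(1, 1001):
    nlp.tokenizer.add_special_case(f'{n}.', [{
        'ORTH': f'{n}.',
        'NORM': f'Number {n}:',
    }])

# Sentence count once list markers are kept as single tokens.
after = len(list(nlp(text).sents))

print(before, after)
```

With the markers kept as single tokens, the sentencizer no longer sees a standalone period after each list number, so the example should come out as three sentences.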