Overview
Hej och välkommen! (Hello and welcome!)
Two popular NLP libraries in the Python ecosystem are SpaCy and Polyglot. This tutorial compares the two with a particular focus on the Swedish language, showcasing their abilities in tokenization, part-of-speech tagging, and named entity recognition.
Introduction to SpaCy and Polyglot
SpaCy is an open-source NLP library that provides a wide range of features, including part-of-speech tagging, dependency parsing, named entity recognition, and text classification. It is a powerful library designed for production environments. With its focus on performance and ease of use, SpaCy has become a popular choice for building applications that require sophisticated text processing, even for less-resourced languages like Swedish.
Polyglot, on the other hand, is an NLP library whose primary selling point is its extensive language coverage, with well over 130 supported languages, including Swedish. It offers features like tokenization, part-of-speech tagging, and named entity recognition. Unlike SpaCy, Polyglot also offers built-in sentiment analysis.
SpaCy
Developed by Explosion AI
Open-source software, licensed under the MIT License
Targets version 3.4
Commonly used for text preprocessing, named entity recognition, part-of-speech tagging, and dependency parsing
Supports over 65 languages
Known for its speed and efficiency. It is also widely used in industry and has a large community of contributors. The library is easy to use and provides an extensive set of features.
Polyglot
Developed by Rami Al-Rfou
Open-source software, licensed under the GPLv3 License
Targets version 16.07.04
Commonly used for language detection, named entity recognition, part-of-speech tagging, sentiment analysis, and word embeddings
Supports over 130 languages
Known for its wide range of language support. It bundles features that many NLP libraries do not offer out of the box, such as language detection and transliteration. The library is easy to use and provides extensive documentation.
Installation and Setup
Make sure you have Python 3 installed on your system. If you don’t have it, you can download it from the official Python website.
Installing SpaCy
pip install spacy
After installing SpaCy, download the Swedish language model.
python -m spacy download sv_core_news_sm
Installing Polyglot
pip install polyglot
Polyglot relies on some additional dependencies that need to be installed separately. Install the following packages:
pip install PyICU
pip install pycld2
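pip install morfessor
Polyglot's models are downloaded separately. For the Swedish examples in this tutorial, the embeddings, NER, and POS models are needed:
polyglot download embeddings2.sv ner2.sv pos2.sv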
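Before moving on, it can help to confirm that every package is importable. The following is a minimal sketch using only the standard library. The import names in `REQUIRED_MODULES` are my assumption about how each package is imported (PyICU, for example, is imported as `icu`), and `check_modules` is a hypothetical helper, not part of either library.

```python
from importlib.util import find_spec

# Assumed import names for the packages installed above.
# Note that PyICU is imported as "icu", not "PyICU".
REQUIRED_MODULES = ["spacy", "polyglot", "icu", "pycld2", "morfessor"]


def check_modules(module_names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in module_names}


# Print a short report so missing packages are easy to spot.
for name, available in check_modules(REQUIRED_MODULES).items():
    status = "ok" if available else "MISSING"
    print(f"{name}: {status}")
```

Any package reported as MISSING needs to be installed before the examples in the following sections will run.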
Named Entity Recognition & Part-of-Speech Tagging
To showcase the functionality of both SpaCy and Polyglot, I will perform text preprocessing, named entity recognition (NER), and part-of-speech (POS) tagging using both libraries and compare the results. For the sake of time, only a small sample dataset will be used.
1. Loading a Sample Dataset
sample_data = [
    "IKEA är ett svenskt möbelföretag med huvudkontor i Älmhult.",
    "Stockholm är huvudstad i Sverige.",
    "Skåne ligger i södra Sverige och är känt för sina vackra landskap."
]
2. Preprocessing the Text Using SpaCy and Polyglot
Import the necessary modules and load SpaCy's Swedish model. Polyglot needs no explicit loading step here.
import spacy
from polyglot.text import Text
nlp_spacy = spacy.load("sv_core_news_sm")
Preprocess the text using both libraries.
# Using SpaCy
spacy_docs = [nlp_spacy(text) for text in sample_data]
# Using Polyglot
polyglot_docs = [Text(text, hint_language_code="sv") for text in sample_data]
3. NER and POS tagging with SpaCy
for doc in spacy_docs:
    print(f"Text: {doc.text}")
    print("Named Entities:")
    for ent in doc.ents:
        print(f"{ent.text}: {ent.label_}")
    print("Part-of-Speech Tags:")
    for token in doc:
        print(f"{token.text}: {token.pos_}")
    print("\n")
4. NER and POS tagging with Polyglot
for doc in polyglot_docs:
    print(f"Text: {doc.raw}")
    print("Named Entities:")
    for entity in doc.entities:
        # Each entity is a list of words, so join them into one string
        print(f"{' '.join(entity)}: {entity.tag}")
    print("Part-of-Speech Tags:")
    # doc.pos_tags already yields (word, tag) pairs
    for word, tag in doc.pos_tags:
        print(f"{word}: {tag}")
    print("\n")
5. Comparing the Results
SpaCy
Text: IKEA är ett svenskt möbelföretag med huvudkontor i Älmhult.
Named Entities:
IKEA: ORG
Älmhult: LOC
Part-of-Speech Tags:
IKEA: PROPN
är: AUX
ett: DET
svenskt: ADJ
möbelföretag: NOUN
med: ADP
huvudkontor: NOUN
i: ADP
Älmhult: PROPN
Polyglot
Text: IKEA är ett svenskt möbelföretag med huvudkontor i Älmhult.
Named Entities:
IKEA: I-ORG
Älmhult: I-LOC
Part-of-Speech Tags:
IKEA: NOUN
är: VERB
ett: DET
svenskt: ADJ
möbelföretag: NOUN
med: ADP
huvudkontor: NOUN
i: ADP
Älmhult: NOUN
Both libraries agree on six of the nine POS tags shown. The differences are that SpaCy tags “IKEA” and “Älmhult” as proper nouns (PROPN) and “är” as an auxiliary (AUX), while Polyglot tags them as NOUN, NOUN, and VERB respectively. The NER results are equivalent: both recognize “IKEA” as an organization and “Älmhult” as a location. The only difference is the label format, with SpaCy using “ORG” and “LOC” and Polyglot using the prefixed forms “I-ORG” and “I-LOC.”
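To make this kind of comparison repeatable, the labels can be normalized and the tag lists compared programmatically. The sketch below works on the printed results for the first sentence, copied in as plain lists, so it runs without either library. The helper names `strip_iob_prefix` and `differing_positions` are hypothetical and written only for this illustration.

```python
def strip_iob_prefix(label):
    """Drop a leading "I-" or "B-" prefix from an entity label, if present."""
    prefix, sep, rest = label.partition("-")
    if sep and prefix in {"I", "B"}:
        return rest
    return label


def differing_positions(tags_a, tags_b):
    """Return the indices at which two equally long tag lists differ."""
    if len(tags_a) != len(tags_b):
        raise ValueError("tag lists must have the same length")
    return [i for i, (a, b) in enumerate(zip(tags_a, tags_b)) if a != b]


# Results for the first sentence, copied from the output shown above.
tokens = ["IKEA", "är", "ett", "svenskt", "möbelföretag",
          "med", "huvudkontor", "i", "Älmhult"]
spacy_tags = ["PROPN", "AUX", "DET", "ADJ", "NOUN",
              "ADP", "NOUN", "ADP", "PROPN"]
polyglot_tags = ["NOUN", "VERB", "DET", "ADJ", "NOUN",
                 "ADP", "NOUN", "ADP", "NOUN"]

differing = differing_positions(spacy_tags, polyglot_tags)
for i in differing:
    print(f"{tokens[i]}: SpaCy={spacy_tags[i]}, Polyglot={polyglot_tags[i]}")

# Normalizing Polyglot's labels makes the NER output directly comparable.
print(strip_iob_prefix("I-ORG"), strip_iob_prefix("I-LOC"))
```

Here the tags differ only for “IKEA”, “är”, and “Älmhult”, and the normalized Polyglot labels match SpaCy's “ORG” and “LOC”.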
Tokenization
Tokenization is the process of breaking down text into individual words, phrases, symbols, or other meaningful elements called tokens. I will use the same sample text from the previous section.
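As a point of reference, here is a deliberately naive tokenizer built on a regular expression. It is not how SpaCy or Polyglot tokenize; it is only a sketch of the basic idea of splitting words from punctuation, and `naive_tokenize` is a hypothetical name.

```python
import re

# A run of word characters, or a single non-space, non-word character.
# In Python 3, \w is Unicode-aware, so Swedish letters such as å, ä, ö match.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def naive_tokenize(text):
    """Split text into word and punctuation tokens with a simple regex."""
    return TOKEN_PATTERN.findall(text)


print(naive_tokenize("Stockholm är huvudstad i Sverige."))
# prints ['Stockholm', 'är', 'huvudstad', 'i', 'Sverige', '.']
```

Real tokenizers add language-specific rules on top of this, for example for abbreviations, numbers, and hyphenated compounds.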
1. Dataset
sample_data = [
    "IKEA är ett svenskt möbelföretag med huvudkontor i Älmhult.",
    "Stockholm är huvudstad i Sverige.",
    "Skåne ligger i södra Sverige och är känt för sina vackra landskap."
]
2. Tokenizing the Set
# Using SpaCy
spacy_tokens = [[token.text for token in nlp_spacy(text)] for text in sample_data]
# Using Polyglot
polyglot_tokens = [[word for word in Text(text, hint_language_code="sv").words] for text in sample_data]
for i, (spacy_token_list, polyglot_token_list) in enumerate(zip(spacy_tokens, polyglot_tokens)):
    print(f"Text {i + 1}:")
    print(f"SpaCy Tokens: {spacy_token_list}")
    print(f"Polyglot Tokens: {polyglot_token_list}\n")
3. Comparing the Results
Sample 1:
SpaCy Tokens: ['IKEA', 'är', 'ett', 'svenskt', 'möbelföretag', 'med', 'huvudkontor', 'i', 'Älmhult', '.']
Polyglot Tokens: ['IKEA', 'är', 'ett', 'svenskt', 'möbelföretag', 'med', 'huvudkontor', 'i', 'Älmhult', '.']
Sample 2:
SpaCy Tokens: ['Stockholm', 'är', 'huvudstad', 'i', 'Sverige', '.']
Polyglot Tokens: ['Stockholm', 'är', 'huvudstad', 'i', 'Sverige', '.']
Sample 3:
SpaCy Tokens: ['Skåne', 'ligger', 'i', 'södra', 'Sverige', 'och', 'är', 'känt', 'för', 'sina', 'vackra', 'landskap', '.']
Polyglot Tokens: ['Skåne', 'ligger', 'i', 'södra', 'Sverige', 'och', 'är', 'känt', 'för', 'sina', 'vackra', 'landskap', '.']
SpaCy and Polyglot produce very similar tokenization results for the Swedish text. In this specific example, there are no significant differences in how they handle punctuation, contractions, or other language-specific elements; however, results may vary in larger datasets.
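On larger datasets, comparing token lists by eye quickly becomes impractical. The sketch below shows one way to locate the first point where two tokenizations diverge. `first_mismatch` is a hypothetical helper, and the second example uses made-up token lists purely to show a mismatch.

```python
def first_mismatch(tokens_a, tokens_b):
    """Return the index of the first differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    # All shared positions match; a length difference is still a mismatch.
    if len(tokens_a) != len(tokens_b):
        return min(len(tokens_a), len(tokens_b))
    return None


# Token lists for the second sample sentence, copied from the output above.
spacy_example = ["Stockholm", "är", "huvudstad", "i", "Sverige", "."]
polyglot_example = ["Stockholm", "är", "huvudstad", "i", "Sverige", "."]
print(first_mismatch(spacy_example, polyglot_example))  # prints None

# Made-up lists: one tokenizer splits the hyphenated word, the other does not.
print(first_mismatch(["e-post", "."], ["e", "-", "post", "."]))  # prints 0
```

Running it over every pair in `spacy_tokens` and `polyglot_tokens` would flag only the texts where the two libraries disagree.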
Performance and Scalability
Performance refers to the speed at which each library can process text, while scalability indicates the ability to handle larger datasets without significantly impacting performance. I will use a new, larger sample dataset for this occasion.
sample_data = " ".join(["Jag älskar det här stället. Personalen är mycket vänlig. Det är en fruktansvärd upplevelse. Jag kommer aldrig tillbaka hit. Maten var okej, men servicen kunde varit bättre."] * 1000)
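Using Python's built-in time module, I can measure the processing time of both libraries.
import time
# Using SpaCy
start_time_spacy = time.perf_counter()
doc_spacy = nlp_spacy(sample_data)
end_time_spacy = time.perf_counter()
# Using Polyglot
start_time_polyglot = time.perf_counter()
doc_polyglot = Text(sample_data, hint_language_code="sv")
end_time_polyglot = time.perf_counter()
# Report the elapsed wall-clock time for each library
print(f"SpaCy processing time: {end_time_spacy - start_time_spacy} seconds")
print(f"Polyglot processing time: {end_time_polyglot - start_time_polyglot} seconds")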
Let’s compare the results for both libraries.
SpaCy processing time: 1.9483599662780762 seconds
Polyglot processing time: 3.369850158691406 seconds
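In this run, SpaCy was about 1.7 times faster than Polyglot (roughly 1.95 versus 3.37 seconds), which is consistent with its reputation for being optimized for processing large amounts of text. Absolute timings will vary with hardware and model versions.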
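A single timed run can be noisy, so a slightly more robust approach is to repeat the measurement and keep the best time. The sketch below is library-agnostic: `best_time` is a hypothetical helper, and `dummy_pipeline` is a stand-in workload so that the example runs without SpaCy or Polyglot installed.

```python
import time


def best_time(func, repeats=3):
    """Call func() `repeats` times and return the shortest elapsed time in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def dummy_pipeline():
    """Stand-in workload: split a repeated Swedish sentence on whitespace."""
    text = "Jag älskar det här stället. " * 1000
    return text.split()


elapsed = best_time(dummy_pipeline)
print(f"Best of 3 runs: {elapsed:.6f} seconds")
```

With the real libraries, the same helper could be called as `best_time(lambda: nlp_spacy(sample_data))`. For Polyglot, my understanding is that annotations such as `words` and `pos_tags` are computed on first access, so a fair comparison would probably need the timed function to read the attributes being compared rather than only construct the `Text` object.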
Conclusion
Both libraries offer valuable functionality for Natural Language Processing. Which one is better suited depends on the scope of the project. A summary of strengths and weaknesses:
SpaCy
Strengths:
High performance
Optimized for real-world applications
Wide range of features, including part-of-speech tagging, dependency parsing, named entity recognition, and text classification
Weaknesses:
Limited language support compared to Polyglot
No built-in sentiment analysis
Polyglot
Strengths:
Extensive language support
Features including tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis
Weaknesses:
May be slower than SpaCy when processing large amounts of text
May have slightly lower accuracy for some tasks