Shawna

Polyglot vs. SpaCy for Natural Language Processing in Swedish



Overview

Hej och välkommen!

Two popular NLP libraries in the Python sphere are SpaCy and Polyglot. This tutorial compares the two on Swedish text, looking at their abilities in tokenization, part-of-speech tagging, and named entity recognition, as well as their processing speed.



 


Introduction to SpaCy and Polyglot

SpaCy is an open-source NLP library that provides a wide range of features, including part-of-speech tagging, dependency parsing, named entity recognition, and text classification. It is a powerful library designed for production environments. With its focus on performance and ease of use, SpaCy has become a popular choice for building applications that require sophisticated text processing, even for less-resourced languages like Swedish.

Polyglot, on the other hand, is an NLP library whose primary selling point is its extensive language coverage, with well over 130 supported languages in its arsenal, including Swedish. It offers features like tokenization, part-of-speech tagging, and named entity recognition. Unlike its SpaCy counterpart, Polyglot offers a sentiment analysis feature.

SpaCy

  • Developed by Explosion AI

  • Open-source software, licensed under the MIT License

  • Targets version 3.4

  • Commonly used for text preprocessing, named entity recognition, part-of-speech tagging, and dependency parsing

  • Supports over 65 languages

  • Known for its speed and efficiency. It is also widely used in industry and has a large community of contributors. The library is easy to use and provides an extensive set of features.


Polyglot

  • Developed by Rami Al-Rfou

  • Open-source software, licensed under the GPLv3 License

  • Targets version 16.07.04

  • Commonly used for language detection, named entity recognition, part-of-speech tagging, sentiment analysis, and word embeddings

  • Supports over 130 languages

  • Known for its wide range of language support. It also bundles features that many NLP libraries leave to separate packages, such as language detection and transliteration. The library is easy to use and provides extensive documentation.


 


Installation and Setup

Make sure you have Python 3 installed on your system. If you don’t have it, you can download it from the official Python website.

Installing SpaCy

pip install spacy


After installing SpaCy, download the Swedish language model.


python -m spacy download sv_core_news_sm


Installing Polyglot

pip install polyglot

Polyglot relies on some additional dependencies that need to be installed separately. Install the following packages:


pip install PyICU
pip install pycld2
pip install morfessor

Polyglot's Swedish models are also downloaded separately. The package names below follow Polyglot's naming scheme for its embeddings, NER, and POS models:

polyglot download embeddings2.sv ner2.sv pos2.sv


 

Named Entity Recognition & Part-of-Speech Tagging

To showcase the functionality of both SpaCy and Polyglot, I will perform text preprocessing, named entity recognition (NER), and part-of-speech (POS) tagging using both libraries and compare the results. For the sake of time, only a small sample dataset will be used.

1. Loading a Sample Dataset


sample_data = [
"IKEA är ett svenskt möbelföretag med huvudkontor i Älmhult.",
"Stockholm är huvudstad i Sverige.",
"Skåne ligger i södra Sverige och är känt för sina vackra landskap."
]

2. Preprocessing the Text Using SpaCy and Polyglot

Import the necessary modules and load the language model for both libraries.


import spacy
from polyglot.text import Text
nlp_spacy = spacy.load("sv_core_news_sm")

Preprocess the text using both libraries.

# Using SpaCy
spacy_docs = [nlp_spacy(text) for text in sample_data]
# Using Polyglot
polyglot_docs = [Text(text, hint_language_code="sv") for text in sample_data]


3. NER and POS tagging with SpaCy


for doc in spacy_docs:
  print(f"Text: {doc.text}")
  print("Named Entities:")
  # Entities found in this document
  for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
  print("Part-of-Speech Tags:")
  # One coarse-grained POS tag per token
  for token in doc:
    print(f"{token.text}: {token.pos_}")
  print("\n")

4. NER and POS tagging with Polyglot


for doc in polyglot_docs:
  print(f"Text: {doc.raw}")
  print("Named Entities:")
  # Each entity is a chunk of one or more words, so join them for display
  for entity in doc.entities:
    print(f"{' '.join(entity)}: {entity.tag}")
  print("Part-of-Speech Tags:")
  # doc.pos_tags yields (word, tag) pairs
  for word, tag in doc.pos_tags:
    print(f"{word}: {tag}")
  print("\n")

5. Comparing the Results

SpaCy

Text: IKEA är ett svenskt möbelföretag med huvudkontor i Älmhult.
Named Entities:
IKEA: ORG
Älmhult: LOC
Part-of-Speech Tags:
IKEA: PROPN
är: AUX
ett: DET
svenskt: ADJ
möbelföretag: NOUN
med: ADP
huvudkontor: NOUN
i: ADP
Älmhult: PROPN

Polyglot


Text: IKEA är ett svenskt möbelföretag med huvudkontor i Älmhult.
Named Entities:
IKEA: I-ORG
Älmhult: I-LOC
Part-of-Speech Tags:
IKEA: NOUN
är: VERB
ett: DET
svenskt: ADJ
möbelföretag: NOUN
med: ADP
huvudkontor: NOUN
i: ADP
Älmhult: NOUN

SpaCy and Polyglot produce the same NER results for this sentence, recognizing “IKEA” as an organization and “Älmhult” as a location. The only difference is notation: SpaCy uses the labels “ORG” and “LOC”, while Polyglot uses the prefixed forms “I-ORG” and “I-LOC.” The POS tags agree on six of the nine tokens shown. SpaCy labels “IKEA” and “Älmhult” as proper nouns (PROPN) and “är” as an auxiliary (AUX), while Polyglot tags them NOUN and VERB.
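
To make that comparison concrete, here is a minimal sketch in plain Python. It works on tags copied by hand from the two listings above and does not call either library. The helper names `strip_iob_prefix` and `pos_disagreements` are made up for this illustration.

```python
# Tags copied by hand from the SpaCy and Polyglot listings above
# (first sample sentence only); neither library is called here.
spacy_pos = [
    ("IKEA", "PROPN"), ("är", "AUX"), ("ett", "DET"), ("svenskt", "ADJ"),
    ("möbelföretag", "NOUN"), ("med", "ADP"), ("huvudkontor", "NOUN"),
    ("i", "ADP"), ("Älmhult", "PROPN"),
]
polyglot_pos = [
    ("IKEA", "NOUN"), ("är", "VERB"), ("ett", "DET"), ("svenskt", "ADJ"),
    ("möbelföretag", "NOUN"), ("med", "ADP"), ("huvudkontor", "NOUN"),
    ("i", "ADP"), ("Älmhult", "NOUN"),
]
spacy_ner = {"IKEA": "ORG", "Älmhult": "LOC"}
polyglot_ner = {"IKEA": "I-ORG", "Älmhult": "I-LOC"}


def strip_iob_prefix(tag):
    """Drop a leading 'I-' or 'B-' so that 'I-ORG' compares equal to 'ORG'."""
    prefix, sep, rest = tag.partition("-")
    return rest if sep and prefix in ("I", "B") else tag


def pos_disagreements(tags_a, tags_b):
    """Return (word, tag_a, tag_b) for every position where the tags differ."""
    return [
        (word_a, tag_a, tag_b)
        for (word_a, tag_a), (_, tag_b) in zip(tags_a, tags_b)
        if tag_a != tag_b
    ]


normalized_ner = {word: strip_iob_prefix(tag) for word, tag in polyglot_ner.items()}
ner_matches = normalized_ner == spacy_ner
diffs = pos_disagreements(spacy_pos, polyglot_pos)

print(f"NER labels match after normalization: {ner_matches}")
print(f"POS tags differ on {len(diffs)} of {len(spacy_pos)} tokens:")
for word, tag_a, tag_b in diffs:
    print(f"  {word}: SpaCy={tag_a}, Polyglot={tag_b}")
```

The same two helpers could be pointed at tags collected from the real loops in steps 3 and 4, as long as both libraries tokenize the sentence identically so the positions line up.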


 

Tokenization

Tokenization is the process of breaking down text into individual words, phrases, symbols, or other meaningful elements called tokens. I will use the same sample text from the previous section.
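
As a rough illustration of the idea, the sketch below tokenizes a sentence with a single regular expression. This is a deliberately naive baseline and is not how SpaCy or Polyglot work internally; the names `TOKEN_PATTERN` and `naive_tokenize` are invented for this example.

```python
import re

# Deliberately naive rule: a token is either a run of word characters
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def naive_tokenize(text):
    """Split text into word and punctuation tokens with one regular expression."""
    return TOKEN_PATTERN.findall(text)


tokens = naive_tokenize("Stockholm är huvudstad i Sverige.")
print(tokens)
```

A rule this simple splits every punctuation mark into its own token, so it breaks abbreviations and similar cases apart. Handling those well is one reason to use a library tokenizer instead.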

1. Dataset


sample_data = [
"IKEA är ett svenskt möbelföretag med huvudkontor i Älmhult.",
"Stockholm är huvudstad i Sverige.",
"Skåne ligger i södra Sverige och är känt för sina vackra landskap."
]

2. Tokenizing the Set


# Using SpaCy

spacy_tokens = [[token.text for token in nlp_spacy(text)] for text in sample_data]

# Using Polyglot

polyglot_tokens = [[word for word in Text(text, hint_language_code="sv").words] for text in sample_data]

for i, (spacy_token_list, polyglot_token_list) in enumerate(zip(spacy_tokens, polyglot_tokens)):
  print(f"Sample {i + 1}:")
  print(f"SpaCy Tokens: {spacy_token_list}")
  print(f"Polyglot Tokens: {polyglot_token_list}\n")

3. Comparing the Results


Sample 1:
SpaCy Tokens: ['IKEA', 'är', 'ett', 'svenskt', 'möbelföretag', 'med', 'huvudkontor', 'i', 'Älmhult', '.']
Polyglot Tokens: ['IKEA', 'är', 'ett', 'svenskt', 'möbelföretag', 'med', 'huvudkontor', 'i', 'Älmhult', '.']

Sample 2:
SpaCy Tokens: ['Stockholm', 'är', 'huvudstad', 'i', 'Sverige', '.']
Polyglot Tokens: ['Stockholm', 'är', 'huvudstad', 'i', 'Sverige', '.']

Sample 3:
SpaCy Tokens: ['Skåne', 'ligger', 'i', 'södra', 'Sverige', 'och', 'är', 'känt', 'för', 'sina', 'vackra', 'landskap', '.']
Polyglot Tokens: ['Skåne', 'ligger', 'i', 'södra', 'Sverige', 'och', 'är', 'känt', 'för', 'sina', 'vackra', 'landskap', '.']

SpaCy and Polyglot produce very similar tokenization results for the Swedish text. In this specific example, there are no significant differences in how they handle punctuation, contractions, or other language-specific elements; however, results may vary in larger datasets.
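
On a larger dataset it is impractical to compare token lists by eye. One possible approach, sketched here with Python's standard `difflib` module, is to report only the spans where two token lists diverge. The function name `token_differences` and the second pair of token lists are hypothetical and were not produced by either library.

```python
from difflib import SequenceMatcher


def token_differences(tokens_a, tokens_b):
    """Return (tokens_a_slice, tokens_b_slice) pairs where the two lists diverge."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tokens_a[i1:i2], tokens_b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Identical lists, as in the three samples above, give no differences.
same = ["Stockholm", "är", "huvudstad", "i", "Sverige", "."]
print(token_differences(same, list(same)))

# Hypothetical outputs for a sentence containing an abbreviation; these
# two lists are invented for illustration only.
tokenizer_a = ["Vi", "säljer", "t.ex.", "möbler", "."]
tokenizer_b = ["Vi", "säljer", "t", ".", "ex", ".", "möbler", "."]
print(token_differences(tokenizer_a, tokenizer_b))
```

Passing `spacy_token_list` and `polyglot_token_list` from the loop in step 2 to this function would flag any sentence where the two tokenizers disagree.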


 

Performance and Scalability

Performance refers to the speed at which each library can process text, while scalability indicates the ability to handle larger datasets without significantly impacting performance. I will use a new, larger sample dataset for this occasion.

sample_data = " ".join(["Jag älskar det här stället. Personalen är mycket vänlig. Det är en fruktansvärd upplevelse. Jag kommer aldrig tillbaka hit. Maten var okej, men servicen kunde varit bättre."] * 1000)

Using Python's built-in time module, I can measure the processing time of both libraries.


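import time
# Using SpaCy
start_time_spacy = time.time()
doc_spacy = nlp_spacy(sample_data)
end_time_spacy = time.time()
print(f"SpaCy processing time: {end_time_spacy - start_time_spacy} seconds")
# Using Polyglot
# Note: Polyglot's Text may defer work until attributes such as .words or
# .pos_tags are accessed; access them inside the timed region to include them.
start_time_polyglot = time.time()
doc_polyglot = Text(sample_data, hint_language_code="sv")
end_time_polyglot = time.time()
print(f"Polyglot processing time: {end_time_polyglot - start_time_polyglot} seconds")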

Let’s compare the results for both libraries.


SpaCy processing time: 1.9483599662780762 seconds
Polyglot processing time: 3.369850158691406 seconds

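SpaCy finished in about 1.9 seconds and Polyglot in about 3.4, so SpaCy was roughly 1.7 times faster on this input. This is a single run on one machine, but it is consistent with SpaCy's reputation for being optimized for speed when processing large amounts of text.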
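
A single timed run can be noisy. One common way to get a steadier number is to repeat the call several times and keep the fastest result, as in the sketch below. The names `best_time` and `stand_in_pipeline` are assumptions made for this example, and the stand-in workload only splits on whitespace so that the snippet runs without either library installed.

```python
import time


def best_time(func, *args, repeats=5):
    """Call func(*args) `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


def stand_in_pipeline(text):
    """Placeholder workload; swap in a call that runs the real library pipeline."""
    return text.split()


sample = " ".join(["Jag älskar det här stället."] * 1000)
elapsed = best_time(stand_in_pipeline, sample)
print(f"Best of 5 runs: {elapsed:.6f} seconds")
```

To time the real libraries, `stand_in_pipeline` would be replaced by a function that runs the SpaCy pipeline or builds a Polyglot `Text` and reads the attributes of interest.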


 

Conclusion

Both libraries offer valuable functionality for Natural Language Processing. Which one suits you best depends on the scope of the project. A summary of strengths and weaknesses:

SpaCy

Strengths:

  • High performance

  • Optimized for real-world applications

  • Wide range of features, including: part-of-speech tagging, dependency parsing, named entity recognition, and text classification.


Weaknesses:

  • Limited language support compared to Polyglot

  • No built-in sentiment analysis.


Polyglot

Strengths:

  • Extensive language support

  • Features including: tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis.


Weaknesses:

  • May be slower than SpaCy when processing large amounts of text

  • May have slightly lower accuracy for some tasks.

