
DaCy: A SpaCy NLP Pipeline for Danish


DaCy is a Danish preprocessing pipeline trained using SpaCy. At the time of writing it has achieved state-of-the-art performance on all benchmark tasks for Danish. This repository contains the code for reproducing DaCy as well as for downloading and loading the models. It also contains guides on how to use DaCy.

🔧 Installation

It is currently only possible to install DaCy directly from GitHub. This can be done easily using:

pip install git+https://github.com/KennethEnevoldsen/DaCy

👩‍💻 Usage

To use the model you first have to download either the medium or large model. To see a list of all available models:

import dacy
for model in dacy.models():
    print(model)
# da_dacy_small_tft-0.0.0
# da_dacy_medium_tft-0.0.0
# da_dacy_large_tft-0.0.0

To download and load a model simply execute:

nlp = dacy.load("da_dacy_medium_tft-0.0.0")

This will download the model to the .dacy directory in your home directory. To see where this is, you can always use:

dacy.where_is_my_dacy()

which for me will return '/Users/kenneth/.dacy'.

To download the model to a specific directory:

dacy.download_model("da_dacy_medium_tft-0.0.0", your_save_path)
nlp = dacy.load_model("da_dacy_medium_tft-0.0.0", your_save_path)
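
Once loaded, DaCy behaves like any other SpaCy pipeline. Below is a minimal usage sketch; the example sentence is illustrative and the attributes shown are the standard SpaCy ones:

doc = nlp("DaCy er en dansk NLP-pipeline bygget på SpaCy.")

# part-of-speech tags and dependency relations for each token
for token in doc:
    print(token.text, token.pos_, token.dep_)

# named entities found in the text
for ent in doc.ents:
    print(ent.text, ent.label_)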

DaCy also includes a set of Jupyter notebook tutorials. If you do not have Jupyter Notebook installed, instructions for installing and running it can be found here. All the tutorials are located in the tutorials folder, and each is also available on Google Colab.

Tutorial | Content | File name
Introduction | A simple introduction to SpaCy and DaCy. For more detailed instruction, I recommend the course by SpaCy themselves. | dacy-spacy-tutorial.ipynb
Sentiment | A simple introduction to the new sentiment features in DaCy. | dacy-sentiment.ipynb
Wrapping a fine-tuned Transformer | A guide on how to wrap an already fine-tuned transformer and add it to your SpaCy pipeline using DaCy helper functions. | dacy-wrapping-a-classification-transformer.ipynb

🦾 Performance and Training

The following table shows the performance on the DaNE dataset compared to other models. The highest scores are highlighted in bold and the second highest are underlined.

Want to learn more about how the model was trained? Check out this blog post.

Training and reproduction

The folder DaCy_training contains a SpaCy project which allows for reproduction of the results. This folder also includes the evaluation metrics on DaNE and scripts for downloading the required data. For more information, please see the training readme.
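
As a SpaCy project, the training pipeline is typically reproduced with the spacy project CLI. A minimal sketch, assuming the standard project layout; the workflow name below is a placeholder, as the actual names are defined in the project's project.yml:

cd DaCy_training
python -m spacy project assets   # download the assets (data) declared in project.yml
python -m spacy project run all  # 'all' is a placeholder workflow name; see project.yml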

🤔 Issues and Usage Q&A

To ask questions, report issues, or request features, please use the GitHub Issue Tracker. Questions related to SpaCy are kindly referred to the SpaCy GitHub or forum.

FAQ

Why don't the performance metrics match the performance metrics reported on the DaNLP GitHub? The performance metrics reported by DaNLP give the model the 'gold standard' tokenization of the dataset, as opposed to having the pipeline tokenize the text itself. This allows the models to be compared on an even footing, but it generally inflates the performance. DaCy, on the other hand, reports performance metrics using its own tokenization. This makes the results closer to what you would see on a real dataset, and reflects that tokenization influences your performance.
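
To illustrate the difference between the two evaluation setups, here is a minimal sketch using the standard SpaCy API; the sentence and tokens are illustrative:

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("da")
text = "Han bor i København."

# pipeline tokenization: the model segments the raw text itself,
# so tokenization errors propagate to downstream scores (DaCy's setup)
doc_pipeline = nlp(text)

# 'gold standard' tokenization: the tokens are supplied up front,
# so tokenization errors can never hurt downstream scores (DaNLP's setup)
doc_gold = Doc(nlp.vocab, words=["Han", "bor", "i", "København", "."])

print([t.text for t in doc_pipeline])
print([t.text for t in doc_gold])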

Acknowledgements

This is really an acknowledgement of great open-source software and contributors. This wouldn't have been possible without the work of the SpaCy team, who developed and integrated the software; Huggingface, for developing Transformers and making model sharing convenient; and BotXO, for training and sharing the Danish BERT model, with Malte Hojmark-Bertelsen making it easily available. DaNLP has made it extremely easy to get access to Danish resources to train on, has even supplied some of the tagged data themselves, and has done a great job of developing these datasets.

References

If you use this library in your research, please kindly cite:

@inproceedings{enevoldsen2020dacy,
    title={DaCy: A SpaCy NLP Pipeline for Danish},
    author={Enevoldsen, Kenneth},
    year={2021}
}

License

DaCy is released under the Apache License, Version 2.0. See the LICENSE file for more details.

Contact

To contact the author, feel free to use the application form on my website or contact me on social media. Note that for issues and bugs, please use the GitHub Issue Tracker.

Twitter: KCEnevoldsen | LinkedIn: KennethEnevoldsen

