DaCy is a Danish preprocessing pipeline built on SpaCy. At the time of writing, it achieves state-of-the-art performance on all benchmark tasks for Danish. This repository contains the code for reproducing DaCy as well as for downloading and loading the models. It also contains guides on how to use DaCy.
It is currently only possible to download DaCy directly from GitHub, but this can be done quite easily using:
pip install git+https://github.com/KennethEnevoldsen/DaCy
To use the model, you first have to download either the medium or the large model. To see a list of all available models:
import dacy
for model in dacy.models():
print(model)
# da_dacy_small_tft-0.0.0
# da_dacy_medium_tft-0.0.0
# da_dacy_large_tft-0.0.0
To download and load a model simply execute:
nlp = dacy.load("da_dacy_medium_tft-0.0.0")
This will download the model to the .dacy
directory in your home directory. To see where this is, you can always use:
where_is_my_dacy()
which for me returns '/Users/kenneth/.dacy'.
To download the model to a specific directory:
dacy.download_model("da_dacy_medium_tft-0.0.0", your_save_path)
nlp = dacy.load_model("da_dacy_medium_tft-0.0.0", your_save_path)
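Conceptually, the loading step above boils down to resolving a model directory (the default `.dacy` cache in your home directory, or a user-supplied path) and downloading into it if the model is missing. A minimal sketch of that path-resolution logic (the helper name `resolve_model_dir` is hypothetical, not DaCy's actual internals):

```python
from pathlib import Path

# Default cache location, mirroring the ".dacy" directory described above.
DEFAULT_CACHE = Path.home() / ".dacy"

def resolve_model_dir(model_name, save_path=None):
    """Return the directory a model would be stored in: a user-supplied
    path if given, otherwise the .dacy cache in the home directory.
    (Hypothetical sketch; dacy.load/download_model may differ internally.)"""
    base = Path(save_path) if save_path is not None else DEFAULT_CACHE
    return base / model_name
```

For example, `resolve_model_dir("da_dacy_medium_tft-0.0.0")` points at the medium model inside `~/.dacy`, while passing a `save_path` redirects it to your chosen directory.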
DaCy also includes Jupyter notebook tutorials. If you do not have Jupyter Notebook installed, instructions for installing and running it can be found here. All the tutorials are located in the tutorials
folder.
Tutorial | Content | file name | Google Colab |
---|---|---|---|
Introduction | A simple introduction to SpaCy and DaCy. For a more detailed introduction, I recommend the course by SpaCy themselves. | dacy-spacy-tutorial.ipynb | |
Sentiment | A simple introduction to the new sentiment features in DaCy. | dacy-sentiment.ipynb | |
Wrapping a fine-tuned Transformer | A guide on how to wrap an already fine-tuned transformer and add it to your SpaCy pipeline using DaCy helper functions. | dacy-wrapping-a-classification-transformer.ipynb | |
The following table shows the performance on the DaNE dataset compared to other models. The highest scores are highlighted in bold and the second highest are underlined.
If you want to learn more about how the model was trained, check out this blog post.
The folder DaCy_training
contains a SpaCy project which allows for a reproduction of the results. This folder also includes the evaluation metrics on DaNE and scripts for downloading the required data. For more information, please see the training readme.
To ask questions, report issues or request features, please use the GitHub Issue Tracker. Questions related to SpaCy are kindly referred to the SpaCy GitHub or forum.
Why don't the performance metrics match those reported on the DaNLP GitHub? The metrics reported by DaNLP give each model the 'gold standard' tokenization of the dataset, as opposed to having the pipeline tokenize the text itself. This allows the models to be compared on even ground, but it inflates the performance in general. DaCy, on the other hand, reports performance metrics using its own tokenization. This makes the results closer to what you would see on a real dataset and reflects that tokenization influences your performance.
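The effect can be illustrated with a toy scoring sketch (purely illustrative; this is not DaCy's or DaNLP's actual evaluation code): when a model is handed the gold tokens, tokenization is free and cannot cost any score, but when the pipeline segments the raw text itself, every wrong split shows up in the metric.

```python
# Toy illustration of why scoring on gold tokens inflates results
# compared to scoring on the pipeline's own tokenization.

def char_spans(tokens, text):
    """Map each token to its (start, end) character span in `text`."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

def token_f1(pred_tokens, gold_tokens, text):
    """Token-level F1: a predicted token counts as correct only if its
    exact character span also appears in the gold segmentation."""
    pred = set(char_spans(pred_tokens, text))
    gold = set(char_spans(gold_tokens, text))
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

text = "Han bor i Kbh., ikke i Aarhus."
gold = ["Han", "bor", "i", "Kbh.", ",", "ikke", "i", "Aarhus", "."]

# Given gold tokens, the tokenization score is trivially perfect (1.0);
# a naive whitespace tokenizer pays for every punctuation token it merges.
naive = text.split()
print(token_f1(gold, gold, text))   # 1.0
print(token_f1(naive, gold, text))  # 0.625
```

The same gap propagates into downstream metrics such as POS or NER scores, which is why DaCy's self-tokenized numbers are lower but closer to real-world performance.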
This is really an acknowledgement of great open-source software and contributors. This wouldn't have been possible without the work by the SpaCy team, who developed and integrated the software; Huggingface, for developing Transformers and making model sharing convenient; BotXO, for training and sharing the Danish BERT model; and Malte Hojmark-Bertelsen, for making it easily available. DaNLP has made it extremely easy to get access to Danish resources to train on, has even supplied some of the tagged data themselves, and has done a great job of developing these datasets.
If you use this library in your research, please kindly cite:
@inproceedings{enevoldsen2020dacy,
title={DaCy: A SpaCy NLP Pipeline for Danish},
author={Enevoldsen, Kenneth},
year={2021}
}
DaCy is released under the Apache License, Version 2.0. See the LICENSE
file for more details.
To contact the author, feel free to use the contact form on my website or reach out on social media. Please note that for issues and bugs you should use the GitHub Issue Tracker.