
DaCy: A SpaCy NLP Pipeline for Danish


DaCy is a Danish preprocessing pipeline trained using SpaCy. At the time of writing it has achieved state-of-the-art performance on all benchmark tasks for Danish. This repository contains the code for reproducing DaCy as well as for downloading and loading the models. It also contains guides on how to use DaCy.

🔧 Installation

It is currently only possible to install DaCy directly from GitHub. This can be done easily using:

pip install git+https://github.com/KennethEnevoldsen/DaCy

👩‍💻 Usage

To use the model you first have to download either the medium or large model. To see a list of all available models:

import dacy
for model in dacy.models():
    print(model)
# da_dacy_small_tft-0.0.0
# da_dacy_medium_tft-0.0.0
# da_dacy_large_tft-0.0.0

To download and load a model simply execute:

nlp = dacy.load("da_dacy_medium_tft-0.0.0")

This will download the model to the .dacy directory in your home directory. To see where this is, you can always use:

dacy.where_is_my_dacy()

which for me will return '/Users/kenneth/.dacy'.

To download the model to a specific directory:

dacy.download_model("da_dacy_medium_tft-0.0.0", your_save_path)
nlp = dacy.load_model("da_dacy_medium_tft-0.0.0", your_save_path)
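
Once loaded, DaCy behaves like any other SpaCy pipeline. Below is a minimal usage sketch; the example sentence is illustrative and the attributes shown are the standard SpaCy ones:

doc = nlp("DaCy er en dansk NLP-pipeline bygget på SpaCy.")

# part-of-speech tags and dependency relations for each token
for token in doc:
    print(token.text, token.pos_, token.dep_)

# named entities found in the text
for ent in doc.ents:
    print(ent.text, ent.label_)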

DaCy also includes a set of Jupyter notebook tutorials. If you do not have Jupyter Notebook installed, instructions for installing and running it can be found here. All the tutorials are located in the tutorials folder, and each is also available on Google Colab.

Tutorial | Content | File name
Introduction | A simple introduction to SpaCy and DaCy. For more detailed instruction, I recommend the course by SpaCy themselves. | dacy-spacy-tutorial.ipynb
Sentiment | A simple introduction to the new sentiment features in DaCy. | dacy-sentiment.ipynb
Wrapping a fine-tuned Transformer | A guide on how to wrap an already fine-tuned transformer and add it to your SpaCy pipeline using DaCy helper functions. | dacy-wrapping-a-classification-transformer.ipynb

🦾 Performance and Training

The following table shows the performance on the DaNE dataset compared to other models. The highest scores are highlighted in bold and the second highest are underlined.

Want to learn more about how the model was trained? Check out this blog post.

Training and reproduction

The folder DaCy_training contains a SpaCy project which allows for reproduction of the results. This folder also includes the evaluation metrics on DaNE and scripts for downloading the required data. For more information, please see the training readme.
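
As a SpaCy project, the training pipeline is typically reproduced with the spacy project CLI. A minimal sketch, assuming the standard project layout; the workflow name below is a placeholder, as the actual names are defined in the project's project.yml:

cd DaCy_training
python -m spacy project assets   # download the assets (data) declared in project.yml
python -m spacy project run all  # 'all' is a placeholder workflow name; see project.yml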

🤔 Issues and Usage Q&A

To ask questions, report issues, or request features, please use the GitHub Issue Tracker. Questions related to SpaCy are kindly referred to the SpaCy GitHub or forum.

FAQ

Why don't the performance metrics match the performance metrics reported on the DaNLP GitHub? The performance metrics reported by DaNLP give the model the 'gold standard' tokenization of the dataset, as opposed to having the pipeline tokenize the text itself. This allows the models to be compared on an even footing, but it generally inflates the performance. DaCy, on the other hand, reports performance metrics using its own tokenization. This makes the results closer to what you would see on a real dataset, and reflects that tokenization influences your performance.
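
To illustrate the difference between the two evaluation setups, here is a minimal sketch using the standard SpaCy API; the sentence and tokens are illustrative:

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("da")
text = "Han bor i København."

# pipeline tokenization: the model segments the raw text itself,
# so tokenization errors propagate to downstream scores (DaCy's setup)
doc_pipeline = nlp(text)

# 'gold standard' tokenization: the tokens are supplied up front,
# so tokenization errors can never hurt downstream scores (DaNLP's setup)
doc_gold = Doc(nlp.vocab, words=["Han", "bor", "i", "København", "."])

print([t.text for t in doc_pipeline])
print([t.text for t in doc_gold])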

Acknowledgements

This is really an acknowledgement of great open-source software and contributors. This wouldn't have been possible without the work of the SpaCy team, who developed and integrated the software; Huggingface, for developing Transformers and making model sharing convenient; and BotXO, for training and sharing the Danish BERT model, with Malte Hojmark-Bertelsen making it easily available. DaNLP has made it extremely easy to get access to Danish resources to train on, has even supplied some of the tagged data themselves, and has done a great job of developing these datasets.

References

If you use this library in your research, please kindly cite:

@inproceedings{enevoldsen2020dacy,
    title={DaCy: A SpaCy NLP Pipeline for Danish},
    author={Enevoldsen, Kenneth},
    year={2021}
}

License

DaCy is released under the Apache License, Version 2.0. See the LICENSE file for more details.

Contact

To contact the author, feel free to use the application form on my website or contact me on social media. Note that for issues and bugs, please use the GitHub Issue Tracker.

Twitter: KCEnevoldsen | LinkedIn: KennethEnevoldsen

