Extract features from a corpus of ancient Greek texts to train machine learning classifiers that distinguish between different genres.
Code: Efthimios Tim Gianitsos
Analysis: Joseph P. Dexter, Pramit Chaudhuri, Thomas J. Bolt
Link to paper: https://www.aclweb.org/anthology/W19-2507/
"Stylometric Classification of Ancient Greek Literary Texts by Genre" LaTeCH-CLfL 2019
- Ensure you have
git
installed with version0 at least1.9
. - Navigate to the project directory. All commands must be run in the project directory.
cd <this project directory>
- This project requires a certain version of
Python
. The following command should output the version number.Determine whether this version is already installed. If you are usinggrep 'python_version' Pipfile | cut -f 2 -d '"'
bash
orzsh
, you can verify this with the following command:If the version ofcommand -v "python`grep 'python_version' Pipfile | cut -f 2 -d '"'`" &> /dev/null; if [[ $? -eq 0 ]]; then echo "$_ currently installed"; else echo "$_ NOT installed"; fi
Python
is not installed, you can install it here: https://www.python.org/downloads/. - Ensure
pipenv
1 is already installed. If you are usingbash
orzsh
, you can verify this with the following command:Ifcommand -v pipenv &> /dev/null; if [[ $? -eq 0 ]]; then echo "$_ currently installed"; else echo "$_ NOT installed"; fi
pipenv
is not installed, then install it with:pip3 install pipenv
- The following command will generate a virtual environment called
.venv/
in the current directory2 that will contain all3 thePython
dependencies for this project.PIPENV_VENV_IN_PROJECT=true pipenv install --dev
- Activate the virtual environment.
pipenv shell
Using exit
will exit the virtual environment i.e. it restores the system-level Python
configurations to your shell. Whenever you want to resume working on the project, run pipenv shell
while in the project directory to activate the virtual environment again.
When installing new dependencies, do not use pip install <dependency name>
. Instead, use pipenv install <dependency name>
. This is because pipenv
updates Pipfile
and Pipfile.lock
, but pip
does not. Having these files match the state of your virtual environment ensures that anyone else who starts the project will have a virtual environment that looks exactly like yours.
Use pipenv check
to ensure that the virtual environment is in a stable condition. If not, then run pipenv install --dev
while your virtual environment is activated. This should ensure that your project has all the necessary dependencies.
Extract features from all files:
python run_feature_extraction.py all_data.pickle
Extract features from only drama and epic files:
python run_feature_extraction.py drama_epic_data.pickle drama epic
Run all model analyzer functions on the data from all files to classify prose from verse:
python run_ml_analyzers.py all_data.pickle labels/prosody_labels.csv all
Run all model analyzer functions on the data from only drama and epic files to classify drama from epic:
python run_ml_analyzers.py drama_epic_data.pickle labels/genre_labels.csv all
0) The project uses the git
protocol to download the corpus. We make use of git
's sparse checkout and shallow clone features to download only what we need from the repository (this is done automatically in the code). We must have at least git
version 1.9 to perform a sparse checkout and shallow clone.↩
1) The pipenv
tool manages project dependencies. It works by making a project-specific directory called a virtual environment that holds the dependencies for that project. After a virtual environment is "activated", Python
commands will ignore the system-level Python
version & dependencies. Only the version & dependencies in the virtual environment will be recognized. Also, newly installed dependencies will automatically go into the virtual environment instead of being placed among your system-level Python
dependencies. This precludes the possiblity of different projects on the same system from having dependencies that conflict with one another. It also makes it easier to clean up after deleting a project: instead of remembering to uninstall several dependencies from your system, you can just delete the virtual environment.↩
2) Setting the PIPENV_VENV_IN_PROJECT
variable to true will indicate to pipenv
to make this virtual environment .venv/
within the project directory so that all the files corresponding to a project can be in the same place. This is not default behavior (e.g. on Mac, the environments will normally be placed in ~/.local/share/virtualenvs/
by default). Note that the virtual environment .venv/
should never be moved because some of the scripts it runs use absolute paths. Therefore, if the project directory that it is inside is ever renamed or moved, the virtual environment will no longer work correctly. This can be remedied by deleting the .venv/
and generating it again with PIPENV_VENV_IN_PROJECT=true pipenv install --dev
↩
3) Using --dev
ensures that even development dependencies will be installed (dev dependencies may include testing and linting frameworks which are not necessary for normal execution of the code). Pipfile.lock
specifies the dependencies and exact versions (for both dev dependencies and regular dependencies) for the virtual environment. After installation, you can find all dependencies in <path to virtual environment>/lib/python<python version>/site-packages/
. ↩