The goal of this project is to implement a basic content-based image retrieval (CBIR) system using Lucene.
The project is divided into the following Java applications, one for each part of the assignment (assignment.pdf).
LuceneApplication. Application to index and search documents in Lucene. It has two modes, each with its own arguments, as described below.
- Mode 1 (Indexer)
    - dataDir: The directory path with the data to be indexed.
    - indexDir: The directory path to store the indexed data.
- Mode 2 (Searcher)
    - indexDir: The directory path where the indexed data is stored.
    - word: The word to search for.
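The two modes could be dispatched from the command line roughly as follows. This is a minimal sketch, not the project's actual code: the class name `ModeDispatchSketch`, the assumption that the mode number is passed as the first argument, and the returned description strings are all hypothetical. A real implementation would call Lucene's indexing and search classes where the comments indicate.

```java
final class ModeDispatchSketch {

    /**
     * Validates the arguments and returns a description of the action to run.
     * Assumed (hypothetical) layout: args[0] is the mode, followed by two
     * mode-specific arguments.
     */
    static String dispatch(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                "usage: 1 <dataDir> <indexDir> | 2 <indexDir> <word>");
        }
        switch (args[0]) {
            case "1":
                // Indexer mode: a real implementation would index dataDir into indexDir here.
                return "index " + args[1] + " into " + args[2];
            case "2":
                // Searcher mode: a real implementation would query the index for the word here.
                return "search " + args[1] + " for " + args[2];
            default:
                throw new IllegalArgumentException("unknown mode: " + args[0]);
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println(dispatch(args));
        } catch (IllegalArgumentException e) {
            // Report bad input instead of crashing with a stack trace.
            System.err.println(e.getMessage());
        }
    }
}
```

Keeping the validation in a separate `dispatch` method makes the argument handling easy to check without touching an index on disk.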
OCRApplication. Application to obtain text from images using the Tesseract API. The extraction is optimized using layouts, and the result is indexed using the LuceneApplication described above.
- dataDir: The directory path to read images from.
- outDir: The directory path to store the text in JSON format.
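Storing extracted text as JSON can be illustrated with plain Java. The sketch below is an assumption about the general idea, not the project's serializer: the class `JsonRecordSketch` and the field names `file` and `text` are hypothetical, and a real application would more likely use a JSON library. It only handles flat string-to-string records.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class JsonRecordSketch {

    /** Escapes a string so it can be placed inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be written as \\uXXXX escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Serializes a flat map of string fields as a single JSON object. */
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // Hypothetical record: the source image name and the text recognized in it.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("file", "scan01.png");
        record.put("text", "First line\nSecond line");
        System.out.println(toJson(record));
    }
}
```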
ImageMetadataExtractor. Application to extract metadata from PNG images. The extracted metadata is stored in JSON format and indexed using LuceneApplication.
- dataDir: The directory path to read images from.
- outDir: The directory path to store the extracted metadata in JSON format.
ImageFeatureExtractor. Application to extract basic color features (histogram, mean and mode) from images. The extracted features are stored in JSON format and indexed using LuceneApplication.
- dataDir: The directory path to read images from.
- outDir: The directory path to store the extracted features in JSON format.
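The three color features can be computed with the standard library alone. The following is a minimal sketch under stated assumptions, not the project's implementation: the class `ColorFeatureSketch`, the choice of one 256-bin histogram per RGB channel, and the derivation of mean and mode from that histogram are all illustrative choices.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

final class ColorFeatureSketch {

    /** Returns hist[c][v]: the number of pixels whose channel c (0=R, 1=G, 2=B) has value v. */
    static int[][] histogram(BufferedImage img) {
        int[][] hist = new int[3][256];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y); // packed as 0xAARRGGBB
                hist[0][(rgb >> 16) & 0xFF]++;
                hist[1][(rgb >> 8) & 0xFF]++;
                hist[2][rgb & 0xFF]++;
            }
        }
        return hist;
    }

    /** Mean channel value, computed from that channel's histogram. */
    static double mean(int[] channelHist) {
        long sum = 0;
        long count = 0;
        for (int v = 0; v < channelHist.length; v++) {
            sum += (long) v * channelHist[v];
            count += channelHist[v];
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }

    /** Most frequent channel value; ties resolve to the smallest value. */
    static int mode(int[] channelHist) {
        int best = 0;
        for (int v = 1; v < channelHist.length; v++) {
            if (channelHist[v] > channelHist[best]) {
                best = v;
            }
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        BufferedImage img;
        if (args.length > 0) {
            img = ImageIO.read(new File(args[0]));
            if (img == null) {
                System.err.println("unsupported image format: " + args[0]);
                return;
            }
        } else {
            // No file given: fall back to a tiny synthetic image.
            img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
            img.setRGB(0, 0, 0xFF0000);
        }
        int[][] hist = histogram(img);
        String[] names = {"R", "G", "B"};
        for (int c = 0; c < 3; c++) {
            System.out.println(names[c] + ": mean=" + mean(hist[c]) + " mode=" + mode(hist[c]));
        }
    }
}
```

Deriving the mean and mode from the histogram avoids a second pass over the pixels, at the cost of fixing the resolution to 8 bits per channel.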
DICOMImagesExtraction. Application to extract text from the headers of images in DICOM format. The extracted metadata is stored in JSON format and indexed using LuceneApplication.
- dataDir: The directory path to read DICOM files from.
- outDir: The directory path to store the extracted metadata in JSON format.
SOLRApplication. Application to integrate the indexing of DICOM images and OCR using Apache Solr.
- collection: The name of the Solr collection.
- dataDir: The directory path with the source data.
Run config.py:
python config.py dataDir collection
Folders:
/LuceneApplication. Contains the source code to index text using Lucene.
/OCRApplication. Contains the source code to obtain text from images using Tesseract.
/FLickrImageExtraction.
/ImageMetadataExtractor. Contains the source code to extract metadata (using metadata-extractor).
/ImageFeatureExtractor. Contains the source code to extract basic shape and color features from Flickr images.
/DICOMImagesExtraction.
/SOLRApplication. Contains the source code to integrate the indexing of DICOM images and OCR using Solr.