
ir-course-uoi-data

The project for the Information Retrieval course at cse.uoi.gr implements a search engine for Wikipedia articles using Apache Lucene.

Article crawling is performed using crawl-wikipedia.py and is organized in two stages.

Plain text extraction from the HTML files is performed by preprocess.py, and the output text files are stored in the corpus/ directory. Because repository/ and corpus/ exceed 1 GB in size, the corpus/ directory has not been uploaded to git. The search engine itself is implemented in ir-course-uoi.
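As an illustration of the preprocessing step, here is a minimal sketch of HTML-to-text extraction using only the Python standard library. It is not the code in preprocess.py. The names `TextExtractor`, `html_to_text`, and `preprocess_directory` are hypothetical, and the sketch assumes the crawled pages are UTF-8 `*.html` files in a single flat directory.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self):
        # Join fragments with single spaces; markup structure is discarded.
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def preprocess_directory(src_dir, dst_dir):
    """Write one .txt file in dst_dir per *.html file in src_dir.

    Assumption: input files are UTF-8 and live directly in src_dir.
    Returns the number of files converted.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for html_file in sorted(src.glob("*.html")):
        text = html_to_text(html_file.read_text(encoding="utf-8"))
        (dst / (html_file.stem + ".txt")).write_text(text, encoding="utf-8")
        count += 1
    return count
```

Under these assumptions, a call such as `preprocess_directory("repository", "corpus")` would mirror the layout described above, provided repository/ is where the crawled HTML is kept. A real Wikipedia pipeline would likely also drop navigation, reference lists, and other page chrome, which this sketch does not attempt.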

Screenshots

scraping-statistics.png preprcessing-statistics.png

License

GNU GENERAL PUBLIC LICENSE Version 2, June 1991