solr-wikipedia

💡 This repository was created during a hackathon. It is not actively developed or used at the moment.

A collection of utilities for parsing WikiMedia XML dumps with the intent of indexing the content in Solr.

Quick-Start

1. Download a Wikipedia dump file (http://en.wikipedia.org/wiki/Wikipedia:Database_download).
2. Download Solr 4.9 and extract it (http://lucene.apache.org/solr/).
3. Configure environment variables.

   Set SOLR_HOME to the "example" directory inside the location Solr was extracted to in Step 2, for example:

       export SOLR_HOME=/var/local/solr/example

   Set JAVA_HOME to the location of your JDK.
4. Clone and build the code.

       git clone https://github.com/bbende/solr-wikipedia.git
       cd solr-wikipedia
       mvn clean package -Pshade

5. Configure and start Solr.

   Run the deploy script, which copies src/main/resources/solr/wikipediaCollection to $SOLR_HOME/solr/, then start Solr:

       ./deploy-wikipedia-collection.sh
       src/main/resources/solr.sh start

   Check http://localhost:8983/solr in your browser.
6. Ingest data (from the solr-wikipedia directory).

       java -jar target/solr-wikipedia-1.0-SNAPSHOT.jar http://localhost:8983/solr/wikipediaCollection /var/local/test-wiki-data.xml.bz2
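After the ingest command finishes, one way to confirm that documents reached Solr is to ask the collection for its total document count. The following is a minimal sketch, not part of this repository: the class name `SolrCountCheck` is hypothetical, it assumes Java 11 or later (for `java.net.http`), and it assumes the collection exposes Solr's standard `/select` handler at the URL used in the ingest step.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of solr-wikipedia.
public class SolrCountCheck {

    // Builds a query that matches every document but returns zero rows,
    // so the response only carries the total hit count ("numFound").
    public static URI countQuery(String collectionUrl) {
        String base = collectionUrl.endsWith("/")
                ? collectionUrl.substring(0, collectionUrl.length() - 1)
                : collectionUrl;
        return URI.create(base + "/select?q=*:*&rows=0&wt=json");
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed default: the collection URL from the Quick-Start.
        String collectionUrl = args.length > 0
                ? args[0]
                : "http://localhost:8983/solr/wikipediaCollection";

        HttpRequest request = HttpRequest.newBuilder(countQuery(collectionUrl)).GET().build();
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // Look for "numFound" in the printed JSON body.
            System.out.println(response.body());
        } catch (IOException e) {
            System.err.println("Could not reach Solr: " + e.getMessage());
        }
    }
}
```

A `numFound` value greater than zero in the response suggests the ingest wrote documents to the collection.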

Overview

There are three main concepts:

Handlers - Receive events related to the WikiMedia XML and produce objects based on those events. The DefaultPageHandler produces Page objects, but clients could implement a custom handler to produce another type of object.
Parser - A SAX parser for the WikiMedia XML. Clients pass in a Reader for the XML and a handler to take action on events.
Iterator - An Iterator that uses StAX processing to produce objects based on the given handler.
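To illustrate the handler idea in isolation, here is a minimal sketch that uses only the JDK's SAX classes (`org.xml.sax.helpers.DefaultHandler`), not this project's handler interfaces. The class names `TitleHandlerSketch` and `TitleHandler` are hypothetical, and the XML is a tiny hand-written stand-in for a dump. It shows the general pattern of turning parse events into objects, here plain title strings.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration, not part of solr-wikipedia.
public class TitleHandlerSketch {

    // Receives SAX events and produces one String per <title> element.
    static class TitleHandler extends DefaultHandler {
        private final List<String> titles = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inTitle;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("title".equals(qName)) {
                inTitle = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several chunks, so accumulate.
            if (inTitle) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(text.toString());
                inTitle = false;
            }
        }

        List<String> getTitles() {
            return titles;
        }
    }

    public static List<String> parseTitles(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TitleHandler handler = new TitleHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getTitles();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>Apache Solr</title></page>"
                + "<page><title>Apache Lucene</title></page></mediawiki>";
        System.out.println(parseTitles(xml)); // prints [Apache Solr, Apache Lucene]
    }
}
```

A custom handler for this project would presumably follow the same shape, but against the project's own handler types rather than the JDK class used here.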

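An example of parsing a bzip2-compressed dump file:


    // Requires java.io.FileInputStream, java.io.InputStreamReader,
    // java.nio.charset.StandardCharsets, and Commons Compress's
    // BZip2CompressorInputStream, plus this project's parser and handler classes.
    String testWikiXmlFile = "src/test/resources/test-wiki-data.xml.bz2";

    WikiMediaXMLParser wikiMediaXMLParser = new SAXWikiMediaParser<>();
    PageHandler handler = new DefaultPageHandler();

    // Wikipedia dumps are UTF-8; set the charset explicitly rather than
    // relying on the platform default.
    try (FileInputStream fileIn = new FileInputStream(testWikiXmlFile);
         BZip2CompressorInputStream bzipIn = new BZip2CompressorInputStream(fileIn);
         InputStreamReader reader = new InputStreamReader(bzipIn, StandardCharsets.UTF_8)) {

        // The parser reads the XML and passes events to the handler.
        wikiMediaXMLParser.parse(reader, handler);
        ...
    }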

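An example of iterating over a bzip2-compressed dump file:


    String testWikiXmlFile = "src/test/resources/test-wiki-data.xml.bz2";

    // Wikipedia dumps are UTF-8; set the charset explicitly rather than
    // relying on the platform default.
    try (FileInputStream fileIn = new FileInputStream(testWikiXmlFile);
         BZip2CompressorInputStream bzipIn = new BZip2CompressorInputStream(fileIn);
         InputStreamReader reader = new InputStreamReader(bzipIn, StandardCharsets.UTF_8)) {

        PageHandler handler = new DefaultPageHandler();

        // Typed as Iterator<Page> so that next() returns a Page without a cast.
        Iterator<Page> iterator = new WikiMediaIterator<>(
                reader, handler);

        // Pages are produced one at a time as the stream is read.
        while (iterator.hasNext()) {
            Page page = iterator.next();
        }
    }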
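The pull-based style behind the iterator can be sketched with the JDK's StAX API alone. This is a hypothetical, self-contained illustration (the class `TitleIteratorSketch` is not part of this repository and yields plain title strings rather than Page objects). It shows how an `Iterator` can wrap an `XMLStreamReader` and read ahead by one element, so the caller never holds the whole document in memory.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration, not part of solr-wikipedia.
public class TitleIteratorSketch implements Iterator<String> {

    private final XMLStreamReader xml;
    private String next; // look-ahead value; null once the stream is exhausted

    public TitleIteratorSketch(Reader reader) throws XMLStreamException {
        this.xml = XMLInputFactory.newInstance().createXMLStreamReader(reader);
        this.next = advance();
    }

    // Pulls events until the next <title> start tag and returns its text,
    // or null when the document ends.
    private String advance() throws XMLStreamException {
        while (xml.hasNext()) {
            if (xml.next() == XMLStreamConstants.START_ELEMENT
                    && "title".equals(xml.getLocalName())) {
                return xml.getElementText();
            }
        }
        return null;
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String current = next;
        try {
            next = advance();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return current;
    }

    public static List<String> titlesOf(String xmlText) throws XMLStreamException {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new TitleIteratorSketch(new StringReader(xmlText));
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xmlText = "<mediawiki><page><title>Apache Solr</title></page>"
                + "<page><title>Apache Lucene</title></page></mediawiki>";
        System.out.println(titlesOf(xmlText)); // prints [Apache Solr, Apache Lucene]
    }
}
```

Reading one element ahead keeps `hasNext()` cheap and side-effect free, at the cost of parsing the first title in the constructor.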
