<<O>>  Difference Topic Galago (r1.30 - 23 Jul 2008 - HenryFeild)

META TOPICPARENT Software

Galago

Changed:
<
<
(UNDER CONSTRUCTION)
>
>
(UNDER CONSTRUCTION)

Created by Trevor Strohman
Line: 28 to 28

Added:
>
>

Line: 120 to 121

  • Pig-Galago (A patched version of com.yahoo.pig version 1.0 ONLY. Pig 1.2+ are not supported with TupleFlow).
  • edu.umass (Includes a distributed retrieval framework and custom operators built on top of Galago)
Changed:
<
<
The source code for each of these is maintained in a separate repository; however, there is an additional repository which stores the most recent Jar files, including libraries, from the other three. The latest version of any of these can be checked out using the following commands (you must have an account on Sydney to access these):
>
>
The source code for each of these is maintained in a separate repository; however, there is an additional repository which stores the most recent Jar files, including libraries, from new_galago and edu.umass. The latest version of any of these can be checked out using the following commands (you must have an account on Sydney to access these):

svn co svn+ssh://sydney.cs.umass.edu/home/hfeild/svn/galago/tags/latest galago
svn co svn+ssh://sydney.cs.umass.edu/home/hfeild/svn/pig_galago/tags/latest pig_galago
svn co svn+ssh://sydney.cs.umass.edu/home/hfeild/svn/edu.umass/tags/latest edu.umass
Changed:
<
<
svn co svn+ssh://sydney.cs.umass.edu/home/hfeild/svn/ciir_galago/tags/latest ciir_galago
>
>
svn co svn+ssh://sydney.cs.umass.edu/home/hfeild/svn/ciir_galago_bin/tags/latest ciir_galago_bin

Line: 152 to 153

Changed:
<
<
The ciir_galag package is just a directory of Jar files, so no building is necessary.
>
>
The ciir_galago_bin package has a directory of Jar files, so no building is necessary. It also has a directory of sample Galago parameter files and a scripts/ directory. See the README it contains for more information.

Next, you will need to set up your CLASSPATH environment variable so that Java will know where to find the Jar files you checked out.
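As a rough sketch of the CLASSPATH step, the helper below joins every Jar file in a directory into a colon-separated list. The function name build_classpath and the directory in the usage note are hypothetical; the README in ciir_galago_bin is the authority on where the Jar files actually live.

```shell
# Hypothetical helper: print a colon-separated list of every .jar in a directory.
build_classpath() {
    jar_dir="$1"
    cp=""
    for jar in "$jar_dir"/*.jar; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$jar" ] || continue
        # Prepend a ':' separator only when cp already holds an entry.
        cp="${cp:+$cp:}$jar"
    done
    printf '%s\n' "$cp"
}
```

Assuming the Jar files were checked out to a directory such as $HOME/ciir_galago_bin/jars (a placeholder path), it could be used as: CLASSPATH="$(build_classpath "$HOME/ciir_galago_bin/jars")"; export CLASSPATH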
Line: 1213 to 1216

Added:
>
>

Running TupleFlow

To run TupleFlow on the above parameter file (or on one that you have created), do one of the following:

Local (non-distributed):

mkdir /path/to/my/tmp/dir
java -Xmx900m galago.tupleflow.execution.JobExecutor \
      local paramFile.xml  /path/to/my/tmp/dir

DRMAA (distributed):

mkdir /path/to/my/tmp/dir
java -Xmx900m galago.tupleflow.execution.JobExecutor \
      drmaa paramFile.xml  /path/to/my/tmp/dir

If you are running in local mode, all logging messages and errors are printed to stderr on your screen. In distributed mode, however, these messages are written to files within the temporary directory (in the example above, these files are located in /path/to/my/tmp/dir/stderr/).

Galago keeps track of which stages fail and which succeed. A quick way to check is to run:

ls /path/to/my/tmp/dir/jobs/*/

This will list the jobs that each stage was broken into (for each stage, these are numbered from 0 to the maximum number of jobs a stage could be split into, based on the hashCount specified at the top of the parameter file). A job that completed successfully will have an additional file named with the job number followed by .complete. If the job failed, the file name will end in .error instead.
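The check above can be turned into a per-stage summary. This is a minimal sketch that assumes the marker files sit directly inside each jobs/<stage>/ directory, as the ls command suggests; the function names count_existing and summarize_jobs are made up for illustration.

```shell
# Hypothetical helper: count how many of the given paths actually exist.
count_existing() {
    n=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Hypothetical helper: print one line per stage with its .complete and .error counts.
summarize_jobs() {
    tmp_dir="$1"
    for stage in "$tmp_dir"/jobs/*/; do
        [ -d "$stage" ] || continue
        # "$stage" ends with '/', so the globs below expand inside the stage directory.
        complete=$(count_existing "$stage"*.complete)
        errors=$(count_existing "$stage"*.error)
        printf '%s complete=%s error=%s\n' "$(basename "$stage")" "$complete" "$errors"
    done
}
```

Calling summarize_jobs /path/to/my/tmp/dir prints one line per stage; any stage with a non-zero error count is worth looking up in the stderr/ directory.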

If you are able to fix the bug that caused the error, and your changes only affect the data flow from the point where the errors occurred onwards, you can pass the same temporary directory to TupleFlow again. It won't redo the stages that completed successfully, but will instead start at the failed stages.

However, if you do change something in your code that affects the flow of data from stages that have already finished successfully, or if you want a clean start, delete the contents of the temporary directory before running TupleFlow.
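For the clean-start case, one cautious way to empty the temporary directory is sketched below. The function name clean_tmp_dir is hypothetical. It removes and recreates the directory rather than deleting its contents in place, which also clears hidden files but resets the directory's permissions to your defaults.

```shell
# Hypothetical helper: empty a TupleFlow temporary directory before a fresh run.
clean_tmp_dir() {
    tmp_dir="$1"
    # Guard against an empty argument, a missing directory, or the filesystem root.
    if [ -z "$tmp_dir" ] || [ "$tmp_dir" = "/" ] || [ ! -d "$tmp_dir" ]; then
        echo "refusing to clean: '$tmp_dir' is not a usable directory" >&2
        return 1
    fi
    # Remove the directory and everything in it, then recreate it empty.
    rm -rf -- "$tmp_dir" && mkdir -p -- "$tmp_dir"
}
```

After clean_tmp_dir /path/to/my/tmp/dir, rerunning JobExecutor with the same directory starts every stage from scratch.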


Indexing on Sydney / Swarm

Line: 1224 to 1283

java -Xmx900m galago.tupleflow.execution.JobExecutor \
Changed:
<
<
local index_param_file.xml tmp_files/
>
>
local index_param_file.xml /path/to/tmp/dir/

Line: 1234 to 1293

java -Xmx900m galago.tupleflow.execution.JobExecutor \
Changed:
<
<
drmaa index_param_file.xml tmp_files/
>
>
drmaa index_param_file.xml /path/to/tmp/dir/

 <<O>>  Difference Topic Galago (r1.25 - 03 Jun 2008 - HenryFeild)

META TOPICPARENT Software
Changed:
<
<

Galago (UNDER CONSTRUCTION)

>
>

Galago

(UNDER CONSTRUCTION)

Created by Trevor Strohman
Line: 47 to 48

Galago is made up of several components. The most powerful component is TupleFlow. TupleFlow is a kind of MapReduce framework that allows

Changed:
<
<
for the stages of the Map-Reduce to have multiple inputs and outputs. Trevor
>
>
for the stages of the map-reduce to have multiple inputs and outputs. Trevor

describes TupleFlow as, "a mix between MapReduce, make/ant, and a database system. TupleFlow is like MapReduce in that it can efficiently parallelize a large computation. It is like make or ant in that it runs based on a file that
Line: 57 to 58

Galago's indexers are built on top of the TupleFlow framework. The indexers are made of several Java classes which can be used in TupleFlow stages; depending on which ones you combine, you can make a traditional indexer (i.e., one that produces
Changed:
<
<
an inverted index) or a query likelihood binned indexer, among other.
>
>
an inverted index) or a query likelihood binned indexer, among others.

These are examples of the type of applications that TupleFlow can enhance.
Changed:
<
<
Galago's third most useful component is the retrieval system.
>
>
Another useful component of Galago is the retrieval system.

This does not rely on TupleFlow, but it knows how to read and interact with the
Changed:
<
<
inverted indexes created by Galago's indexer. It is also easily extended and
>
>
inverted indexes created by Galago's indexers. It is also easily extended and

is a great way to prototype new retrieval operators.

There are other features that come with Trevor Strohman's release of Galago. However,

Line: 84 to 85

Trevor's Galago Branch

Changed:
<
<
Trevor's branch is available via this Git repository. To checkout a copy from here, you'll first need to get Git, which is available here. It is pretty easy to install.
>
>
Trevor's branch is available via this Git repository. To check out a copy, you'll first need to get Git, which is available here. It is pretty easy to install.

Once you have Git, you can download the current version of Galago by issuing the command:

Line: 92 to 93

git clone git://repo.or.cz/galago.git
Changed:
<
<
For information about what is included in Trevor's Galago branch and how to use it, see Galago Guidebook.
>
>
For information about what is included in Trevor's Galago branch and how to use it, see the Galago Guidebook.

<!-- That should create a directory called galago with all of the Galago files in it. Change directories to:

Line: 117 to 118

  • Galago (TupleFlow, TREC indexing classes, and a retrieval framework)
  • Pig-Galago (A patched version of com.yahoo.pig version 1.0 ONLY. Pig 1.2+ are not supported with TupleFlow).
Changed:
<
<
  • edu.umass (Included a distributed retrieval framework and custom operators built on top of Galago)
>
>
  • edu.umass (Includes a distributed retrieval framework and custom operators built on top of Galago)

The source code for each of these is maintained in a separate repository; however, there is an additional repository which stores the most recent Jar files, including libraries, from the other three. The latest version of any of these can be checked out using the following commands (you must have an account on Sydney to access these):

 <<O>>  Difference Topic Galago (r1.24 - 01 Jun 2008 - HenryFeild)

META TOPICPARENT Software