Difference between revisions of "Overview"

From DocDataFlow

Revision as of 20:14, 26 December 2013

Overview

Crawler is designed along principles similar to those of the Data Flow Programming paradigm (http://en.wikipedia.org/wiki/Dataflow_programming).

On its own, Crawler does not perform any useful function. To become usable, it must be extended with a personality. The personality determines what function Crawler will perform.

Personality

One of the high-level components in a Crawler-based system is called a personality.

A personality is a Crawler component that takes input data in one form and processes it into output data in another form.

A few examples:

  • InDesign-to-XHTML/CSS: takes in InDesign documents or books and outputs XHTML/CSS files.
  • InDesign-to-EPUB: takes in InDesign documents or books, outputs EPUB.
  • InDesign-to-Database input: takes in InDesign documents or books, and updates a database with information extracted from the document(s).

Personalities are constructed out of simpler elements.

A personality is composed of:

During processing, the input document(s) are pushed through the network of adapters provided by the personality; data flows in and out of the adapters. A personality is somewhat reminiscent of a Rube Goldberg machine.

If the personality were a hive, then the adapters would be the worker bees.

The initial adapters process the document and take it apart into ever smaller chunks of data, or collate smaller chunks back into larger chunks. These chunks of data are called granules.

Some adapters take in larger granules and split them up into smaller granules (e.g. they might take in a paragraph granule and split it into individual word granules).

Other adapters collate smaller granules back into larger granules (e.g. they might take a number of word granules and concatenate them back into a paragraph granule).
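The splitting and collating described above can be sketched in a few lines of plain JavaScript. This is only an illustration: the granule shape ({ type, text }) and the function names splitParagraph and collateWords are assumptions made for this example, not part of Crawler's actual adapter interface.

```javascript
// Hypothetical granule shape: { type: string, text: string }.
// Crawler's real granule and adapter interfaces are not shown on this page.

// Splitting adapter: one paragraph granule in, many word granules out.
function splitParagraph(granule) {
  return granule.text
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((word) => ({ type: "word", text: word }));
}

// Collating adapter: many word granules in, one paragraph granule out.
function collateWords(granules) {
  return {
    type: "paragraph",
    text: granules.map((g) => g.text).join(" "),
  };
}

const sampleParagraph = { type: "paragraph", text: "Granules flow between adapters" };
const wordGranules = splitParagraph(sampleParagraph);
const rebuiltParagraph = collateWords(wordGranules);

console.log(wordGranules.length);   // 4
console.log(rebuiltParagraph.text); // Granules flow between adapters
```

Keeping each adapter a small function from granules to granules is one way to make them easy to chain; the real adapters may be structured differently.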

Some adapters perform some kind of processing on the granules they receive; they might change them in some way, discard them, count them, reorder them, create new granules based on previous granules...

Other adapters construct new granules based on template snippets. For example, an adapter could take in raw text and combine it with a template snippet to produce an XML-formatted granule.
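A template-based adapter of this kind might look roughly like the sketch below. The $$TEXT$$ marker, the snippet itself, and the names escapeXml and applyTemplate are all assumptions chosen for illustration; the snippet syntax that Crawler really uses is not described here.

```javascript
// Hypothetical template snippet; the $$TEXT$$ marker syntax is an assumption.
const PARAGRAPH_SNIPPET = "<p>$$TEXT$$</p>";

// Escapes the characters that are special in XML text content.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Combines a raw-text granule with a snippet into a new XML granule.
// A replacer function is used so "$" in the text is never treated specially.
function applyTemplate(granule, snippet) {
  return {
    type: "xml",
    text: snippet.replace("$$TEXT$$", () => escapeXml(granule.text)),
  };
}

const xmlGranule = applyTemplate(
  { type: "text", text: "Fish & chips" },
  PARAGRAPH_SNIPPET
);
console.log(xmlGranule.text); // <p>Fish &amp; chips</p>
```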

The general idea is that the input data is broken apart into smaller entities, and then these smaller entities are put back together again in a different shape, possibly performing a document conversion in the process.
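As a rough end-to-end illustration of this idea, the sketch below pushes a single input granule through a chain of three adapters. It simplifies the network of adapters to a linear chain, and every name in it (runPersonality, splitIntoWords, upperCaseWords, joinIntoSentence) is hypothetical rather than taken from Crawler.

```javascript
// Each hypothetical adapter maps a list of granules to a new list of granules.

// Breaks every incoming granule apart into word granules.
const splitIntoWords = (granules) =>
  granules.flatMap((g) =>
    g.text
      .split(/\s+/)
      .filter(Boolean)
      .map((w) => ({ type: "word", text: w }))
  );

// Processes granules in place of the originals (here: upper-casing the text).
const upperCaseWords = (granules) =>
  granules.map((g) => ({ ...g, text: g.text.toUpperCase() }));

// Puts the small granules back together in a different shape.
const joinIntoSentence = (granules) => [
  { type: "sentence", text: granules.map((g) => g.text).join(" ") },
];

// Pushes the input through the adapters in order.
function runPersonality(adapters, input) {
  return adapters.reduce((granules, adapter) => adapter(granules), input);
}

const pipelineOutput = runPersonality(
  [splitIntoWords, upperCaseWords, joinIntoSentence],
  [{ type: "document", text: "broken apart and  rebuilt" }]
);
console.log(pipelineOutput[0].text); // BROKEN APART AND REBUILT
```

A real personality presumably wires adapters into a richer network than a simple list, but the flow of granules from one adapter to the next is the same basic pattern.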

Formula Files

Formula files in Crawler are JavaScript-like files in which placeholders can be defined as JavaScript functions.
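The actual formula-file syntax is not shown on this page, so the following is only a guess at the general idea, written in plain JavaScript: placeholders are looked up by name, and each one is a function that computes its replacement text. The $$NAME$$ marker style, the placeholders table, and the expandPlaceholders helper are all assumptions made for this sketch.

```javascript
// Hypothetical placeholder table: each placeholder name maps to a function
// that computes its replacement text from a context object.
const placeholders = {
  TITLE: (context) => context.title.toUpperCase(),
  WORD_COUNT: (context) =>
    String(context.body.split(/\s+/).filter(Boolean).length),
};

// Replaces every $$NAME$$ marker in the snippet with the result of calling
// the matching placeholder function; unknown markers are left untouched.
function expandPlaceholders(snippet, context) {
  return snippet.replace(/\$\$([A-Z_]+)\$\$/g, (marker, name) =>
    Object.prototype.hasOwnProperty.call(placeholders, name)
      ? placeholders[name](context)
      : marker
  );
}

const expandedSnippet = expandPlaceholders(
  "<h1>$$TITLE$$</h1><p>$$WORD_COUNT$$ words</p>$$UNKNOWN$$",
  { title: "Overview", body: "one two three" }
);
console.log(expandedSnippet);
// <h1>OVERVIEW</h1><p>3 words</p>$$UNKNOWN$$
```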