Markdown (DocDataFlow wiki, last edited 2014-06-05 by Kris)
<hr />
<div>''Crawler.ID2MD'' is a Crawler-based software product that provides InDesign to Markdown export. ''Crawler.ID2MD'' is a temporary name; we will eventually come up with a better name.<br />
<br />
The Markdown syntax definitions can be found here:<br />
<br />
http://daringfireball.net/projects/markdown/syntax <br />
<br />
A .zip file with some sample documents and the Markdown conversions can be downloaded here:<br />
<br />
[http://www.docdataflow.com/wiki/images/2/2f/Crawler.ID2MD.Samples.zip Crawler.ID2MD.Samples.zip]<br />
<br />
[[File:Screen Shot 2014-06-06 at 12.03.49 AM.png]]<br />
<br />
An early product preview can be made available upon request. Email [mailto:dev@rorohiko.com dev@rorohiko.com] if you want to test it out.<br />
<br />
To install, first decompress the .zip archive; inside you will find a folder ''Crawler.ID2MD'' which contains a file called ''Export.jsxbin''.<br />
<br />
''Export.jsxbin'' is the script you'll need to run to activate Crawler.ID2MD.<br />
<br />
== Running ==<br />
<br />
Open an InDesign document, and double-click ''Export.jsxbin'' on the ''Scripts'' panel.<br />
<br />
A new file with the same name as the original document should appear. The file name extension of the new file is ''.md'' by default. If your InDesign document is called ''MyDocument.indd'' you should see ''MyDocument.md'' appear next to it.<br />
<br />
If the InDesign document has graphics in it, these will be exported into a separate folder called ''images'' (by default).<br />
<br />
== Configuration ==<br />
<br />
''Crawler.ID2MD'' can be configured through a number of configuration files. <br />
<br />
=== config.ini ===<br />
<br />
There are two files named ''config.ini''. One resides next to ''Export.jsxbin'' and configures the Crawler system; you will rarely, if ever, need to change this file. <br />
<br />
The second ''config.ini'' resides in the ''Personalities/Markdown'' subfolder and configures the Markdown features; this is where most of the configuration is done.<br />
<br />
These ''config.ini'' files can be opened with a standard text editor (e.g. TextWrangler on Macintosh: http://www.barebones.com/products/textwrangler/ or Notepad++ on Windows: http://notepad-plus-plus.org/download/ ).<br />
<br />
One trick to quickly navigate to the ''config.ini'' file is to disclose it on the ''Scripts'' panel, right-click or <Control>-click it, and select 'Reveal in Finder' or 'Reveal in Explorer'.<br />
<br />
[[File:Install06.png|330px]]<br />
<br />
Once you see the file, use a text editor to open it.<br />
<br />
The most relevant configurations are described further down.<br />
<br />
=== Snippets ===<br />
<br />
The folder ''Personalities/Markdown/markdownSnippets'' contains a number of template files that are used to generate the Markdown output.<br />
<br />
These files can all be opened with a standard text editor.<br />
<br />
For example, ''document.md.snippet'' contains:<br />
<br />
[//]: # (Document markdown file $$MARKDOWN_FILENAME$$ generated by $$SOFTWARE$$ $$VERSION$$)<br />
$$INPUT_TEXT$$<br />
<br />
Any text between two $$ markers is a placeholder which will be replaced by a calculated string.<br />
<br />
The various snippets represent different concentric layers of complexity. At the innermost level is ''text.run.md.snippet''. This snippet is used to format individual style runs from the original document. <br />
<br />
The next snippet ''text.paragraph.md.snippet'' will take the collated/concatenated output from the ''text.run.md.snippet'' snippet. <br />
<br />
The output of each type of snippet is concatenated, and 'handed up' to the next level of snippet, until eventually, the output is passed through the ''document.md.snippet'' where it gets its final shape.<br />
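As a mental model, the substitution step can be sketched in a few lines of JavaScript. This is not the actual Crawler implementation; the function name and the behavior for unknown placeholders are assumptions made for illustration.<br />

```javascript
// Replace every $$NAME$$ placeholder in a snippet template with the
// corresponding value from a lookup table. Unknown placeholders are
// left untouched so they stay visible in the output for debugging.
function expandSnippet(template, values) {
    return template.replace(/\$\$([A-Z0-9_]+)\$\$/g, function (match, name) {
        return values.hasOwnProperty(name) ? values[name] : match;
    });
}

var md = expandSnippet(
    "[//]: # (Generated by $$SOFTWARE$$ $$VERSION$$)\n$$INPUT_TEXT$$",
    { SOFTWARE: "Crawler.ID2MD", VERSION: "1.0", INPUT_TEXT: "Hello" });
```

Each layer of snippets can then be expanded with its own value table, and the result handed up to the next layer.<br />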
<br />
==== text.run.md.snippet ====<br />
<br />
$$TEXT_RUN_STYLE_FORMAT_PREFIX$$$$INPUT_TEXT$$$$TEXT_RUN_STYLE_FORMAT_SUFFIX$$<br />
<br />
The expressions $$TEXT_RUN_STYLE_FORMAT_PREFIX$$ and $$TEXT_RUN_STYLE_FORMAT_SUFFIX$$ are normally replaced with nothing, except when the style run is bold and/or italic. <br />
<br />
Bold style runs are prefixed and suffixed with '**', italic style runs are prefixed and suffixed with '_', and bold italic style runs are prefixed with '_**' and suffixed with '**_'.<br />
<br />
$$INPUT_TEXT$$ is replaced by the text of the style run extracted from the original document. This text will also have some special characters escaped, e.g. ! is converted to \!, # is converted to \# and so on, to make sure these characters are not mistaken for Markdown syntax.<br />
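Putting the prefix/suffix choice and the escaping together, formatting a single style run can be sketched as follows. This is an illustrative model, not Crawler's actual code; the function name and parameters are invented.<br />

```javascript
// Choose the Markdown prefix/suffix pair for a style run and escape
// the characters that would otherwise be read as Markdown syntax.
function formatRun(text, isBold, isItalic) {
    var escaped = text.replace(/([!#*_])/g, "\\$1");
    var prefix = "", suffix = "";
    if (isBold && isItalic) { prefix = "_**"; suffix = "**_"; }
    else if (isBold)        { prefix = "**";  suffix = "**";  }
    else if (isItalic)      { prefix = "_";   suffix = "_";   }
    return prefix + escaped + suffix;
}
```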
<br />
==== text.paragraph.md.snippet ====<br />
<br />
$$HEADING_PREFIX$$$$LINE_BROKEN_INPUT_TEXT$$<br />
<br />
This snippet is 'one level up' from the previous one. $$LINE_BROKEN_INPUT_TEXT$$ is replaced by the collated text of all the style runs in the paragraph, after which 'soft returns' are prefixed with two extra spaces, so Markdown preserves soft line breaks.<br />
<br />
$$LINE_BROKEN_INPUT_TEXT$$ is calculated by a formula based on $$INPUT_TEXT$$, where $$INPUT_TEXT$$ is simply the concatenated/collated text of all text runs in the paragraph. The formula can be found in ''Personalities/Markdown/formulas/paragraph.jsx.snippet'':<br />
<br />
<nowiki>// ****************<br />
<br />
$$LINE_BROKEN_INPUT_TEXT$$ =<br />
{<br />
    var retVal = null;<br />
    do<br />
    {<br />
        var inputText = $$INPUT_TEXT$$;<br />
        if (! inputText || inputText == "")<br />
        {<br />
            break;<br />
        }<br />
<br />
        // Trim leading whitespace<br />
        retVal = inputText.replace(/^\s+/,"");<br />
        // Strip whitespace that follows a line break<br />
        retVal = retVal.replace(/([\n\r])\s+/g,"$1");<br />
        // Put two spaces in front of each line break (Markdown hard break)<br />
        retVal = retVal.replace(/\s*([\n\r])/g,"  $1");<br />
    }<br />
    while (false);<br />
<br />
    return retVal;<br />
}<br />
<br />
// ****************<br />
</nowiki><br />
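Stripped of the $$...$$ pre-processing, the body of this formula behaves like the plain JavaScript sketch below. Note the two spaces inserted before each line break: that is what Markdown requires for a hard line break.<br />

```javascript
// Trim leading whitespace, strip indentation after each line break,
// and make every remaining line break a Markdown hard break by
// ensuring it is preceded by exactly two spaces.
function lineBreakInput(inputText) {
    if (!inputText) return null;
    var retVal = inputText.replace(/^\s+/, "");
    retVal = retVal.replace(/([\n\r])\s+/g, "$1");
    retVal = retVal.replace(/\s*([\n\r])/g, "  $1");
    return retVal;
}
```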
<br />
==== frame.text.md.snippet ====<br />
<br />
$$INPUT_TEXT$$<br />
<br />
This snippet is again 'one level out' from the previous ''text.paragraph.md.snippet'' snippet. $$INPUT_TEXT$$ is replaced by the collated text of all the paragraphs in the text frame. This particular snippet boils down to a 'do nothing': simply pass on the data received by collating the paragraphs.<br />
<br />
==== frame.graphic.md.snippet ====<br />
<br />
$$FRAME_PREFIX$$![$$FRAME_IMAGE_NAME$$]($$FRAME_IMAGE_PATH$$)$$FRAME_SUFFIX$$<br />
<br />
This snippet is on the same level as the previous one: where ''frame.text.md.snippet'' is used for text frames, this one is used for graphical frames. <br />
<br />
$$FRAME_PREFIX$$ and $$FRAME_SUFFIX$$ are replaced by nothing for anchored and inline graphics, so the graphic is displayed in-line in Markdown. <br />
<br />
For 'floating' frames, these are instead replaced by a few extra newlines, to separate the graphic from the text.<br />
<br />
$$FRAME_IMAGE_NAME$$ is replaced by the name of the image, and $$FRAME_IMAGE_PATH$$ becomes the relative path of the exported graphic.<br />
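Filled in, the snippet yields a standard Markdown image reference. A sketch of the expansion (the function name and the exact prefix/suffix strings are illustrative assumptions):<br />

```javascript
// Expand frame.graphic.md.snippet for one graphic frame. Anchored and
// inline graphics get an empty prefix/suffix; floating frames get
// extra newlines so the image is separated from the surrounding text.
function formatGraphicFrame(imageName, imagePath, isInline) {
    var prefix = isInline ? "" : "\n\n";
    var suffix = isInline ? "" : "\n\n";
    return prefix + "![" + imageName + "](" + imagePath + ")" + suffix;
}
```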
<br />
==== document.md.snippet ====<br />
<br />
This is the 'top level' snippet.<br />
<br />
[//]: # (Document markdown file $$MARKDOWN_FILENAME$$ generated by $$SOFTWARE$$ $$VERSION$$)<br />
$$INPUT_TEXT$$<br />
<br />
$$INPUT_TEXT$$ is replaced by the collation of all lower-level snippets.<br />
<br />
The first line in this snippet is a Markdown comment line with some meta-info about the document.<br />
<br />
=== Constants and Formulas ===<br />
<br />
In the snippets, there are a lot of references to placeholders between two $$. <br />
<br />
Many of these placeholders are calculated automatically by Crawler, but it is also possible to define custom placeholders, either as configuration constants or as calculated formulas.<br />
<br />
==== Constants ==== <br />
<br />
Any entry made in the ''[appContextData]'' section of the config.ini becomes a placeholder. For example, the entry ''MARKDOWN_FILENAME_EXTENSION''<br />
<br />
[appContextData]<br />
MARKDOWN_FILENAME_EXTENSION = .md<br />
<br />
causes a placeholder $$MARKDOWN_FILENAME_EXTENSION$$ to become available for use in the snippets.<br />
<br />
If you were to add a new line, for example <br />
<br />
[appContextData]<br />
MARKDOWN_FILENAME_EXTENSION = .md<br />
AUTHOR_NAME = "John Doe"<br />
<br />
you could start using a placeholder $$AUTHOR_NAME$$ in the snippets.<br />
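Conceptually, the [appContextData] section is turned into a placeholder lookup table. The sketch below is a simplified stand-in for Crawler's real INI parser (which also handles selectors and comments):<br />

```javascript
// Collect the key = value entries of an [appContextData] section into
// a placeholder lookup table. Surrounding quotes on values are stripped.
function readAppContextData(iniText) {
    var data = {};
    var inSection = false;
    var lines = iniText.split(/\r?\n/);
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i].replace(/^\s+|\s+$/g, "");
        if (line.charAt(0) === "[") {
            inSection = (line === "[appContextData]");
        } else if (inSection && line.indexOf("=") >= 0) {
            var pos = line.indexOf("=");
            var key = line.substring(0, pos).replace(/\s+$/, "");
            var value = line.substring(pos + 1)
                            .replace(/^\s+/, "")
                            .replace(/^"|"$/g, "");
            data[key] = value;
        }
    }
    return data;
}
```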
<br />
==== Formulas ==== <br />
<br />
These are small bits of ExtendScript (JavaScript)-like code which express how a certain placeholder can be calculated. <br />
<br />
The code is not 'pure' ExtendScript - there is some pre-processing to handle 'placeholders' in the scripting code.<br />
<br />
These files are stored in ''Personalities/Markdown/formulas''.<br />
<br />
For example, the $$FRAME_PREFIX$$ placeholder in the ''frame.graphic.md.snippet'' is calculated by a formula in ''graphicframe.jsx.snippet''.<br />
<br />
<nowiki><br />
// ****************<br />
<br />
$$FRAME_PREFIX$$ =<br />
{<br />
    var retVal = null;<br />
<br />
    do<br />
    {<br />
        var granule = $$RAW_GRANULE$$;<br />
        if (! (granule instanceof G.FrameGranule))<br />
        {<br />
            break;<br />
        }<br />
<br />
        var frame = granule.getData();<br />
        if (frame.parent instanceof Character)<br />
        {<br />
            // Inline or anchored frame: no prefix<br />
            retVal = "";<br />
        }<br />
        else<br />
        {<br />
            // Floating frame: force a Markdown line break<br />
            retVal = " \n";<br />
        }<br />
    }<br />
    while (false);<br />
<br />
    return retVal;<br />
}<br />
</nowiki><br />
<br />
Without going into too much detail: if the graphic frame has a Character as its 'parent' in InDesign (which means it is inline or anchored in text), the formula returns "" (nothing). In all other cases, it returns " \n" (a forced line break in Markdown).<br />
<br />
Another useful example can be found in ''files.jsx.snippet'': <br />
<br />
<nowiki>// ****************<br />
<br />
$$TODAYS_DATE$$ = <br />
{<br />
return new Date().toString();<br />
}<br />
<br />
// ****************<br />
</nowiki><br />
<br />
This defines a placeholder $$TODAYS_DATE$$ which can be used to insert the current date. For example, you could adjust the ''document.md.snippet'' from<br />
<br />
[//]: # (Document markdown file $$MARKDOWN_FILENAME$$ generated by $$SOFTWARE$$ $$VERSION$$)<br />
$$INPUT_TEXT$$<br />
<br />
to<br />
<br />
[//]: # (Document markdown file $$MARKDOWN_FILENAME$$ generated on $$TODAYS_DATE$$ by $$SOFTWARE$$ $$VERSION$$)<br />
$$INPUT_TEXT$$<br />
<br />
and then the Markdown files would contain the conversion date in a comment at the beginning.<br />
<br />
=== Handling the raw text ===<br />
<br />
One formula in particular is of interest. In the ''run.jsx.snippet'' you'll find the following formula:<br />
<br />
<nowiki><br />
// ****************<br />
<br />
$$RAW_TEXT$$ =<br />
{<br />
    var retVal = undefined;<br />
    do<br />
    {<br />
        retVal = $$RAW_TEXT$$;<br />
        if (! retVal || retVal == "")<br />
        {<br />
            break;<br />
        }<br />
<br />
        retVal = retVal.replace(/#/g,"\\#");<br />
        retVal = retVal.replace(/\*/g,"\\*");<br />
        retVal = retVal.replace(/_/g,"\\_");<br />
        retVal = retVal.replace(/!/g,"\\!");<br />
    }<br />
    while (false);<br />
<br />
    return retVal;<br />
}<br />
<br />
// ****************<br />
</nowiki><br />
<br />
This formula is responsible for escaping special characters in the input. If additional characters need to be escaped, this is the place to do it.<br />
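As an example of such a change, here is the same escaping logic as plain, runnable JavaScript, extended with one extra rule for backticks. The backtick rule is a hypothetical addition, not part of the shipped formula:<br />

```javascript
// Escape the characters that Markdown would otherwise interpret as
// syntax. The backtick rule is the hypothetical extra character.
function escapeMarkdown(text) {
    if (!text) return text;
    return text
        .replace(/#/g, "\\#")
        .replace(/\*/g, "\\*")
        .replace(/_/g, "\\_")
        .replace(/!/g, "\\!")
        .replace(/`/g, "\\`");   // added: escape backticks too
}
```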
<br />
=== Useful config.ini settings ===<br />
<br />
==== MARKDOWN_FILENAME_EXTENSION ====<br />
<br />
This setting sets the file name extension to use for the output files (default: ''.md'').<br />
<br />
==== imagesFolder ==== <br />
<br />
This is the name of the subfolder in which exported graphics are stored (default: ''images'').<br />
<br />
==== headingStylesLevel<n> ====<br />
<br />
Where <n> is 1 up to 6. This setting is a comma-separated list of paragraph style names that will be converted to a heading of the corresponding level (defaults: headingStylesLevel1 is ''Heading, Title''; headingStylesLevel2 - headingStylesLevel6 are empty).<br />
<br />
==== minPointSizeLevel<n> ====<br />
<br />
Where <n> is 1 up to 6. This is the point size at or above which text will be considered to be a heading of the corresponding level (defaults: minPointSizeLevel1 is ''18''; minPointSizeLevel2 - minPointSizeLevel6 are empty).<br />
<br />
For example, if minPointSizeLevel1 is 18, then any paragraph that starts with a glyph that is 18pt or larger will be considered to be a first-level heading. <br />
<br />
These can be left empty. When defining multiple levels, it should be true that minPointSizeLevel1 > minPointSizeLevel2 > ... > minPointSizeLevel6. <br />
<br />
In other words, if any, minPointSizeLevel6 has to be the smallest value, and minPointSizeLevel1 has to be the largest value.<br />
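The rule can be read as: walk the thresholds from level 1 down, and the first threshold the point size reaches wins. A sketch of that interpretation (the threshold values and function name are illustrative):<br />

```javascript
// Map a paragraph's starting point size to a heading level, given the
// minPointSizeLevel1..6 thresholds in descending order. Returns 0 when
// the paragraph is not a heading. Empty thresholds are skipped.
function headingLevelForPointSize(pointSize, thresholds) {
    for (var level = 1; level <= 6; level++) {
        var min = thresholds[level - 1];
        if (min !== undefined && min !== null && pointSize >= min) {
            return level;
        }
    }
    return 0;
}

// minPointSizeLevel1 = 18, minPointSizeLevel2 = 14, the rest empty:
var levels = [18, 14, null, null, null, null];
```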
<br />
==== blockquoteStyles ====<br />
<br />
The names of any paragraph styles that will be converted to blockquotes (i.e. prefixed with '> ') (default: nothing).<br />
<br />
==== imageExportDPI ====<br />
<br />
The image resolution to use for exporting any graphic frames (default: ''72'').<br />
<br />
==== imageExportFormat ==== <br />
<br />
This is either ''PNG'' or ''JPEG''. It tells Crawler what image format to use for the graphic frames (default: ''PNG'').<br />
<br />
''PNG'' is only supported in CS6 and above.<br />
<br />
==== boldFontStyles ====<br />
<br />
This setting is a list of font style names separated with | characters. The listed font style names will be considered to be bold (default: ''bold|heavy|black''). <br />
<br />
This translates to prefixing and suffixing the text with two asterisks: **.<br />
<br />
==== italicFontStyles ====<br />
<br />
This setting is a list of font style names separated with | characters. The listed font style names will be considered to be italic (default: ''italic|oblique|slanted''). <br />
<br />
This translates to prefixing and suffixing the text with an underscore: _.<br />
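A plausible reading of these two settings is a case-insensitive, word-by-word match of the font style name against the |-separated list. The exact matching rule is an assumption, not Crawler's documented behavior:<br />

```javascript
// Decide whether a font style name (e.g. "Bold", "Heavy Italic")
// matches one of the names in a |-separated setting value, by
// comparing each word of the style name case-insensitively.
function styleMatches(fontStyleName, settingValue) {
    var names = settingValue.toLowerCase().split("|");
    var words = fontStyleName.toLowerCase().split(/\s+/);
    for (var i = 0; i < words.length; i++) {
        for (var j = 0; j < names.length; j++) {
            if (words[i] === names[j]) return true;
        }
    }
    return false;
}
```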
<br />
==== asBitmapTextFrameLabel ====<br />
<br />
This setting is a word you can assign to individual frames in the document via the ''Script Label'' panel (default: nothing).<br />
<br />
By entering this word onto the ''Script Label'' panel you can force selected text frames to be exported as bitmaps instead of as text.<br />
<br />
==== ignoreFrameLabel ====<br />
<br />
This setting is a word you can assign to individual frames in the document via the ''Script Label'' panel (default: ''ignore'').<br />
<br />
By entering this word onto the ''Script Label'' panel you can force the selected page item to be omitted from the output file.<br />
<br />
==== ignoreInvisibleLayers ====<br />
<br />
This setting is either 0 or 1 (default: ''0'').<br />
<br />
Setting it to 1 will force Crawler to omit any frames that are on invisible layers.<br />
<br />
==== ignoreLayers ====<br />
<br />
This setting is a comma-separated list of layer names. It can be left empty (default: nothing).<br />
<br />
Any items on a layer named here will be suppressed from the output.<br />
<br />
==== overrideAllMasterPageItems ====<br />
<br />
This setting is either 0 or 1 (default: ''0''). Setting it to 1 will force Crawler to include master page items in the export.<br />
<br />
==== textFrameSplitting ====<br />
<br />
This setting is either 0 or 1 (default: ''0''). If it is 0, the export will be done on a story-by-story basis. Setting it to 1 forces Crawler to export on a textframe-by-textframe basis.<br />
<br />
== Installation ==<br />
<br />
To install ''Crawler.ID2MD'', you first need to launch InDesign, and find the ''Scripts'' panel. If it is not visible, you can make it appear by means of the ''Window - Utilities - Scripts'' menu item.<br />
<br />
Once the Scripts panel is visible, right-click or <Control>-click the ''User'' entry and select 'Reveal in Finder' (on Mac) or 'Reveal in Explorer' (on Windows).<br />
<br />
[[File:Install01.png|330px]]<br />
<br />
A window on a folder called ''Scripts'' should open. <br />
<br />
Inside there should be a folder called ''Scripts Panel''. Double-click its icon to enter it.<br />
<br />
[[File:Install02.png|330px]]<br />
<br />
Once you're inside the ''Scripts Panel'' folder, you can drag the ''Crawler.ID2MD'' folder into it. <br />
<br />
[[File:Install03.png|537px]]<br />
<br />
Now switch back to InDesign, and verify that the ''Crawler.ID2MD'' folder has appeared under the ''User'' folder on the ''Scripts'' panel:<br />
<br />
[[File:Install04.png|330px]]<br />
<br />
Click the disclosure triangle - you should now see ''Export.jsxbin''.<br />
<br />
[[File:Install05.png|330px]]</div>
App Context (DocDataFlow wiki, created 2014-01-04 by Kris)
<hr />
<div>App Context</div>
Custom Personality Tutorial (DocDataFlow wiki, last edited 2013-12-30 by Kris)
<hr />
<div>First, we'll build a very simple personality, and we'll gradually extend it to better demonstrate how Crawler works.<br />
<br />
The basis of nearly all document conversion personalities is the [[ViewExporter|''ViewExporter'']] adapter.<br />
<br />
The [[ViewExporter|''ViewExporter'']] is a complex adapter. Basically, it connects two 'main' sub-adapters: a [[Disassembler|''disassembler'']] (which breaks a document granule into smaller granules) and an [[Assembler|''assembler'']] (which takes the granules coming out of the disassembler, and builds the desired end-result).<br />
<br />
The disassembler is part of the default Crawler setup. When running Crawler, the [[ViewExporter|''ViewExporter'']] will ask the currently active application to provide it with an appropriate disassembler, and it will then use that disassembler in the [[ViewExporter|''ViewExporter'']].<br />
<br />
The disassembler gets further configuration through the configuration files.<br />
<br />
= Adjusting The Top-Level config.ini =<br />
<br />
First, we'll enhance the top-level configuration file so it knows about the new personality we're going to build.<br />
<br />
Let's call the personality 'tutorial'.<br />
<br />
Change the top-level [[INI file|''config.ini'']] (i.e. the ''config.ini'' which resides next to <code>Export.jsxbin</code>). Initially it looks similar to this (I've omitted most comments for brevity).<br />
<br />
<pre><br />
[conditionals]<br />
<br />
selectors = text<br />
<br />
[main]<br />
<br />
personalityConfig?text = "./Personalities/Text/config.ini"<br />
<br />
# ********************************************************************************<br />
<br />
[debug]<br />
<br />
debugMonitoring = false<br />
logLevel = 0<br />
<br />
</pre><br />
<br />
Change it so it becomes like this:<br />
<br />
<pre><br />
[conditionals]<br />
<br />
selectors = tutorial<br />
<br />
[main]<br />
<br />
personalityConfig?tutorial = "./Personalities/Tutorial/config.ini"<br />
personalityConfig?text = "./Personalities/Text/config.ini"<br />
<br />
# ********************************************************************************<br />
<br />
[debug]<br />
<br />
debugMonitoring = true<br />
monitorAdapters = inputSplitter<br />
<br />
logLevel = 5<br />
logFileName = Crawler.log<br />
</pre><br />
<br />
This tells Crawler that we want to select 'tutorial', and that the matching ''personalityConfig'' entry points to the lower-level ''config.ini'' inside the Tutorial folder inside the Personalities folder.<br />
<br />
We also switch on debug monitoring, and hook a [[Debug Monitor|''Debug Monitor'']] into the [[Splitter|''inputSplitter'']] adapter inside the [[ViewExporter|''ViewExporter'']].<br />
<br />
= Creating A Tutorial Personality =<br />
<br />
Now that Crawler 'knows' about the new personality, the next step is to make a start building it. <br />
<br />
Open the Personalities folder, and create a new subfolder called ''Tutorial''. Inside that subfolder, create a text file called ''config.ini''.<br />
<br />
[[File: TutorialInitialPersonality.png]]<br />
<br />
Put the following text in this personality-level config.ini file:<br />
<br />
<pre><br />
[main]<br />
<br />
views = tutorialView<br />
nesting = document/text.story<br />
<br />
[main:tutorialView]<br />
<br />
accepted = text.story<br />
<br />
</pre><br />
<br />
== View ==<br />
<br />
With this config file, we tell the [[ViewExporter|''ViewExporter'']] that we only need a single view, named ''tutorialView''.<br />
<br />
Later on, we'll build personalities with multiple views. Views are a way to concurrently build separate, but related files.<br />
<br />
For example, when converting to XHTML, we need to build a CSS structure as well as an XHTML structure, and keep track of how they relate to one another.<br />
<br />
This kind of 'interrelated' file building is handled through views in Crawler.<br />
<br />
So for XHTML+CSS conversion, we'll have two views.<br />
<br />
For this first tutorial, we don't need to build multiple views concurrently, so we can make do with just the single ''tutorialView''.<br />
<br />
== Disassembly Hierarchy ==<br />
<br />
We'll be processing InDesign documents, which have a 'natural' hierarchy: <br />
* documents contain stories<br />
* stories contain paragraphs<br />
* paragraphs contain text runs<br />
* text runs contain words. <br />
<br />
This is not the only hierarchy we could use in InDesign documents. An alternate hierarchy would be <br />
* documents contain spreads<br />
* spreads contain text frames<br />
* text frames contain text runs<br />
* text runs contain words. <br />
<br />
This alternate hierarchy does not 'map' onto the first hierarchy: text frames do not map cleanly onto paragraph boundaries or vice versa.<br />
<br />
In this case, we're initially interested in getting the text, and we don't care too much about the lower-level granules, so all we tell the [[Disassembler|''disassembler'']] is:<br />
<br />
<code>nesting = document/text.story</code>.<br />
<br />
This tells the [[Disassembler|''disassembler'']]: if you see a <code>document</code> granule, please find a way to disassemble it into <code>text.story</code> granules. <br />
<br />
When presented with a document granule on the input side, the [[Disassembler|''disassembler'']] will spit out a few story granules, followed by the original document granule at the output side. <br />
<br />
A disassembler never takes granules away: it will only add to the input stream. So the original ''document'' granule will still be output, but it will be preceded by the ''text.story'' granules extracted from the document.<br />
<br />
Later on, we'll tell the disassembler to dig deeper than that.<br />
<br />
== Granule Class Identifier Shorthand == <br />
<br />
The <code>nesting</code> entry is a slash-separated list of [[Class identifier|''class identifiers'']], which represent some granule classes. The expression <br />
<br />
<pre><br />
[main]<br />
...<br />
nesting = document/text.story<br />
...<br />
</pre><br />
<br />
is actually shorthand for:<br />
<br />
<pre><br />
[main]<br />
...<br />
nesting = com.rorohiko.granule.document/com.rorohiko.granule.text.story<br />
...<br />
</pre><br />
<br />
because Crawler allows granule class identifiers to be shortened by dropping the <code>com.rorohiko.granule.</code> prefix.<br />
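Expanding the shorthand is mechanical: prepend the prefix to each slash-separated element that lacks it. A sketch (function names invented):<br />

```javascript
// Expand a shorthand granule class identifier to its full form by
// prepending the com.rorohiko.granule. prefix when it is missing.
var GRANULE_PREFIX = "com.rorohiko.granule.";

function expandClassIdentifier(id) {
    if (id.indexOf(GRANULE_PREFIX) === 0) return id;
    return GRANULE_PREFIX + id;
}

// Expand every element of a slash-separated nesting string.
function expandNesting(nesting) {
    var parts = nesting.split("/");
    for (var i = 0; i < parts.length; i++) {
        parts[i] = expandClassIdentifier(parts[i]);
    }
    return parts.join("/");
}
```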
<br />
== Log Output ==<br />
<br />
Our rudimentary personality is not complete yet, but we can already try to run it. <br />
<br />
We won't get any meaningful output just yet, but we've configured a debug monitor, and a log file, so we'll see some useful information there.<br />
<br />
If you run the <code>Export.jsxbin</code> script on an InDesign document, you'll get something like this in the Crawler.log file that should appear next to the <code>Export.jsxbin</code> file:<br />
<br />
<pre><br />
Mon Dec 30 2013 18:02:18 GMT+1300: Error: OutputTextFile.prototype.dumpData_protected_: needs file<br />
Mon Dec 30 2013 18:02:18 GMT+1300: Note : ViewSplitter 'inputSplitter' input log:<br />
**************************<br />
InDesignStoryGranule [Facesti quo tet, offictation reculpa ritiand icatum doles maios re ...]<br />
<br />
InDesignStoryGranule [Pudipsunt alitionet que labo. Liquo cor rerundelento eos et apiet ...]<br />
<br />
InDesignDocumentGranule [TutorialTest.indd]<br />
<br />
**************************<br />
<br />
Mon Dec 30 2013 18:02:18 GMT+1300: Error: OutputTextFile.prototype.dumpData_protected_: needs file<br />
</pre><br />
<br />
We can see the [[OutputTextFile|''OutputTextFile'']] adapter is unhappy (<code>Error: OutputTextFile.prototype.dumpData_protected_: needs file</code>) because it does not know where it needs to send its output. We'll fix that soon. <br />
<br />
The area of interest is the list of granules that trickled through the adapter network: we can see we had two story granules, followed by a document granule.<br />
<br />
== Getting The Text Into a File ==<br />
<br />
We'll now modify the personality-level config.ini a little bit so the text gets dumped into a file. <br />
<br />
The sub-adapters used in the [[ViewExporter|''ViewExporter'']] in Crawler look at a number of predefined [[Context Variable|''context variables'']] to determine what to do. <br />
<br />
One of these many predefined variables is called <code>FILEPATH</code>.<br />
<br />
The [[OutputTextFile|''OutputTextFile'']] adapter will query the context for a [[Context Variable|''variable'']] called <code>FILEPATH</code>. <br />
<br />
If this [[Context Variable|''variable'']] is defined and contains a path to a file, then the [[OutputTextFile|''OutputTextFile'']] adapter will dump its output into that file.<br />
<br />
We can set the <code>FILEPATH</code> variable in the 'global' app context for the application by means of the personality-level [[INI file|''config.ini'']] file. <br />
<br />
Contexts are arranged in a hierarchy; the 'topmost' context is the [[App Context|''app context'']] which we can influence from the [[INI file|''config.ini'']] file.<br />
<br />
Change the config.ini in the Tutorial directory so it becomes:<br />
<br />
<pre><br />
[main]<br />
<br />
views = tutorialView<br />
<br />
nesting = document/text.story<br />
<br />
[appContextData]<br />
<br />
FILEPATH?Mac = ~/Desktop/output.txt<br />
FILEPATH?Win = C:\tmp\output.txt<br />
<br />
[main:tutorialView]<br />
<br />
accepted = document, text.story<br />
</pre><br />
<br />
The added entry ''FILEPATH...'' in the [appContextData] section defines a context variable <code>FILEPATH</code>. <br />
<br />
Because Mac and Windows are different, we're using a [[INI_file#Conditional_entries|''selector'']] to decide which FILEPATH to use. <br />
<br />
Only one of the two entries shown will be used, depending on the platform we're using. The two selectors ''Mac'' and ''Win'' are system-defined.<br />
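The selector mechanism can be pictured as a resolution pass over the raw entries: a key of the form KEY?selector survives only when that selector is active. The sketch below is a simplified stand-in for the real INI handling:<br />

```javascript
// Resolve conditional INI entries of the form KEY?selector, keeping
// plain keys as-is and conditional keys only when their selector is
// in the active set.
function resolveEntries(entries, activeSelectors) {
    var resolved = {};
    for (var key in entries) {
        if (!entries.hasOwnProperty(key)) continue;
        var pos = key.indexOf("?");
        if (pos < 0) {
            resolved[key] = entries[key];
        } else {
            var baseKey = key.substring(0, pos);
            var selector = key.substring(pos + 1);
            for (var i = 0; i < activeSelectors.length; i++) {
                if (activeSelectors[i] === selector) {
                    resolved[baseKey] = entries[key];
                }
            }
        }
    }
    return resolved;
}
```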
<br />
This FILEPATH value is picked up by the [[OutputTextFile|''OutputTextFile'']] adapter embedded inside the [[ViewExporter|''ViewExporter'']].<br />
<br />
Run the <code>Export.jsxbin</code> again. <br />
<br />
You should now end up with a file called ''output.txt'' on the desktop. This file will contain all the text extracted from the InDesign document.<br />
<br />
== Book Files ==<br />
<br />
Now close all documents, and open or create just an InDesign book file that references a few InDesign files. <br />
<br />
Keep all documents closed; only the book file should be open.<br />
<br />
Run the <code>Export.jsxbin</code> again. <br />
<br />
This time, you should end up with the collated text from all the book's documents concatenated together.<br />
<br />
You might wonder how this could possibly work, given that the nesting in the config file does not mention book granules. You might expect that you would have needed to change the config.ini to read:<br />
<br />
<pre><br />
[main]<br />
...<br />
nesting = book/document/text.story<br />
</pre><br />
<br />
meaning:<br />
<br />
* start from a book<br />
* descend into the documents<br />
* descend into the text stories in each document<br />
<br />
This longer nesting string <code>book/document/text.story</code> would also work for processing books, but it would not work for single documents. <br />
<br />
When processing a single document, the initial granule that starts the process is a document, not a book. The nesting string <code>book/document/text.story</code> wants to see a book granule before it starts 'digging in'. <br />
<br />
The shorter nesting string <code>document/text.story</code> is better, because it works for both books and documents.<br />
<br />
The underlying mechanism works as follows: when we run ''Export.jsxbin'', a granule comes down the line to be processed. It will be either a book granule or a document granule. <br />
<br />
When it is a document granule, we get a match with the top-level nesting element, and the disassembly process can properly start. <br />
<br />
If it is a book granule, there is no matching granule class at the start of the nesting string <code>document/text.story</code>. <br />
<br />
When that happens, the ''ViewExporter'' will try to automatically pick a default approach to disassemble. We're not telling it what it should do with book granules, so it 'makes something up'.<br />
<br />
For books, it will disassemble the book into its component document granules, and then process these document granules one by one. <br />
<br />
Each of these document granules will then also be presented to the ViewExporter. <br />
<br />
These granules do match the starting entry of the nesting string <code>document/text.story</code> and the 'normal' document disassembly will take its course from there on.<br />
<br />
== Nesting String ==<br />
<br />
Change the config.ini to look similar to the following:<br />
<br />
<pre><br />
[main]<br />
<br />
views = tutorialView<br />
<br />
nesting = document/text.word<br />
<br />
[appContextData]<br />
<br />
FILEPATH?Mac = ~/Desktop/output.txt<br />
FILEPATH?Win = C:\Users\kris\Desktop\output.txt<br />
<br />
[main:tutorialView]<br />
<br />
accepted = text.word<br />
</pre><br />
<br />
If you're on Windows, you'll need to adjust the file path so it points to a writable directory; the example shows the path to my desktop directory.<br />
<br />
=== Disassembling Down To The Word Level ===<br />
<br />
There are two changes: the ''nesting'' string now mentions ''text.word'' instead of ''text.story'', and the ''accepted'' entry now lists ''text.word'' instead of ''text.story''.<br />
<br />
This setup will grab an incoming input granule (either a document or a book granule), and then gradually peel it apart, layer by layer, until it reaches the individual words hidden inside.<br />
<br />
=== Auto-Discovery Of Word Granules In Document ===<br />
<br />
The 'defaulting' mechanism mentioned in the previous section also occurs each time we have unmatched elements in the middle of the nesting string. <br />
<br />
Essentially, without instructions from the ''nesting'' string, ''ViewExporter'' will try to 'dig deeper' into the data, disassembling level by level, until it either 'hits' the desired granule class, or comes up empty.<br />
<br />
When a ''document'' is disassembled, the ViewExporter will try to match the next required granule class shown in the ''nesting'' string. In this example that granule class is ''text.word''.<br />
<br />
Documents don't disassemble readily into words. So, the defaulting mechanism kicks in, and the document is disassembled into some smaller granules. The default ViewExporter behavior is to disassemble a document into ''spread'' granules. <br />
<br />
ViewExporter will now try to disassemble these ''spread'' granules down to the desired ''text.word'' granule class. <br />
<br />
Just as with ''document'' granules, ''spread'' granules don't split up into words, so it will again have to call on some default behavior. <br />
<br />
The default is to disassemble a ''spread'' granule into frame granules. Some, but not necessarily all, of these frames will be text frames. <br />
<br />
ViewExporter will now try to disassemble these ''frame'' granules down to the desired ''text.word'' granule class. <br />
<br />
For image frames, that won't work. It will try digging a bit deeper using the same defaulting mechanism at each 'level', but eventually it'll come up empty: there are no word granules to be found in an image frame.<br />
<br />
For text frames, it will finally work: ''ViewExporter'' does know how to grab the contents of a text frame and disassemble it into words.<br />
<br />
This finally satisfies the nesting string which was calling for ''text.word'' granules.<br />
<br />
=== Stop Drilling Down When All Nesting Entries Have Been Matched === <br />
<br />
As the complete nesting string has now been matched, no further disassembly takes place: ViewExporter will not try to disassemble the ''text.word'' granules any further.<br />
<br />
For an InDesign document the nesting string <code>document/text.word</code> is more or less equivalent to a nesting string <code>document/spread/text.frame/text.word</code>. <br />
<br />
The main difference is that we leave it up to the ''ViewExporter'' to figure out how it will get from ''document'' granules down to ''text.word'' granules.<br />
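The drill-down described above can be modelled as a recursive search. The following is a simplified sketch, assuming hypothetical granule objects with a ''granuleClass'' and a ''children'' array; the real ViewExporter obtains child granules from each level's default disassembler rather than from a prebuilt tree:<br />

```javascript
// Simplified model of the ViewExporter drill-down (hypothetical granule
// shapes; in Crawler itself, each level's default disassembler produces
// the child granules on demand).
function drillDown(granule, wantedClass, result) {
    result = result || [];
    if (granule.granuleClass === wantedClass) {
        // Matched the nesting entry: stop disassembling this branch.
        result.push(granule);
        return result;
    }
    // No match: 'dig deeper' using the default disassembly of this level.
    var children = granule.children || [];
    for (var i = 0; i < children.length; i++) {
        drillDown(children[i], wantedClass, result);
    }
    // A branch without children (e.g. an image frame) comes up empty
    // and contributes nothing.
    return result;
}
```

Calling ''drillDown(documentGranule, "text.word")'' would then walk document, spreads, and frames, and return only the word granules it finds in text frames.<br />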
<br />
=== Log File === <br />
In the log file you'll see something like:<br />
<pre><br />
**************************<br />
<br />
Thu Jan 16 2014 14:07:30 GMT+1300: Note : ViewSplitter 'inputSplitter' input log:<br />
**************************<br />
InDesignWordGranule [Pudipsunt]<br />
<br />
InDesignWordGranule [alitionet]<br />
<br />
InDesignWordGranule [que]<br />
<br />
InDesignWordGranule [labo.]<br />
...<br />
InDesignWordGranule [que]<br />
<br />
InDesignWordGranule [vel]<br />
<br />
InDesignWordGranule [molest]<br />
<br />
InDesignTextFrameGranule [Pudipsunt alitionet que labo. Liquo cor rerundelento eos et apiet dende arum aliquos ipit qui ditiusa eriam, cupti antiis vendignis et derferum et offictiatem quatus everspis et aut aut erferfe rspelig eniscipit aut eiciassi tet ute peror aut etur ratur ratenist quatium aut odignamet est, sequiatures eos moloren dantin re occaescium none cupiet aut officie nihita plabori temporem ipsanduntur?<br />
Ihilignam, consequis assitiis pore, nonsendus expla niscitia consequiam est, comnia simos nus, sunt doloria cumque peruptatet inverunt.<br />
Faceati officabo. Loreper iorrum esti venisciis magnam renihit volor molland elenem aris non comnimin re porrora dolectas endic tem enihitio et lit modis acerite pore consernatem. Ga. Soluptatur accus quia volut quat preperi dolorro bla quaspe nectet que none nos ipidempor aliquiam vellore nusdame ndunt, iurest, te moleste mpore, niminctem asita non perovidel ipit alibusd andeni ommoluptibus quat ese nisquos quo enda num qui odicimpores alibus, illenda quiam fuga. Et velibus exera sum ate sedisitatias earciassunt vid excea quaepudit, comnia aliquas perchit ectest, cullaborum et quidelendia aut volorru ntiunto ipid quod ut alibusae sim quat ex exerest adi idis dolorib eruptati denis que vel molest]<br />
<br />
InDesignWordGranule [Facesti]<br />
...<br />
<br />
InDesignTextFrameGranule [Facesti quo tet, offictation reculpa ritiand icatum doles maios re sime niet ma consequae pelignimus doluptia planihi ctorae pro in perum ius est vel int.<br />
Inulpa si tem sundaec tinctur secuptaecea pelitat iorest, accum iducient, utes aut mos modis ad que pernamus aut mo optae voloreptate pe pellaut que volo eaqui omnis eaque quatur aut mo to quaectur, conet aspel evelest am que nem essum aliquidiam enis invelestio to testi quae nonsect ibusda nonecus estis solorernam et quiscim agnimi, in pro is volor sit exces eum qui nobitet ut re volupti orporeris alibus aut aut laut entum et qui aditaturem volorer ibusam quatem rehent excessi blaute dolorepro bero omniend igentis quianienis expliqu idellabo. Ume quatecta doluptat eum autentotae solut dolupta spidus non nullupt aquamusa qui dionsequiae. Et hil ipsapist quo totata volorei cture, sed molore evellig nimilis prae secus molorum anda sediti accus preperi aspero volor aut rendion por minimus reperum reces eat es nim nulparunt, cum rem rero consequiaspe voloriate post, sumenimagnis rerupti doloreperiat listo comnim quiatur, que re pratur? Quiae nimus aute nobitassimi, quundellori core, cones nos ad et et velent accabo. Et lit volute eum lique nihicim poressiti adite velitat la voloribus, sumet ipsum faccabo. Ut ut officium diaeria ne aut vellupiti]<br />
<br />
InDesignSpreadGranule [Spread #1]<br />
<br />
InDesignDocumentGranule [TutorialTest2.indd]<br />
<br />
**************************<br />
<br />
</pre><br />
<br />
Reading back to front, you can see that a document is broken up into spreads, spreads are broken up into frames, and the text frames are broken up into words.<br />
<br />
== Assembling ==<br />
<br />
As these granules come 'down the pipe', we need to re-assemble them into some kind of output. <br />
<br />
That's where the ''accepted'' entry comes in.<br />
<br />
<pre><br />
...<br />
[main:tutorialView]<br />
...<br />
accepted = text.word<br />
</pre><br />
<br />
The ''tutorialView'' accepts ''text.word'' granules. In other words: of all the granules we see in the log file, only the ''text.word'' granules are 'let through' into the view for re-assembly.<br />
<br />
The output file will not look as good as before: all words will run together without intervening spaces.<br />
<br />
== Changing The Disassembly Sequence ==<br />
<br />
Currently, the source document is disassembled in story order: the disassembler goes into the document, then enumerates all the stories. <br />
<br />
The text is output story by story, which does not take into account any of the story flow through threaded text frames.<br />
<br />
To change the order, we can change the nesting string. Change your config to look similar to the following:<br />
<br />
<pre><br />
[main]<br />
<br />
views = tutorialView<br />
<br />
nesting = document/frame/text.paragraph<br />
<br />
[appContextData]<br />
<br />
FILEPATH?Mac = ~/Desktop/output.txt<br />
FILEPATH?Win = C:\Users\kris\Desktop\output.txt<br />
<br />
[main:tutorialView]<br />
<br />
accepted = text.paragraph<br />
</pre><br />
<br />
This nesting string tells the ViewExporter to start from the document and drill down into frame granules (which it will do by way of the spreads). <br />
<br />
From the frame granules, it then drills down into the paragraphs.<br />
<br />
The view is then set to only accept text.paragraph granules. <br />
<br />
This small change causes the document text to be output in text frame order instead of in story order.<br />
<br />
== Re-ordering Granules ==<br />
<br />
The order of the text frames in the previous test is not explicitly specified, and somewhat arbitrary. If we want the frames to be output in a particular order, we can add ordering info to the nesting string.<br />
<br />
Ordering is defined by an [[Ordering Classes|''ordering class'']] reference, which is prefixed with either a + (ascending) or a - (descending).<br />
<br />
For frames, two predefined ordering classes exist: ''com.rorohiko.ordering.frame.horizontal'' and ''com.rorohiko.ordering.frame.vertical''. <br />
<br />
Just as with granule class identifiers, there is a shorthand, and the ''com.rorohiko.ordering.'' prefix can be omitted, so ''com.rorohiko.ordering.frame.horizontal'' and ''frame.horizontal'' are equivalent.<br />
<br />
Crawler is extensible: additional ordering classes can easily be added to the system and referred to from the config.ini file. These could provide different kinds of orderings, or could be used for different granule classes.<br />
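As a sketch of how the +/- prefixed ordering entries combine, consider the following (names and shapes here are illustrative, not the real Crawler implementation):<br />

```javascript
// Hypothetical registry mapping ordering class shorthands to comparators.
var ORDERINGS = {
    "frame.vertical":   function (a, b) { return a.y - b.y; },
    "frame.horizontal": function (a, b) { return a.x - b.x; }
};

// Turn entries like "+frame.vertical" into one combined comparator:
// '+' keeps the ascending order, '-' reverses it, and earlier entries
// take precedence over later ones.
function makeComparator(entries) {
    var comparators = entries.map(function (entry) {
        var sign = entry.charAt(0) === "-" ? -1 : 1;
        var compare = ORDERINGS[entry.substring(1)];
        return function (a, b) { return sign * compare(a, b); };
    });
    return function (a, b) {
        for (var i = 0; i < comparators.length; i++) {
            var result = comparators[i](a, b);
            if (result !== 0) {
                return result;
            }
        }
        return 0;
    };
}
```

Sorting frames with ''makeComparator(["+frame.vertical", "+frame.horizontal"])'' orders them top-to-bottom, and left-to-right within the same vertical position.<br />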
<br />
The following change orders the frames top-to-bottom; frames at the same vertical position are further ordered left-to-right.<br />
<br />
<pre><br />
[main]<br />
<br />
views = tutorialView<br />
<br />
nesting = document/frame/+frame.vertical/+frame.horizontal/text.paragraph<br />
<br />
[appContextData]<br />
<br />
FILEPATH?Mac = ~/Desktop/output.txt<br />
FILEPATH?Win = C:\Users\kris\Desktop\output.txt<br />
<br />
[main:tutorialView]<br />
<br />
accepted = text.paragraph<br />
</pre><br />
<br />
Simply changing a + to a - reverses the ordering.</div>Krishttps://www.docdataflow.com/wiki/index.php/Custom_PersonalitiesCustom Personalities2013-12-30T03:25:22Z<p>Kris: Created page with "The following information is only useful to people with access to a 'middle' or 'high end' version of Crawler. If you're using a low end version of Crawler with a standard p..."</p>
<hr />
<div>The following information is only useful to people with access to a 'middle' or 'high end' version of Crawler. <br />
<br />
If you're using a low end version of Crawler with a standard personality, the following information is not applicable.<br />
<br />
[[Custom Personality Tutorial|Tutorial]]</div>Krishttps://www.docdataflow.com/wiki/index.php/FontGranuleFontGranule2013-12-30T02:22:36Z<p>Kris: Created page with "= General Info = A FontGranule represents a particular font reference used by some application in its document types. Currently, if there are font variations (e.g. bold, ital..."</p>
<hr />
<div>= General Info =<br />
<br />
A FontGranule represents a particular font reference used by some application in its document types. Currently, if there are font variations (e.g. bold, italic), these cause different font granules to be created: each variant is represented by a different font granule.<br />
<br />
= Developer Info =<br />
[[FontGranule Code| FontGranule Documentation]]</div>Krishttps://www.docdataflow.com/wiki/index.php/DocumentElementGranule_CodeDocumentElementGranule Code2013-12-30T02:18:06Z<p>Kris: Created page with "The DocumentElementGranule class is derived from the ''Granule'' class. It has a ''class identifier'' of <code>com.rorohiko.granule.docu..."</p>
<hr />
<div>The DocumentElementGranule class is derived from the [[Granule Code|''Granule'']] class.<br />
<br />
It has a [[Class identifier|''class identifier'']] of <code>com.rorohiko.granule.documentelement</code>.<br />
<br />
It is used as the base granule class for granule classes that 'wrap' various document elements.<br />
<br />
The following methods are provided:<br />
<br />
* <code>documentElementGranule.getAppContext()</code>: get the current application context for this document granule. In the first version of Crawler, this is a single global context, shared by the whole Crawler session, but in the future, Crawler might support multiple concurrent applications during a single conversion session, in which case the appContext can vary between document granules.<br />
<br />
* <code>documentElementGranule.getDocumentGranule()</code>: returns the 'owning document' for this granule.</div>Krishttps://www.docdataflow.com/wiki/index.php/Granule_CodeGranule Code2013-12-30T01:47:57Z<p>Kris: </p>
<hr />
<div>The Granule class is the root class from which all granule types are derived.<br />
<br />
It has a [[Class identifier|''class identifier'']] of <code>com.rorohiko.granule</code><br />
<br />
Granules are 'constant': once created they don't change. Any 'live' data related to the granule is carried in the associated [[Context|''context'']] or associated meta-data structures.<br />
<br />
It has the following methods:<br />
<br />
* <code>granule.comparePosition(orderingList, compareWithGranule)</code>: return an int (< 0, == 0, > 0) after comparing the granules. The orderingList is a linear list of [[Granule Ordering|granule ordering]].<br />
<br />
* <code>granule.getId()</code>: get the unique identifier for this granule.<br />
<br />
* <code>granule.getContext()</code>: retrieve the context for this granule.<br />
<br />
* <code>granule.getData()</code>: retrieve the core data for this granule. In most cases this is a reference to the document-specific data structure for the granule. E.g. if the granule is an InDesignTextFrameGranule, the granule.getData() will retrieve the associated InDesign TextFrame object (or a proxy thereof).<br />
<br />
* <code>granule.getDataAsDebugString()</code>: retrieve the core data for this granule in human-readable form.<br />
<br />
* <code>granule.getDataAsString()</code>: retrieve the core data for this granule in string form.<br />
<br />
* <code>granule.getMetaData(metaDataKey)</code>: return some meta-data associated with the granule. To associate meta-data, use <code>granule.setMetaData</code>. To avoid meta data key clashes, use reverse-domain names for the meta data keys. This mechanism can be used for attaching some temporary helper-data to a granule, and communicate meta-data between adapters along the data flow.<br />
<br />
* <code>granule.getParentGranule()</code>: retrieve the parent granule for this granule. Eventually will lead to an AppGranule - appGranules are the root(s) for the granule hierarchy.<br />
<br />
* <code>granule.getName()</code>: retrieve a human-readable name for the granule.<br />
<br />
* <code>granule.getRoutingGranuleClass()</code>: Granules can be replaced by proxies if necessary. This method provides access to the granule class of the original granule, which is used for routing the proxy through the data flow as if it is the original granule.<br />
<br />
* <code>granule.getRoutingId()</code>: Granules can be replaced by proxies if necessary. This method provides access to the granule identifier of the original granule, which is used for routing the proxy through the data flow as if it is the original granule (e.g. it is used to determine the visit counts for the granule).<br />
<br />
* <code>granule.setMetaData(metaDataKey, metaData)</code>: associate some meta-data with the granule. To avoid meta data key clashes, use reverse-domain names for the meta data keys.</div>Krishttps://www.docdataflow.com/wiki/index.php/DocumentGranule_CodeDocumentGranule Code2013-12-30T01:42:00Z<p>Kris: </p>
<hr />
<div>The DocumentGranule class is derived from the [[Granule Code|''Granule'']] class.<br />
<br />
It has a [[Class identifier|''class identifier'']] of <code>com.rorohiko.granule.document</code>.<br />
<br />
The following methods are provided:<br />
<br />
* <code>documentGranule.getAppContext()</code>: get the current application context for this document granule. In the first version of Crawler, this is a single global context, shared by the whole Crawler session, but in the future, Crawler might support multiple concurrent applications during a single conversion session, in which case the appContext can vary between document granules.<br />
<br />
* <code>documentGranule.getDocumentGranule()</code>: returns the 'owning document' for this granule. Because the DocumentGranule represents the document itself, this method is essentially a no-op: it returns the granule itself. This same method is available for all document-derived granules, and it is provided here for symmetry: given any document-derived granule, calling getDocumentGranule() will give you the 'owning document'.<br />
<br />
* <code>documentGranule.getFile()</code>: returns the associated file on disk (if any).<br />
<br />
* <code>documentGranule.isValid()</code>: verifies whether the granule is still valid. In some cases, processing affects the validity of the document granule; if the document granule refers to underlying document data that has become invalid, this method will reflect that. In that case the granule will need to be dropped, and a new granule constructed to 'wrap' the replacement document.</div>Krishttps://www.docdataflow.com/wiki/index.php/DocumentGranuleDocumentGranule2013-12-30T01:33:53Z<p>Kris: /* Developer Info */</p>
<hr />
<div>= General Info =<br />
<br />
A DocumentGranule represents a particular document.<br />
<br />
= Developer Info =<br />
[[DocumentGranule Code| DocumentGranule Documentation]]</div>Krishttps://www.docdataflow.com/wiki/index.php/ColorGranule_CodeColorGranule Code2013-12-30T01:30:20Z<p>Kris: </p>
<hr />
<div>The ColorGranule class is derived from the [[Granule Code|''Granule'']] class.<br />
<br />
It has a [[Class identifier|''class identifier'']] of <code>com.rorohiko.granule.color</code>.<br />
<br />
The following methods are provided:<br />
<br />
* <code>colorGranule.getCSS()</code>: fetch the CSS color hex values for this color. The colorGranule needs to take care of the 'whatever-to-RGB' conversion: some document formats support a whole range of color models or support tints. colorGranule.getCSS() provides an RGB 'best effort' value that converts to RGB, and applies any tint value if necessary.</div>Krishttps://www.docdataflow.com/wiki/index.php/ColorGranuleColorGranule2013-12-30T01:26:16Z<p>Kris: </p>
<hr />
<div>= General Info =<br />
<br />
A ColorGranule represents a particular color used by some application in its document types, e.g. an InDesign swatch would be carried by a ColorGranule.<br />
<br />
= Developer Info =<br />
[[ColorGranule Code| ColorGranule Documentation]]</div>Krishttps://www.docdataflow.com/wiki/index.php/AppGranule_CodeAppGranule Code2013-12-30T01:19:49Z<p>Kris: </p>
<hr />
<div>The AppGranule class is derived from the [[Granule Code|''Granule'']] class.<br />
<br />
It has a [[Class identifier|''class identifier'']] of <code>com.rorohiko.granule.app</code>.<br />
<br />
AppGranule is currently a singleton class: only one instance of AppGranule (or a subclass of AppGranule) is present during a Crawler session. Future versions of Crawler might support multi-app conversions, where the Crawler dataflow is hosted by multiple concurrent apps.<br />
<br />
The following static methods are provided:<br />
<br />
* <code>AppGranule.activeAppGranuleFactory()</code>: creates or retrieves the current AppGranule for the currently active app. <br />
<br />
The following methods are provided:<br />
<br />
* <code>appGranule.activeDocumentGranuleFactory()</code>: creates or retrieves the DocumentGranule for the currently active document.<br />
<br />
* <code>appGranule.adapterFactory(adapterParentClass)</code>: create a new app-specific adapter which is a subclass of the adapterParentClass. <br />
<br />
At present, this is used to ask the current appGranule for a disassembler. Because the Crawler system does not know what document types it is processing, it relies on the currently active appGranule to provide it with the correct document disassembler for document conversion.<br />
<br />
The predefined ViewExporter uses the equivalent of the following code to get hold of a disassembler:<br />
<br />
<pre><br />
...<br />
var appGranule = AppGranule.activeAppGranuleFactory();<br />
...<br />
var disassembler = appGranule.adapterFactory(Disassembler);<br />
...<br />
</pre><br />
<br />
The ViewExporter does not need to 'know' what app is currently active: it gets hold of the singleton appGranule, and then asks that appGranule to provide it with a proper disassembler to break the document apart into smaller granules.</div>Krishttps://www.docdataflow.com/wiki/index.php/AppGranuleAppGranule2013-12-30T01:06:58Z<p>Kris: </p>
<hr />
<div>= General Info =<br />
<br />
An AppGranule represents a particular application - e.g. Adobe InDesign, Illustrator, QuarkXPress...<br />
<br />
AppGranules normally don't roam the data flow. Instead, they represent the application that's hosting the current Crawler conversion. <br />
<br />
The AppGranule provides an easy 'repository' for application-specific data. <br />
<br />
Another function of the AppGranule is to serve as the 'parent granule' of all 'top level' granules.<br />
<br />
Normally, there is only a single AppGranule for the current Crawler session.<br />
<br />
E.g. when running an InDesign-to-something conversion, the 'host' application will be InDesign, and there will be a single InDesignAppGranule available.<br />
<br />
= Developer Info =<br />
[[AppGranule Code|AppGranule Documentation]]</div>Krishttps://www.docdataflow.com/wiki/index.php/ScriptedScripted2013-12-29T23:39:37Z<p>Kris: </p>
<hr />
<div>A scripted adapter is an [[Atomic adapter|atomic adapter]].<br />
<br />
It does not have any particular kind of predetermined behavior: it's up to the script developer to define it through scripting.<br />
<br />
In most cases, it'll be probably similar to a [[Filter|''filter'']] or a [[Processor|''processor'']].</div>Krishttps://www.docdataflow.com/wiki/index.php/SplitterSplitter2013-12-29T23:32:45Z<p>Kris: Created page with "A splitter is a ''composite adapter''. It manages a group of two or more 'sub-adapters'. When it receives a granule through its input connection, it w..."</p>
<hr />
<div>A splitter is a [[Composite Adapter|''composite adapter'']]. <br />
<br />
It manages a group of two or more 'sub-adapters'.<br />
<br />
When it receives a granule through its input connection, it will send the granule to ''all'' of the sub-adapters that are willing to [[Granule Acceptance|accept]] it. <br />
<br />
It is used to create multiple parallel data flows in the data flow network.<br />
<br />
The output connection of the splitter is not used. If none of the sub-adapters is prepared to accept a particular granule, then the splitter acts as a sink, and the granule disappears from the data flow.<br />
<br />
[[File:splitter.png|800px]]</div>Krishttps://www.docdataflow.com/wiki/index.php/Granule_AcceptanceGranule Acceptance2013-12-29T22:33:19Z<p>Kris: </p>
<hr />
<div>= Three Main Criteria =<br />
<br />
An important mechanism in Crawler is the idea of 'granule acceptance' by [[Adapter|''adapters'']]. <br />
<br />
When a [[Granule|''granule'']] is presented to any [[Adapter|''adapter'']] for processing, the adapter can accept or reject the granule based on a number of criteria.<br />
<br />
Some of these criteria are part of the default infrastructure of Crawler, and are provided automatically, by default. <br />
<br />
These automatic criteria can always be overruled by specific types of adapter or adapter network. <br />
<br />
These default criteria are only provided for convenience, and they will do 'the right thing' in most cases. <br />
<br />
They can be adjusted for the more uncommon cases where the acceptance criteria need to be different.<br />
<br />
== Visit Counting ==<br />
<br />
The first default criterion: by default, granules are not accepted twice by the same adapter; only one 'visit' is allowed. This behavior can be overridden.<br />
<br />
In some of the more complex personalities, you might see 'adapter loops': networks of adapters where the output of an adapter further down the data flow feeds back into the input of an adapter earlier in the data flow. <br />
<br />
These loops will often rely on the 'don't accept twice' mechanism to avoid getting caught into endless loops.<br />
<br />
A practical example: below is a schematic representation of the adapter network used for document conversion in Crawler:<br />
<br />
[[File:Sampleexporter.png|800px]]<br />
<br />
Note that the ViewAssembler sits at the core of a number of 'adapter loops'. <br />
<br />
The 'ViewAssembler' and [[Selector|'Selector']] adapters in this network are exceptions. They have been modified to allow unlimited visits, so they both allow granules to 'pass through' more than once. <br />
<br />
On the other hand, the individual [[Processor|''Processor'']] sub-adapters of the [[Selector|''Selector'']] ''do'' use the default visit counting: they only allow one visit by any granule. <br />
<br />
That means that once a granule goes round one of the loops, it'll go back through the ViewAssembler, then the [[Selector|''Selector'']]. <br />
<br />
The [[Selector|''Selector'']] will not send the granule back to the same [[Processor|''Processor'']] adapter because that processor will reject it: it has 'seen' that granule before, and it only allows one visit. <br />
<br />
As a result, the Selector will work its way down its list of options, and every time round it will pick the next eligible adapter. <br />
<br />
If there aren't any more, it'll pick the Output adapter.<br />
<br />
In this example, the visit counting is used to set up a mechanism where granules go round and round the network, but take a different path every time.<br />
<br />
Each and every granule which ever roams the data flow network gets assigned a unique identifier when it is created. <br />
<br />
Once created, a granule never changes: all that can happen to it is that it can be dropped from the data flow, and/or replaced by one or more new granules with different identifiers.<br />
<br />
Through this unique identifier, adapters are able to track how many times they've seen a particular granule.<br />
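This visit counting can be sketched as follows (a minimal model with hypothetical adapter shapes; the real mechanism is part of each adapter's acceptance check):<br />

```javascript
// Minimal sketch of default visit counting. Because granule ids are
// unique and granules never change, a simple id -> count map is enough
// to enforce the default "only one visit" rule.
function VisitCountingAdapter(maxVisits) {
    this.maxVisits = maxVisits || 1;   // default: accept each granule once
    this.visits = {};                  // granule id -> number of visits
}

VisitCountingAdapter.prototype.accepts = function (granule) {
    var count = this.visits[granule.id] || 0;
    if (count >= this.maxVisits) {
        return false;                  // seen too often: reject the granule
    }
    this.visits[granule.id] = count + 1;
    return true;
};
```

An adapter constructed with ''VisitCountingAdapter(Infinity)'' behaves like the ViewAssembler and Selector described above: it allows unlimited visits.<br />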
<br />
== Granule Type Acceptance ==<br />
<br />
A second default criterion is the granule type. <br />
<br />
Every adapter can be configured to only accept particular granule types. <br />
<br />
Many adapters will not use this mechanism, and simply accept all granule types. For example, most [[Selector|''selectors'']] will accept any granule type. <br />
<br />
But the 'sub-adapters' of such a selector will often use the granule type to accept or reject a certain granule, and hence help the [[Selector|''Selector'']] to decide what sub-adapter the granule should be sent to.<br />
<br />
== Programmatic Acceptance ==<br />
<br />
The third criterion is programmatically defined. When a software developer creates an adapter, they can opt to implement a special method ''canProcessGranule'', which either returns true or false. <br />
<br />
The default implementation of ''canProcessGranule'' implements the ''visit count'' and ''granule type'' granule acceptance mechanism.<br />
<br />
A customized adapter can either enhance or re-implement this method, and use various other criteria to accept or reject a granule.<br />
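As an illustration, a customized check that only accepts paragraph granules above a minimum length might look like this (all names and shapes here are sketched assumptions, not the real Crawler API):<br />

```javascript
// Sketched default criteria: one visit per granule, and an accepted
// granule class (hypothetical adapter shape).
function defaultCanProcessGranule(adapter, granule) {
    var count = adapter.visits[granule.id] || 0;
    return count < 1 && adapter.acceptedClass === granule.granuleClass;
}

// A customized adapter enhances the check: the default criteria still
// apply, plus a minimum paragraph length on top.
function canProcessGranule(adapter, granule) {
    if (!defaultCanProcessGranule(adapter, granule)) {
        return false;
    }
    return granule.text.length >= adapter.minLength;
}
```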
<br />
For example, some adapter could be made to accept only paragraph granules that have a certain minimum length. Or an adapter could be made to only accept InDesign text frame granules that have a background color, and so on...</div>Krishttps://www.docdataflow.com/wiki/index.php/SelectorSelector2013-12-29T21:59:23Z<p>Kris: </p>
<hr />
<div>A selector is a [[Composite Adapter|''composite adapter'']]. <br />
<br />
It manages a group of two or more 'sub-adapters'.<br />
<br />
When it receives a granule through its input connection, it will use the [[Granule Acceptance|''granule acceptance'']] mechanism to decide which target adapter is the most appropriate, and it will send the granule to one (and only one) of its sub-adapters.<br />
<br />
The output connection of the selector serves as the 'default option': if none of the sub-adapters is prepared to accept a particular granule, then the granule is routed through the output connection of the selector.<br />
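Sketched as code (the adapter shapes here are hypothetical, not the real Crawler API), the routing works like this:<br />

```javascript
// Sketch of selector routing: the first sub-adapter willing to accept
// the granule receives it; if none accepts, the granule leaves through
// the selector's own output connection, the 'default option'.
function routeThroughSelector(subAdapters, granule, defaultOutput) {
    for (var i = 0; i < subAdapters.length; i++) {
        if (subAdapters[i].accepts(granule)) {
            subAdapters[i].receive(granule);   // one, and only one, target
            return subAdapters[i];
        }
    }
    defaultOutput.receive(granule);
    return defaultOutput;
}
```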
<br />
[[File:selector.png|800px]]<br />
<br />
The sub-adapters are in a sequential list; the position of a sub-adapter in the list is important. <br />
<br />
The sub-adapters are tried out in sequence: the first adapter willing to accept a particular granule will be the one selected. Any subsequent adapters in the selector's sub-adapter list don't even get to 'see' the granule once it's routed to the selected sub-adapter.</div>Krishttps://www.docdataflow.com/wiki/index.php/ProcessorProcessor2013-12-29T19:50:18Z<p>Kris: </p>
<hr />
<div>A processor is an [[Atomic adapter|atomic adapter]] which can substitute certain granules with different granules. <br />
<br />
Inside the processor, there is some programming logic which will select particular types of input [[Granule|''granule'']], and returns a different granule instead.<br />
<br />
The original [[Granule|''granules'']] are normally dropped from the data flow. They are substituted by the newly created 'processed' granules. <br />
<br />
In the Crawler system, once created, a granule is never modified. A granule is a 'constant' data entity. <br />
<br />
So, a processor can only ''substitute'' some of the input granules by newly created granules.<br />
<br />
An example: a processor could be set up to substitute any word granules with all uppercase word granules.<br />
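Sketched as code (hypothetical granule shapes; not the real Crawler API), that processor's substitution logic might look like:<br />

```javascript
// Sketch of a processor: word granules are substituted by newly created
// uppercase word granules; everything else passes through untouched.
// Granules are 'constant', so the original is never mutated.
function uppercaseWordProcessor(granule) {
    if (granule.granuleClass !== "text.word") {
        return granule;                // not selected: pass through as-is
    }
    return {                           // substitute a *new* granule
        granuleClass: "text.word",
        id: granule.id + ":upper",     // new granule, new identifier
        text: granule.text.toUpperCase()
    };
}
```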
<br />
Consider the following data flow which originated somewhere up-flow. This is the input to the example processor:<br />
<br />
<pre><br />
Word: This<br />
Word: is<br />
Word: a<br />
Word: paragraph<br />
Para: This is a paragraph<br />
Word: This<br />
Word: is<br />
Word: another<br />
Word: paragraph<br />
Para: This is another paragraph<br />
TextFrame: pos (10, 20), width 20, height 80<br />
</pre><br />
<br />
The output could look like this:<br />
<br />
<pre><br />
Word: THIS<br />
Word: IS<br />
Word: A<br />
Word: PARAGRAPH<br />
Para: This is a paragraph<br />
Word: THIS<br />
Word: IS<br />
Word: ANOTHER<br />
Word: PARAGRAPH<br />
Para: This is another paragraph<br />
TextFrame: pos (10, 20), width 20, height 80<br />
</pre></div>Krishttps://www.docdataflow.com/wiki/index.php/OutputOutput2013-12-29T19:43:11Z<p>Kris: Created page with "An output is an atomic adapter which will emit granules in some form to some external output. This could be a file, a string or some other form of external ..."</p>
<hr />
<div>An output is an [[Atomic adapter|atomic adapter]] which will emit granules in some form to some external output. This could be a file, a string, or some other form of external output that can be processed by other programming logic. <br />
<br />
An output is normally a 'sink': it accepts data through its input connection, but emits no data through its output connection.<br />
<br />
Any [[Granule|''granules'']] that come in are dropped from the data flow.</div>Krishttps://www.docdataflow.com/wiki/index.php/FilterFilter2013-12-28T04:20:58Z<p>Kris: </p>
<hr />
<div>A filter is an [[Atomic adapter|atomic adapter]] which can selectively strip [[Granule|''granules'']] from the data flow.<br />
<br />
Inside the filter, there is some programming logic which checks every input [[Granule|''granule'']], and returns a pass/fail answer.<br />
<br />
Any [[Granule|''granules'']] that fail the test are dropped from the data flow.<br />
<br />
An example: a filter could be set up to drop any word granule whose word starts with a lower-case letter.<br />
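Sketched as code (hypothetical granule shapes; not the real Crawler API), that filter's pass/fail test might look like:<br />

```javascript
// Sketch of a filter's pass/fail logic: granules failing the test are
// dropped from the data flow; all others pass through unchanged.
function startsWithLowerCase(text) {
    var first = text.charAt(0);
    // Lower-case means: lowering it is a no-op, but uppering changes it
    // (this also leaves non-letter characters alone).
    return first === first.toLowerCase() && first !== first.toUpperCase();
}

// Keep everything except word granules starting with a lower-case letter.
function passesFilter(granule) {
    if (granule.granuleClass !== "text.word") {
        return true;
    }
    return !startsWithLowerCase(granule.text);
}

function applyFilter(granules) {
    return granules.filter(passesFilter);
}
```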
<br />
Consider the following data flow which originated somewhere up-flow. This is the input to the example filter:<br />
<br />
<pre><br />
Word: This<br />
Word: is<br />
Word: a<br />
Word: paragraph<br />
Para: This is a paragraph<br />
Word: This<br />
Word: is<br />
Word: another<br />
Word: paragraph<br />
Para: This is another paragraph<br />
TextFrame: pos (10, 20), width 20, height 80<br />
</pre><br />
<br />
If a filter is set up to drop any words starting with a lower case letter, the output of the filter would become:<br />
<br />
<pre><br />
Word: This<br />
Para: This is a paragraph<br />
Word: This<br />
Para: This is another paragraph<br />
TextFrame: pos (10, 20), width 20, height 80<br />
</pre></div>Krishttps://www.docdataflow.com/wiki/index.php/Granule_OrderingGranule Ordering2013-12-28T04:14:28Z<p>Kris: Created page with "A ''granule ordering'' is a mechanism to order granules. When a ''disassembler'' takes a larger granule apart into smaller sub-granules, it can be asked to ..."</p>
<hr />
<div>A ''granule ordering'' is a mechanism to order granules. <br />
<br />
When a [[Disassembler|''disassembler'']] takes a larger granule apart into smaller sub-granules, it can be asked to re-order these sub-granules using a specific ''granule ordering''. If no ''granule ordering'' is defined, the sub-granules are output in their natural order. If a ''granule ordering'' is defined, they are re-ordered before they are injected into the data flow.</div>Krishttps://www.docdataflow.com/wiki/index.php/Class_identifierClass identifier2013-12-28T04:02:55Z<p>Kris: </p>
<hr />
<div>Crawler is built using object-oriented programming techniques. <br />
<br />
Many of the Crawler concepts map directly onto underlying object classes. <br />
<br />
For example, every [[Adapter|''adapter'']] type, every [[Granule|''granule'']] type, every [[Granule Ordering|''granule ordering'']]... is programmatically represented by an object class.<br />
<br />
For [[Granule|''granules'']] and [[Granule Ordering |''granule orderings'']], these underlying object classes are assigned a unique alphanumerical identifier, called the 'class identifier'. <br />
<br />
These class identifiers are 'reverse domain names': by taking an existing domain name (rorohiko.com), reversing it, and adding a number of period-separated strings to it, we can create identifiers that are almost certainly unique. Other companies using Crawler can define class identifiers that are guaranteed to differ from any other company's, as long as they stick to this 'reverse domain name' rule and take some care in choosing the identifiers.<br />
<br />
There are many granule types; some examples:<br />
* com.rorohiko.granule.indesign.color<br />
* com.rorohiko.granule.folder<br />
* com.rorohiko.granule.text.word<br />
* ...<br />
<br />
Another example. These are two predefined [[Granule Ordering |''granule ordering'']] class identifiers:<br />
* com.rorohiko.ordering.frame.vertical<br />
* com.rorohiko.ordering.frame.horizontal<br />
<br />
In Crawler's [[Configuration File|''config files'']], it is often necessary to refer to specific object classes. That's where these class identifiers come in: they are a convenient way to refer to the use of a particular [[Granule|''granule'']] type or [[Granule Ordering |''granule ordering'']] in a text based [[INI file]].<br />
<br />
There is also a short-hand notation. <br />
<br />
For granules, the prefix 'com.rorohiko.granule.' can be dropped, so in [[INI file]] entries where a granule type is expected, the entries<br />
<br />
<pre><br />
com.rorohiko.granule.text.word<br />
text.word<br />
</pre><br />
<br />
are equivalent.<br />
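The short-hand expansion can be sketched as follows (a hypothetical Python model, assuming only the 'com.rorohiko.granule.' prefix may be elided; Crawler's real resolution logic may be more involved):

```python
GRANULE_PREFIX = "com.rorohiko.granule."

def expand_granule_id(ident):
    """Expand a short-hand granule identifier to its full form.

    Identifiers that already carry the full reverse-domain prefix
    are returned unchanged.
    """
    if ident.startswith(GRANULE_PREFIX):
        return ident
    return GRANULE_PREFIX + ident

print(expand_granule_id("text.word"))
print(expand_granule_id("com.rorohiko.granule.text.word"))
# both print: com.rorohiko.granule.text.word
```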
<br />
For granule orderings, the prefix 'com.rorohiko.ordering.' can be dropped, so in [[INI file]] entries where a granule ordering is expected, the entries<br />
<br />
<pre><br />
com.rorohiko.ordering.frame.vertical<br />
frame.vertical<br />
</pre><br />
<br />
are equivalent.</div>Krishttps://www.docdataflow.com/wiki/index.php/GranuleGranule2013-12-28T03:37:11Z<p>Kris: </p>
<hr />
<div>= Overview =<br />
Granules are the 'chunks of data' that flow through the network of [[Adapter|''adapters'']] defined by the [[Personality|''personality'']].<br />
<br />
A granule can represent any quantity of data. It could represent anything from a single bit to a complete database with all its content, or an even larger clump of data. <br />
<br />
When granules flow through a network of [[Adapter|''adapters'']] they are often split into smaller granules by [[Disassembler|''disassemblers'']].<br />
<br />
Smaller granules are often collated into larger granules by [[Assembler|''assemblers'']].<br />
<br />
= Predefined Base Granule Types =<br />
<br />
It is impossible to predefine all possible kinds of granule types that could be handled by Crawler.<br />
<br />
As new document formats are added to the system, new granule types will need to be introduced to correctly capture the document data inside those as-yet-unsupported document types. <br />
<br />
This is accepted and expected in the Crawler system: document-type specific [[Disassembler|''disassemblers'']] are allowed to add new granule types to the system.<br />
<br />
When adding new granule types, care must be taken to relate the new granule types back to one of the predefined base granules whenever possible.<br />
<br />
So, if a document type XYZ has a concept of a 'paragraph', its document-type support might introduce a new granule type 'XYZ_ParagraphGranule'. This XYZ_ParagraphGranule should then be a more specialized version of the predefined [[ParagraphGranule]]. In other words, XYZ_ParagraphGranule should have all the features of [[ParagraphGranule]], plus some XYZ-specific features.<br />
<br />
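The specialization relationship can be sketched as a class hierarchy (a hypothetical Python model; the granule type, field names, and XYZ format are invented for illustration):

```python
class ParagraphGranule:
    """Simplified stand-in for the predefined base granule type."""
    def __init__(self, text):
        self.text = text

class XYZParagraphGranule(ParagraphGranule):
    """Hypothetical XYZ-specific paragraph granule.

    It has all the features of ParagraphGranule, plus an invented
    XYZ-specific attribute.
    """
    def __init__(self, text, xyz_style_hint):
        super().__init__(text)
        self.xyz_style_hint = xyz_style_hint

p = XYZParagraphGranule("Hello", xyz_style_hint="body-text")
# Adapters that understand the base ParagraphGranule can still handle it:
print(isinstance(p, ParagraphGranule))  # True
```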
Some of the base granule types below will not make sense for a given document type; in that case, they should simply be ignored. <br />
<br />
[[AppGranule]]<br />
<br />
[[ColorGranule]]<br />
<br />
[[DocumentGranule]]<br />
<br />
[[FontGranule]]<br />
<br />
[[FrameGranule]]<br />
<br />
[[PageGranule]]<br />
<br />
[[SpreadGranule]]<br />
<br />
[[StyleGranule]]<br />
<br />
= Specialized Base Granule Types =<br />
<br />
== FrameGranule ==<br />
<br />
[[GraphicFrameGranule]]<br />
<br />
[[TextFrameGranule]]<br />
<br />
== StyleGranule ==<br />
<br />
[[CharacterStyleGranule]]<br />
<br />
[[ParagraphStyleGranule]]<br />
<br />
== TextGranule ==<br />
<br />
[[ParagraphGranule]]<br />
<br />
[[StoryGranule]]<br />
<br />
[[TextRunGranule]]<br />
<br />
[[WordGranule]]<br />
<br />
== Developer Info ==<br />
<br />
[[Granule Code|Granule Documentation]]</div>Krishttps://www.docdataflow.com/wiki/index.php/ExporterExporter2013-12-28T03:23:08Z<p>Kris: /* View Exporter */</p>
<hr />
<div>An exporter is an [[Atomic adapter|''atomic adapter'']]. <br />
<br />
A [[Personality|''personality'']] in the Crawler system can be built around an exporter. It is not a requirement to do so, but using an exporter can make things easier by providing some useful pre-made functionality.<br />
<br />
= Parent-child personalities =<br />
<br />
The exporter is a high-level [[Adapter|''adapter'']] which coordinates parent-child relations between [[Personality|''personalities'']]. <br />
<br />
It is possible to create a new personality by tweaking an existing personality. This is achieved by overriding certain settings or adding to the existing settings in the configuration files. <br />
<br />
This 'personality-inheritance' mechanism is handled by the exporter adapter. For example, if the default XHTML/CSS personality almost fits the bill, it's possible to derive a new personality from it, and enhance or change the way the derived personality behaves.<br />
<br />
= Debug Monitors =<br />
<br />
A second function of the exporter is to manage the log level and the injection of [[Debug Monitor|''debug monitors'']] into the adapter network. The exporter reads the top-level configuration file to determine which [[Named Adapter|''named adapters'']] need to be monitored, and it will dynamically add the necessary debug monitors to the adapter network.<br />
<br />
This sample top-level config file has a sample [debug] section which enables a debug monitor on the input to the "inputSplitter" [[Named Adapter|''named adapter'']] in the network.<br />
<br />
<pre><br />
[conditionals]<br />
<br />
selectors = xhtml<br />
<br />
[main]<br />
<br />
personalityConfig?xhtml = "./Personalities/XHTML/config.ini"<br />
personalityConfig?text = "./Personalities/Text/config.ini"<br />
personalityConfig?hyperlinks = "./Personalities/Hyperlinks/config.ini"<br />
<br />
# ********************************************************************************<br />
<br />
[debug]<br />
<br />
# <br />
# Turn on/off logging and target specific adapters by name<br />
#<br />
<br />
debugMonitoring = true<br />
monitorAdapters = inputSplitter<br />
<br />
#<br />
# Logging.LOG_ERROR = 1;<br />
# Logging.LOG_WARNING = 2;<br />
# Logging.LOG_NOTE = 3;<br />
# Logging.LOG_DEBUG = 4;<br />
# Logging.LOG_TRACE = 5;<br />
#<br />
logLevel = 5<br />
<br />
</pre><br />
<br />
= View Exporter =<br />
<br />
In many cases, personalities are built around a [[ViewExporter|''view exporter'']], rather than an exporter. A view exporter is a specialized exporter which supports multiple [[View|''views'']]. <br />
<br />
Views are alternate data flows. <br />
<br />
For example, an InDesign-to-XHTML/CSS conversion will use two separate views: an XHTML view and a CSS view.<br />
<br />
The document will first be disassembled by a set of disassemblers, and the resulting output data flow will be duplicated into two separate data flows by means of an [[Input Splitter|''input splitter'']]. <br />
<br />
One of the two data flows will be routed through the XHTML view, the other through the CSS view. The end result will be two separate data files: one produced by the XHTML view, the other by the CSS view. Both files will be based on the same input data, but they will go through separate construction mechanisms.</div>Krishttps://www.docdataflow.com/wiki/index.php/Debug_MonitorDebug Monitor2013-12-28T03:06:55Z<p>Kris: </p>
<hr />
<div>A debug monitor is an [[Atomic adapter|''atomic adapter'']]. <br />
<br />
It does not change the data flow: any input is passed through to the output unmodified.<br />
<br />
Its function is to monitor the data flow and write logging info to some log output: the console, a log file, or some other destination.<br />
<br />
Debug monitors can be switched on or off. During normal operations they are typically switched off, but when necessary they can be switched on to provide an 'inside look' at the flow of granules passing through the adapter network, and so help diagnose issues.<br />
<br />
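A pass-through monitor can be sketched in a few lines (a hypothetical Python model; Crawler's actual monitor and its log levels are configured through the top-level config file, not through this invented API):

```python
import logging

def debug_monitor(granules, name, enabled=True):
    """Pass every granule through unmodified, logging each one when enabled."""
    for granule in granules:
        if enabled:
            logging.debug("%s: %r", name, granule)
        yield granule  # the data flow itself is never changed

# attach a monitor to a (hypothetical) adapter input named "inputSplitter"
monitored = debug_monitor([("Word", "This")], "inputSplitter")
print(list(monitored))
```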
In the Crawler system, there are provisions to easily attach a debug monitor to any [[Named adapter|''named adapter'']] in the adapter network by means of the [[Top-level config|''top-level config file'']].</div>Krishttps://www.docdataflow.com/wiki/index.php/DisassemblerDisassembler2013-12-28T02:49:09Z<p>Kris: </p>
<hr />
<div>A disassembler is an [[Atomic adapter|''atomic adapter'']].<br />
<br />
Disassemblers accept granules via their input connection. <br />
<br />
A disassembler will normally pass through ''all'' granules it receives. <br />
<br />
It will also break some of the input granules down into smaller granules, and it will 'inject' these additional granules into the granule stream. <br />
<br />
The smaller granules are injected ''before'' the larger input granule from which they have been extracted. <br />
<br />
Granules that are of no interest to the disassembler are normally passed through unmodified.<br />
<br />
For example, when a disassembler breaks apart a 'paragraph' granule into a series of 'word' granules, the output of the disassembler will typically consist of a stream of word granules, followed by the original paragraph granule from which the word granules were extracted.<br />
<br />
An [[Assembler|''assembler'']] further down the data flow will often ignore the contents of such paragraph granules. Instead it will collect the word granules, and use the paragraph granule solely as a terminating trigger signifying that the series of word granules is complete.<br />
<br />
An example input with three granules could look like this:<br />
<br />
<pre><br />
Para: this is a paragraph<br />
Para: this is another paragraph<br />
TextFrame: pos (10, 20), width 20, height 80<br />
</pre><br />
<br />
A paragraph disassembler might convert this input into the following output:<br />
<br />
<pre><br />
Word: this<br />
Word: is<br />
Word: a<br />
Word: paragraph<br />
Para: this is a paragraph<br />
Word: this<br />
Word: is<br />
Word: another<br />
Word: paragraph<br />
Para: this is another paragraph<br />
TextFrame: pos (10, 20), width 20, height 80<br />
</pre><br />
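The transformation above can be sketched as a Python generator (a hypothetical illustration — the tuple representation of granules is invented; Crawler's real disassemblers work on actual document objects):

```python
def paragraph_disassembler(granules):
    """Break each Para granule into Word granules.

    The extracted Word granules are injected *before* the Para granule
    they came from; every input granule is passed through unmodified.
    """
    for gtype, payload in granules:
        if gtype == "Para":
            for word in payload.split():
                yield ("Word", word)
        yield (gtype, payload)  # the original granule is never dropped

flow = [("Para", "this is a paragraph"),
        ("TextFrame", "pos (10, 20), width 20, height 80")]
for granule in paragraph_disassembler(flow):
    print(granule)
```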
<br />
In other words: under normal circumstances, a disassembler will only ''add'' to the data flow. It won't take granules away.</div>Krishttps://www.docdataflow.com/wiki/index.php/AssemblerAssembler2013-12-28T02:35:30Z<p>Kris: </p>
<hr />
<div>An assembler is an [[Atomic adapter|''atomic adapter'']]. <br />
<br />
Assemblers accept granules via their input connection. <br />
<br />
They then use some of these input granules to construct collated granules. <br />
<br />
Typically, assemblers will rely on the presence of certain 'trigger granules' in the input stream, to help them decide when they have all the necessary data needed to finish a constructed granule. <br />
<br />
When the constructed granule is ready, it is released via the assembler's output connection.<br />
<br />
Assemblers will often drop the smaller granules they used from the data flow, and only emit the newly constructed granules.<br />
<br />
For example, an assembler could be collecting 'word granules', and string these 'word granules' together into some new 'word group' granule. <br />
<br />
As granules stream in, the assembler needs to know when the 'word group' under construction is complete. The presence in the input stream of some other type of granule (e.g. a 'text frame' granule or a 'paragraph' granule) will typically be the trigger to release the newly constructed 'word group' granule and get ready to construct the next one.<br />
<br />
In a typical Crawler workflow, a [[Disassembler|''disassembler'']] will normally only add to the data flow. It won't take granules away. In other words: the larger granules that are broken apart by [[Disassembler|''disassemblers'']] are not stripped away and remain part of the data flow. <br />
<br />
Assemblers, on the other hand, do take granules away.<br />
<br />
For example, when a [[Disassembler|''disassembler'']] breaks apart a 'paragraph' granule into a series of 'word' granules, the output of the disassembler will typically consist of a stream of word granules, followed by the original paragraph granule from which the word granules were extracted. <br />
<br />
An assembler further down the data flow will ignore the content of such paragraph granules. Instead it will collect the word granules, and wait for the paragraph granule as a terminating trigger to signify the series of word granules is complete.<br />
<br />
An example: consider the following data flow emitted by a [[Disassembler|''disassembler'']] further up the data flow:<br />
<br />
<pre><br />
Word: this<br />
Word: is<br />
Word: a<br />
Word: paragraph<br />
Para: this is a paragraph<br />
Word: this<br />
Word: is<br />
Word: another<br />
Word: paragraph<br />
Para: this is another paragraph<br />
TextFrame: pos (10, 20), width 20, height 80<br />
</pre><br />
<br />
An assembler might be set up to count the word granules, and emit a word count granule each time it is triggered by a paragraph granule. <br />
<br />
This example assembler might convert the data flow into the following:<br />
<br />
<pre><br />
WordCount: 4<br />
Para: this is a paragraph<br />
WordCount: 5<br />
Para: this is another paragraph<br />
TextFrame: pos (10, 20), width 20, height 80<br />
</pre><br />
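The word-count assembler above can be sketched as a Python generator (a hypothetical illustration with an invented tuple representation of granules):

```python
def word_count_assembler(granules):
    """Count Word granules; emit a WordCount granule at each Para trigger.

    The Word granules themselves are dropped from the data flow; the
    trigger granules (Para, TextFrame, ...) are passed through.
    """
    count = 0
    for gtype, payload in granules:
        if gtype == "Word":
            count += 1          # absorb the word granule
            continue
        if gtype == "Para":
            yield ("WordCount", count)  # release the constructed granule
            count = 0                   # get ready for the next paragraph
        yield (gtype, payload)

flow = [("Word", "this"), ("Word", "is"), ("Word", "a"), ("Word", "paragraph"),
        ("Para", "this is a paragraph"),
        ("TextFrame", "pos (10, 20), width 20, height 80")]
for granule in word_count_assembler(flow):
    print(granule)
```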
<br />
i.e. it has dropped the word granules, and emits a new 'word count' granule each time it 'sees' a paragraph granule pass by.</div>Krishttps://www.docdataflow.com/wiki/index.php/Atomic_adapterAtomic adapter2013-12-28T02:31:34Z<p>Kris: </p>
<hr />
<div>An atomic [[Adapter|''adapter'']] is an adapter which is not constructed from smaller adapters. Example atomic adapters are the [[Assembler|''assembler'']], the [[Filter|''filter'']], the [[Debug Monitor|''debug monitor'']], the [[Output|''output'']], the [[Processor|''processor'']]...</div>Krishttps://www.docdataflow.com/wiki/index.php/Composite_AdapterComposite Adapter2013-12-28T02:27:41Z<p>Kris: </p>
<hr />
<div>A composite [[Adapter|''adapter'']] is an adapter that is constructed from smaller sub-adapters (themselves either composite or atomic).<br />
<br />
Inside a composite adapter, the sub-adapters can be arranged in different ways. For example, they could be chained together in an [[Adapter Chain|''adapter chain'']] or they could all be presented as mutually exclusive selections in a [[Selector|''selector'']].</div>Krishttps://www.docdataflow.com/wiki/index.php/Adapter_ChainAdapter Chain2013-12-28T02:15:08Z<p>Kris: </p>
<hr />
<div>An adapter chain is a [[Composite Adapter|''composite adapter'']] which consists of a chained sequence of 'sub-adapters'. It offers a way to treat a sequence of adapters as a single new 'super-adapter'.<br />
<br />
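Treating a chained sequence of adapters as a single 'super-adapter' can be sketched as function composition (a hypothetical Python model; the toy adapters here are invented for illustration):

```python
def chain(*adapters):
    """Compose adapters into one: each adapter's output feeds the next.

    Adapters are modelled as functions that take an iterable of granules
    and return an iterable of granules.
    """
    def chained(granules):
        for adapter in adapters:
            granules = adapter(granules)
        return granules
    return chained

# two toy adapters, purely for illustration
upper = lambda gs: (g.upper() for g in gs)
exclaim = lambda gs: (g + "!" for g in gs)

super_adapter = chain(upper, exclaim)
print(list(super_adapter(["word"])))
# ['WORD!']
```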
[[File:Adapterchain.png|800px]]<br />
<br />
The input to the adapter chain is passed through to the head adapter of the chain. The output coming out of the tail adapter is passed through as the output of the adapter chain.</div>Krishttps://www.docdataflow.com/wiki/index.php/ContextContext2013-12-28T01:16:13Z<p>Kris: /* Context hierarchies */</p>
<hr />
<div>=Context contents=<br />
<br />
A context is a collection of data that is relevant to a particular [[Granule|''granule'']]. <br />
<br />
For example, when a text frame in some document is represented by a [[Granule|''granule'']] for processing inside a Crawler personality, it is accompanied by its context. <br />
<br />
That context will include information like: <br />
* what page is the text frame on? <br />
* what is the text frame position on that page? <br />
* what document is that page in?<br />
* ...<br />
<br />
The data relating to the text frame is split into two parts: <br />
* the granule itself, with its own raw data ‘inside’<br />
* the context, which stores any additional information about the granule and its surroundings. <br />
<br />
The context contains all the other data that is not part of the granule, but is relevant to it.<br />
<br />
Once created, granules remain fixed and the data in them does not change. They often directly reflect properties and information extracted from the source document, and these remain constant.<br />
<br />
Contexts, on the other hand, are not fixed: as granules flow through various adapters, their context can accumulate additional data. It's normal for a granule to start out with an almost empty context. As it progresses through the various adapters, the context will collect more and more data, until the granule is either output or absorbed into a larger granule.<br />
<br />
In a Crawler workflow, [[Adapter|''adapters'']] and [[Granule|''granules'']] are fixed, constant entities: once created they don't change. Any changes that accumulate during the process are tracked in a context.<br />
<br />
=Context hierarchies=<br />
<br />
Contexts are arranged into a hierarchy.<br />
<br />
[[File:Context.png|800px]]<br />
<br />
Example: when we look at a 'text frame' granule, it will probably be a sub-granule of a larger ‘page’ granule. <br />
<br />
The 'page' granule itself is a sub-granule of a larger ‘document’ granule. <br />
<br />
Each of those granules will have its own context. There will be a context for the document granule, and another context for the page granule. <br />
<br />
The page context will be a subcontext of the document context: i.e. the page context will include all info from the document context, plus its own specific data.<br />
<br />
The text frame context will be a subcontext of the page context: i.e. the text frame context will include all info from the page context, plus its own specific data.<br />
<br />
The various adapters in a workflow will often pass information to one another by means of the context.<br />
<br />
During the Crawler process, we'll often refer to certain information by name. For example, when processing a [[Template Snippet|''template snippet'']] the template text contains placeholders, like $$XPOS$$. <br />
<br />
Such placeholders are interpreted within the relevant context. A single snippet will normally be used to process many individual granules; each of the granules will come with its own context, and placeholders like $$XPOS$$ will be replaced by different values every time, depending on what the context dictates for the value of XPOS.<br />
<br />
If a certain placeholder is not defined within a particular context, Crawler will check the parent context, and the parent's parent and so on.<br />
<br />
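This parent-chain lookup can be sketched as follows (a hypothetical Python model; the class name, fields, and example placeholder values are invented):

```python
class Context:
    """Sketch of hierarchical placeholder lookup."""
    def __init__(self, data, parent=None):
        self.data = data
        self.parent = parent

    def lookup(self, name):
        """Check this context, then the parent, the parent's parent, ..."""
        ctx = self
        while ctx is not None:
            if name in ctx.data:
                return ctx.data[name]
            ctx = ctx.parent
        return None  # not defined anywhere in the chain

doc_ctx = Context({"DOCNAME": "MyDocument.indd"})
page_ctx = Context({"PAGE": 3}, parent=doc_ctx)
frame_ctx = Context({"XPOS": 10}, parent=page_ctx)
print(frame_ctx.lookup("XPOS"), frame_ctx.lookup("DOCNAME"))
# 10 MyDocument.indd
```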
There is a top-level context, the [[App Context|''app context'']]. This is a 'root context' which serves as the ultimate parent to all contexts that exist during the process. This [[App Context|''app context'']] stores system-wide information that is to be shared by all contexts.</div>Krishttps://www.docdataflow.com/wiki/index.php/ComponentsComponents2013-12-28T00:28:30Z<p>Kris: </p>
<hr />
<div>= Base Components =<br />
<br />
[[Adapter]]<br />
<br />
[[Granule]]<br />
<br />
[[Context]]<br />
<br />
[[Granule Ordering|Granule ordering]]<br />
<br />
[[Class identifier]]<br />
<br />
= Mechanisms =<br />
<br />
[[Granule Acceptance]]<br />
<br />
= Complex Components =<br />
<br />
[[ViewExporter]]</div>Krishttps://www.docdataflow.com/wiki/index.php/Pre-built_PersonalitiesPre-built Personalities2013-12-28T00:27:11Z<p>Kris: </p>
<hr />
<div>[[Markdown]]: Crawler.ID2MD<br />
<br />
[[EPUB]]: Crawler.ID2EPUB<br />
<br />
[[Fixed Layout EPUB]]: Crawler.ID2FL<br />
<br />
[[XHTML]]: Crawler.ID2XHTML<br />
<br />
[[Reporting]]: Crawler.ID2Report<br />
<br />
[[Document Statistics]]: Crawler.ID2Stats</div>Krishttps://www.docdataflow.com/wiki/index.php/INI_fileINI file2013-12-27T06:11:50Z<p>Kris: /* Conditional entries */</p>
<hr />
<div>Crawler's INI files are based on a loosely defined de-facto standard; more info can be found [http://en.wikipedia.org/wiki/INI_file ''here''].<br />
<br />
== Basic properties ==<br />
<br />
The Crawler INI files have the following properties:<br />
* Section and entry names are case-insensitive by default (but Crawler has built-in support for case-sensitive INI files should the need arise).<br />
* Comment lines are supported. Prefixing a line with a '#' or a ';' makes it a comment line. In-line comments are not supported: a line is either entirely a comment or not a comment at all. For example:<br />
<pre><br />
# This is a comment line<br />
entry = test # test<br />
</pre><br />
means to set ''entry'' to ''"test # test"''. The trailing # test is not seen as a comment.<br />
<br />
* Blank lines are allowed (and ignored)<br />
* If an INI file has entries that are not preceded by a section line, then those entries are assumed to be in a default section ''[main]''<br />
* Duplicate entry names are allowed, and provide an 'override' mechanism. If an entry appears twice, the second appearance will 'win'.<br />
* Entry values can be enclosed between double quotes (") in which case backslashes are used as an escape character as defined in JavaScript. If no double quotes are present, backslashes are not interpreted as escapes. When no double quotes are present, leading and trailing spaces are removed. The following entries are all equivalent:<br />
<pre><br />
data = my data<br />
data = "my data"<br />
data=my data<br />
data="my\x20data"<br />
data="my\u0020data"<br />
</pre><br />
<br />
== Enhancements ==<br />
<br />
Crawler INI files have a few Crawler-specific enhancements.<br />
<br />
=== Parent-child files===<br />
<br />
In a number of Crawler personalities, INI files are arranged in a parent-child relationship. <br />
<br />
* It is possible to derive a new personality from an existing personality. This is achieved by adding a special section ''[parent]'' with a single entry ''path'' to the child INI. This entry has the path to the parent INI. The path can be absolute or relative. Relative paths are interpreted relative to the folder that contains the child INI. Forward slashes are allowed on Windows and are considered equivalent to backward slashes.<br />
<br />
<pre><br />
[parent]<br />
<br />
#<br />
# Parent personality: This personality is the same as XHTML but with added/overridden stuff<br />
#<br />
<br />
path = "../XHTML/config.ini"<br />
</pre><br />
<br />
* Some personalities use a nested folder structure where INI files in the 'inner' folders implicitly use the INI files in the outer folders as parent files. When two INI files have a parent-child relation, the child file 'inherits' all the contents of the parent INI file. The child INI can then either<br />
** override certain entries in the parent INI (by repeating the same entry name and section name, and providing a different value)<br />
** perform string concatenation<br />
<br />
=== String concatenation===<br />
When an entry occurs multiple times in the same INI file or in a parent-child INI file arrangement, the '+=' operator allows comma-separated string concatenation.<br />
<pre><br />
dataEntry = "some data"<br />
...<br />
dataEntry += "some more data, some more more data"<br />
</pre><br />
<br />
This will set the entry ''dataEntry'' to "some data, some more data, some more more data". Commas are inserted between the concatenated values.<br />
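The '+=' behavior can be sketched like this (a hypothetical Python model of the entry table; function and parameter names are invented):

```python
def apply_entry(entries, name, value, concat=False):
    """Apply one INI line: '=' overrides, '+=' concatenates with ', '."""
    if concat and name in entries:
        entries[name] = entries[name] + ", " + value
    else:
        entries[name] = value

entries = {}
apply_entry(entries, "dataEntry", "some data")
apply_entry(entries, "dataEntry", "some more data, some more more data",
            concat=True)
print(entries["dataEntry"])
# some data, some more data, some more more data
```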
=== Auto-increment ===<br />
INI files are not well suited for managing tabular, repetitive data. <br />
<br />
The Crawler INI file format offers an enhancement to make repetitive (record-based) data entry easier. A line with just a ++ is interpreted as 'the first/next record follows'. The advantage is that it becomes easy to reorder complete data records in the INI file without needing to manually renumber individual entries.<br />
<br />
The two following sections are equivalent:<br />
<br />
<pre><br />
[tableData]<br />
<br />
name1=Kris<br />
hours1=120<br />
<br />
name2=John<br />
hours2=112<br />
extras2=12<br />
<br />
name3=Will<br />
hours3=99<br />
</pre><br />
<br />
<pre><br />
[tableData]<br />
<br />
++<br />
name=Kris<br />
hours=120<br />
<br />
++<br />
name=John<br />
hours=112<br />
extras=12<br />
<br />
++<br />
name=Will<br />
hours=99<br />
</pre><br />
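The '++' auto-increment can be sketched as a preprocessing step that rewrites record markers into numbered entry names (a hypothetical Python model; Crawler's actual parser may work differently):

```python
def expand_records(lines):
    """Rewrite '++' record markers into numbered entry names.

    A line consisting of just '++' starts the first/next record; every
    entry that follows gets the current record number appended.
    """
    record = 0
    out = []
    for line in lines:
        if line.strip() == "++":
            record += 1
            continue
        name, _, value = line.partition("=")
        out.append("%s%d=%s" % (name.strip(), record, value.strip()))
    return out

print(expand_records(["++", "name=Kris", "hours=120", "++", "name=John"]))
# ['name1=Kris', 'hours1=120', 'name2=John']
```

Because the numbering is generated, whole records can be reordered in the INI file without any manual renumbering.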
<br />
=== Conditional entries ===<br />
<br />
Crawler-based INI files support conditional entries. This is done by means of a special, predefined section called '[conditionals]'.<br />
<br />
In the conditionals section, there is a single predefined entry ''selectors''. This entry is a list of comma-separated strings.<br />
<br />
Each of these strings is called a selector. Their presence or non-presence drives the conditional entries.<br />
<br />
Conditional entries have an entry name, followed by a question mark and a selector. These are only taken into account if the selector is present.<br />
<br />
<pre><br />
[conditionals]<br />
<br />
selectors = xhtml, flow<br />
<br />
[main]<br />
<br />
personalityConfig= "./Personalities/default.ini"<br />
personalityConfig?xhtml = "./Personalities/XHTML/config.ini"<br />
personalityConfig?text = "./Personalities/Text/config.ini"<br />
personalityConfig?hyperlinks = "./Personalities/Hyperlinks/config.ini"<br />
</pre><br />
<br />
This example sets the ''selectors'' entry to two separate selectors: ''xhtml'' and ''flow''. The ''personalityConfig'' entry will then be set to "./Personalities/XHTML/config.ini": the selector ''xhtml'' is present while ''text'' and ''hyperlinks'' are not, so the entry with the ''xhtml'' selector 'wins' and overrides the initial 'default' entry for ''personalityConfig''.<br />
<br />
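The selector resolution can be sketched as follows (a hypothetical Python model; the function name and the (name, value) pair representation are invented):

```python
def resolve_entry(lines, selectors):
    """Resolve a conditional INI entry: the last applicable line wins.

    `lines` are (name, value) pairs in file order; a name may carry a
    '?selector' suffix, and such entries are skipped when the selector
    is not active.
    """
    value = None
    for name, entry_value in lines:
        base, _, selector = name.partition("?")
        if selector and selector not in selectors:
            continue  # conditional entry, selector not active: ignore
        value = entry_value  # later occurrences override earlier ones
    return value

lines = [
    ("personalityConfig", "./Personalities/default.ini"),
    ("personalityConfig?xhtml", "./Personalities/XHTML/config.ini"),
    ("personalityConfig?text", "./Personalities/Text/config.ini"),
]
print(resolve_entry(lines, {"xhtml", "flow"}))
# ./Personalities/XHTML/config.ini
```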
=== Pre-defined Selectors ===<br />
<br />
There are a number of system-defined selectors. On Mac systems, the selector ''Mac'' is defined. On Windows systems, the selector ''Win'' is defined.<br />
<br />
This allows expressions like<br />
<br />
<pre><br />
...<br />
FILEPATH?Mac = ~/Desktop/output.txt<br />
FILEPATH?Win = C:\tmp\output.txt<br />
...<br />
</pre></div>Krishttps://www.docdataflow.com/wiki/index.php/Configuration_FileConfiguration File2013-12-27T05:44:47Z<p>Kris: </p>
<hr />
<div>Crawler's personalities are often configured by means of configuration files. <br />
<br />
There is no hard rule that dictates what format these configuration files should use. They could be text files, [[INI file|''INI files'']], XML-based files,...<br />
<br />
Because [[INI file|''INI files'']] are easy to understand for end-users, most of the pre-made personalities use INI-based configuration files.<br />
<br />
For more popular personalities, GUI-driven configuration tools might be provided, but in many cases, the GUI-development will lag behind on the functionality, in which case the next easiest method to (re)configure a Crawler personality is to edit one or more configuration files.</div>Krishttps://www.docdataflow.com/wiki/index.php/PersonalityPersonality2013-12-27T05:04:50Z<p>Kris: </p>
<hr />
<div>This is a high-level concept in a Crawler-based system.<br />
<br />
A ''personality'' will take input data in some shape or form, and will process it into output data in some other form.<br />
<br />
Crawler personalities are designed as 'add-ons' to the basic Crawler system.<br />
<br />
Personalities are made up out of simpler elements. <br />
<br />
A personality is composed of:<br />
* a workflow network of interconnected processing units called [[Adapter|''adapters'']]<br />
* a set of [[Configuration File|''configuration files'']]<br />
* a set of [[Template File|''template files'']]<br />
* a set of [[Formula File|''formula files'']]</div>Krishttps://www.docdataflow.com/wiki/index.php/PhilosophyPhilosophy2013-12-27T04:58:53Z<p>Kris: </p>
<hr />
<div>Crawler is an attempt to find a good balance between complexity and flexibility.<br />
<br />
Crawler is meant to be 'pokeable': it's OK to poke around in the configuration files. Often, all that is needed to get the desired results is some educated 'poking around'.<br />
<br />
The aim of the system's design is to allow a non-experienced user to confidently reconfigure a Crawler [[Personality|''personality'']] without needing to digest massive amounts of documentation beforehand.<br />
<br />
Crawler's [[Configuration File|''configuration files'']], [[Template File|''templates'']] and [[Formula File|''formula files'']] are all simple text files. <br />
<br />
Users are encouraged to open them and make small modifications to them, and should, in most cases, get the expected results.<br />
<br />
It is OK to make changes to a configuration file without fully understanding it. Most often, Crawler will behave as expected and produce the expected results.<br />
<br />
Users that desire total control or desire to build a custom personality from scratch cannot avoid having to study the Crawler documentation, and spend substantial time trying and experimenting. But unlike many other complex systems, many basic reconfigurations can be performed without much need for a manual.</div>Krishttps://www.docdataflow.com/wiki/index.php/AdapterAdapter2013-12-26T20:42:36Z<p>Kris: </p>
<hr />
<div>An adapter is a processing unit which (in most cases) has an input connection and an output connection. <br />
<br />
Data flows through an adapter in the shape of [[Granule|''granules'']].<br />
<br />
The adapter accepts [[Granule|''granules'']] through its input connection, and emits [[Granule|''granules'']] through its output connection.<br />
<br />
[[File:Adapter.png|800px]]<br />
<br />
Some adapters have additional output connections - e.g. a [[Splitter|''splitter'']] is an adapter which connects to many output adapters.<br />
<br />
Some adapters might not produce any output, and act as a sink. They might be collating their input data into a common data pool, where it can be picked up later by another adapter, or they might be sending the data they receive to the outside world, with no need to pass it on to an adapter further in the chain.<br />
<br />
= Atomic Adapters and Composite Adapters =<br />
<br />
A first way to classify adapters is by looking whether they are composed of sub-adapters or not: [[Atomic adapter|''atomic adapters'']] versus [[Composite Adapter|''composite adapters'']].<br />
<br />
=Base Adapter Types=<br />
<br />
Below are some basic adapter types, which serve as the basis for the real adapters used in a [[Personality|''personality'']]. <br />
<br />
[[Adapter Chain]]<br />
<br />
[[Assembler]]<br />
<br />
[[Debug Monitor]]<br />
<br />
[[Disassembler]]<br />
<br />
[[Exporter]]<br />
<br />
[[Filter]]<br />
<br />
[[Output]]<br />
<br />
[[Processor]]<br />
<br />
[[Selector]]<br />
<br />
[[Splitter]]<br />
<br />
[[Scripted]]</div>Krishttps://www.docdataflow.com/wiki/index.php/Formula_FileFormula File2013-12-26T20:18:39Z<p>Kris: Created page with "== Formula File == Formula files in Crawler are JavaScript-like files which allow defining placeholders as JavaScript functions."</p>
<hr />
<div>== Formula File ==<br />
<br />
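A hypothetical sketch of what a formula file might contain, mapping placeholder names to JavaScript functions (the placeholder names and the granule parameter are illustrative, not Crawler's actual API):<br />

```javascript
// Hypothetical sketch of a formula file: each placeholder name maps to a
// JavaScript function that computes the placeholder's replacement text.
// The placeholder names and the granule shape are illustrative only.
var formulas = {
    // $$WORDCOUNT$$: number of words in the granule's text
    WORDCOUNT: function (granule) {
        return String(granule.text.split(/\s+/).length);
    },
    // $$TITLE$$: the granule's title, upper-cased
    TITLE: function (granule) {
        return granule.title.toUpperCase();
    }
};
```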
Formula files in Crawler are JavaScript-like files which allow defining placeholders as JavaScript functions.</div>Krishttps://www.docdataflow.com/wiki/index.php/Template_FileTemplate File2013-12-26T20:12:30Z<p>Kris: Created page with "= Template File = Template files in Crawler come in many types. One of the most common types is the ''snippet template''. The concept of a Crawler templ..."</p>
<hr />
<div>= Template File =<br />
<br />
Template files in Crawler come in many types. One of the most common types is the [[Template Snippet|''snippet template'']].<br />
<br />
The idea behind a Crawler template is to take the data encapsulated in a granule and, by means of a [[ProcessorAdapter|''processor adapter'']], combine it with the data stored in a template file to produce a new granule. There are many ways this can be done: the format of the template file depends on how the processor adapter is constructed.</div>Krishttps://www.docdataflow.com/wiki/index.php/Template_SnippetTemplate Snippet2013-12-26T20:09:55Z<p>Kris: </p>
<hr />
<div>== Snippets ==<br />
<br />
A snippet is a [[Template|''template'']] file. It is a text file with a .snippet file name extension.<br />
<br />
Inside the template file, there is a mix of boilerplate text and placeholders. An example: there could be a template ''maindoc.xhtml.snippet'' which could contain<br />
<br />
<pre><br />
<html><br />
$$HEAD$$<br />
$$BODY$$<br />
</html><br />
</pre><br />
<br />
The ''$$HEAD$$'' and ''$$BODY$$'' are placeholders which will be replaced by the text of lower-level granules.<br />
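A processor adapter that fills in such placeholders could work roughly as sketched below, assuming the default ''$$NAME$$'' pattern. The function ''fillSnippet'' is a hypothetical illustration, not Crawler's actual implementation.<br />

```javascript
// Hypothetical sketch of placeholder substitution using the default
// $$NAME$$ pattern; fillSnippet is illustrative, not Crawler's real API.
function fillSnippet(template, values) {
    return template.replace(/\$\$([A-Z]+)\$\$/g, function (match, name) {
        // Leave unknown placeholders untouched rather than dropping them.
        return values.hasOwnProperty(name) ? values[name] : match;
    });
}

// Fill the maindoc.xhtml.snippet example with text from two granules.
var snippet = "<html>\n$$HEAD$$\n$$BODY$$\n</html>";
var result = fillSnippet(snippet, {
    HEAD: "<head><title>Sample</title></head>",
    BODY: "<body><p>Hello</p></body>"
});
```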
<br />
This is only an example: placeholders can take many shapes and forms, and ''$$'' as prefix/suffix is merely Crawler's default placeholder pattern.</div>Krishttps://www.docdataflow.com/wiki/index.php/OverviewOverview2013-12-26T03:59:02Z<p>Kris: </p>
<hr />
<div>== Crawler-based Products ==<br />
<br />
Crawler will become available in 2014 in two different flavors.<br />
<br />
* Middle: For more advanced workflows, we'll have customizable Crawler versions that will come with 'open source' personalities. <br />
<br />
We'll also be able to provide training and ongoing support for developing or customizing personalities. These Crawler versions are geared to be deployed in a server setup. The anticipated applications are automated conversions, automated web publishing, and automated back-end database updates.<br />
<br />
* High end: For the most advanced setups we can also provide 'fully open source' versions of Crawler. This will allow seamless integration of Crawler into an existing workflow. This type of integration always comes with training and ongoing support.<br />
<br />
Contact [mailto:sales@rorohiko.com sales@rorohiko.com] for more info.<br />
<br />
== Overview ==<br />
<br />
Crawler is designed along principles similar to those found in the [http://en.wikipedia.org/wiki/Dataflow_programming Data Flow Programming] paradigm.<br />
<br />
Crawler all by itself does not perform any useful function. In order to become usable it needs to be extended with a [[Personality|''personality'']]. <br />
<br />
The selected ''personality'' determines what function Crawler will perform.<br />
<br />
== Personality ==<br />
<br />
One of the high-level components in a Crawler-based system is called a [[Personality|''personality'']].<br />
<br />
A [[Personality|''personality'']] is a high-level Crawler component which will take input data in some shape or form, and will process it into output data in some other form.<br />
<br />
A few examples:<br />
* InDesign-to-XHTML/CSS: takes in InDesign documents or books and outputs XHTML/CSS files. <br />
* InDesign-to-EPUB: takes in InDesign documents or books, outputs EPUB. <br />
* InDesign-to-Database input: takes in InDesign documents or books, and updates a database with information extracted from the document(s).<br />
<br />
When processing, input document(s) are pushed through a network of [[Adapter|''adapters'']] provided by the personality; data is flowing in and out of the ''adapters''. <br />
<br />
[[File:Sampleexporter1.png|800px]]<br />
<br />
If the ''personality'' were a hive, then the ''adapters'' would be the worker bees.<br />
<br />
A ''personality'' is somewhat reminiscent of a [http://en.wikipedia.org/wiki/Rube_Goldberg_machine Rube Goldberg-machine].<br />
<br />
The initial ''adapters'' process the document, and take it apart into ever smaller chunks of data. <br />
<br />
The reverse also happens: some adapters collate smaller chunks back into larger chunks. <br />
<br />
These 'chunks of data' are referred to as [[Granule|''granules'']].<br />
<br />
Example: an adapter might take in a paragraph ''granule'' and split it into individual word ''granules''. Another adapter further downstream might take a number of word ''granules'' and concatenate them back into a paragraph ''granule''.<br />
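The paragraph/word example can be sketched as follows; ''splitParagraph'' and ''joinWords'' are illustrative stand-ins for the disassembling and assembling adapters, not actual Crawler code.<br />

```javascript
// Illustrative stand-ins for a disassembling and an assembling adapter;
// these names are hypothetical, not taken from Crawler itself.
function splitParagraph(paragraphGranule) {
    // Disassemble one paragraph granule into individual word granules.
    return paragraphGranule.split(/\s+/);
}

function joinWords(wordGranules) {
    // Reassemble word granules into a single paragraph granule.
    return wordGranules.join(" ");
}

var words = splitParagraph("the quick brown fox");
var paragraph = joinWords(words);
```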
<br />
Some ''adapters'' perform some kind of processing on the ''granules'' they receive: they might change them in some way, discard them, count them, reorder them, or create new ''granules'' based on previous ''granules''...<br />
<br />
Other ''adapters'' construct new ''granules'' based on [[Template Snippet|''template snippets'']]. For example, an ''adapter'' could take in raw text and combine it with a ''template snippet'' into an XML-formatted ''granule''.<br />
<br />
The general idea is that the input data is broken apart into smaller entities, and these smaller entities are then put back together again in a different shape, possibly performing a document conversion in the process.</div>Krishttps://www.docdataflow.com/wiki/index.php/OriginalMainPageOriginalMainPage2013-12-26T03:33:43Z<p>Kris: Created page with "'''MediaWiki has been successfully installed.''' Consult the [//meta.wikimedia.org/wiki/Help:Contents User's Guide] for information on using the wiki software. == Getting st..."</p>
<hr />
<div>'''MediaWiki has been successfully installed.'''<br />
<br />
Consult the [//meta.wikimedia.org/wiki/Help:Contents User's Guide] for information on using the wiki software.<br />
<br />
== Getting started ==<br />
* [//www.mediawiki.org/wiki/Manual:Configuration_settings Configuration settings list]<br />
* [//www.mediawiki.org/wiki/Manual:FAQ MediaWiki FAQ]<br />
* [https://lists.wikimedia.org/mailman/listinfo/mediawiki-announce MediaWiki release mailing list]<br />
* [//www.mediawiki.org/wiki/Localisation#Translation_resources Localise MediaWiki for your language]</div>Krishttps://www.docdataflow.com/wiki/index.php/Main_PageMain Page2013-12-26T00:42:50Z<p>Kris: </p>
<hr />
<div>Crawler is a generic document processor engine. It can be used for document conversion, as well as reporting and statistics.<br />
<br />
Crawler is work-in-progress. The first release works with Adobe InDesign CS5 Server (or Adobe InDesign CS5) or higher as the source format. Future versions will support additional source formats.<br />
<br />
This Wiki is currently under construction, and it changes/grows day by day.<br />
<br />
== Predecessor: ePubCrawler ==<br />
<br />
Crawler was developed from scratch, but it is based on a range of ideas that emerged during the development of the ePubCrawler software (http://www.rorohiko.com/epubcrawler). <br />
<br />
ePubCrawler won't be developed further. At some point in time, Crawler will supersede ePubCrawler, and EPUB will simply be one of the target formats supported by Crawler.<br />
<br />
== Crawler ==<br />
<br />
[[Overview]]<br />
<br />
[[Philosophy]]<br />
<br />
[[Components]]<br />
<br />
[[Pre-built Personalities]]<br />
<br />
[[Custom Personalities]]</div>MediaWiki default