SMILA/Documentation/Bundle org.eclipse.smila.processing.pipelets.boilerpipe
Revision as of 05:46, 22 May 2012 by Unnamed Poltroon
This page describes the SMILA pipelets provided by bundle org.eclipse.smila.processing.pipelets.boilerpipe.
General
All pipelets in this bundle support the configurable error handling as described in SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation. When used in jobmanager workflows, records causing errors are dropped.
Read Type
- runtime: Parameters are read each time a record is processed. The parameter value can be set per record.
- init: Parameters are read once from the pipelet configuration when the pipelet is initialized. The parameter value cannot be overridden in the record.
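To illustrate the runtime read type, a record could carry its own value for a parameter, which would then take precedence over the pipelet configuration for that record only. The fragment below is a sketch, not taken from this page: it assumes that per-record values are read from a map named `_parameters`, that the `rec` prefix is bound as in the pipeline definition, and it uses a hypothetical record ID and output name.

```xml
<!-- Sketch only: the "_parameters" map name is an assumption, not confirmed by this page. -->
<rec:Record>
  <!-- Hypothetical record ID, for illustration. -->
  <rec:Val key="_recordid">doc-1</rec:Val>
  <rec:Map key="_parameters">
    <!-- Per-record value for a runtime parameter; "plainText" is a made-up name. -->
    <rec:Val key="outputName">plainText</rec:Val>
  </rec:Map>
</rec:Record>
```

An init parameter such as filter would not be affected by a value placed in the record, since it is read only once from the pipelet configuration.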
org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet
Extracts text from an HTML input using the Boilerpipe library (http://code.google.com/p/boilerpipe/).
Configuration
Property | Type | Read Type | Description |
---|---|---|---|
inputType | String : ATTACHMENT, ATTRIBUTE | runtime | Defines whether the HTML input is found in an attachment or in an attribute of the record |
outputType | String : ATTACHMENT, ATTRIBUTE | runtime | Defines whether the plain text should be stored in an attachment or in an attribute of the record |
inputName | String | runtime | Name of attachment or attribute that contains the HTML input |
outputName | String | runtime | Name of attachment or attribute for plain text output |
encodingAttribute | String | runtime | Optional name of the attribute that contains the encoding of the input attachment. |
defaultEncoding | String | runtime | Optional fallback encoding, used if everything else fails. |
filter | Sequence of String | init | A list of Boilerpipe filters to use. Entries may be class names or references to static methods or static variables (defaults to de.l3s.boilerpipe.extractors.ArticleExtractor.INSTANCE). |
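As a rough sketch of how the optional properties might look inside `<proc:configuration>`, the fragment below sets a fallback encoding and a non-default filter list. It assumes that a sequence parameter is written as a `rec:Seq` element with one `rec:Val` per entry, and it picks `DefaultExtractor.INSTANCE` from the Boilerpipe library purely as an example, written in the same form as the documented default. The UTF-8 value is likewise only illustrative.

```xml
<proc:configuration>
  <rec:Val key="inputType">ATTACHMENT</rec:Val>
  <rec:Val key="inputName">html</rec:Val>
  <rec:Val key="outputType">ATTRIBUTE</rec:Val>
  <rec:Val key="outputName">text</rec:Val>
  <!-- Illustrative fallback encoding. -->
  <rec:Val key="defaultEncoding">UTF-8</rec:Val>
  <!-- Assumed notation for a "Sequence of String"; read once at initialization (init). -->
  <rec:Seq key="filter">
    <rec:Val>de.l3s.boilerpipe.extractors.DefaultExtractor.INSTANCE</rec:Val>
  </rec:Seq>
</proc:configuration>
```

Because filter has read type init, changing the filter list requires changing the pipelet configuration; it cannot be varied per record.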
Example
Extract text from the HTML input in attachment "html" into the attribute "text" using the encoding given in attribute "http.encoding".
<proc:invokePipelet name="extractText">
  <proc:pipelet class="org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet" />
  <proc:variables input="request" />
  <proc:configuration>
    <rec:Val key="inputType">ATTACHMENT</rec:Val>
    <rec:Val key="inputName">html</rec:Val>
    <rec:Val key="outputType">ATTRIBUTE</rec:Val>
    <rec:Val key="outputName">text</rec:Val>
    <rec:Val key="encodingAttribute">http.encoding</rec:Val>
  </proc:configuration>
</proc:invokePipelet>