Foundation tools for importing website content into a form that can be consumed in a Helix project.
helix-importer is composed of 2 main building blocks:
- explorer: crawl a website to construct a list of urls to be imported
- importer: construct an importer - for an input url, transform the DOM and convert it into a Markdown file
The folder ./src/wp contains WordPress specific utils and explorer methods.
The idea of an explorer is to crawl the site in order to collect a list of urls. This list of urls can then be imported.
Here is a basic sample:
import { WPContentPager, FSHandler, CSV } from '@adobe/helix-importer';

async function main() {
  // Crawl the paged results of the WordPress site, up to nbMaxPages pages.
  const pager = new WPContentPager({
    nbMaxPages: 1000,
    url: 'url to a WordPress site',
  });
  const entries = await pager.explore();

  // Convert the collected entries to CSV and store the result via the handler.
  const csv = CSV.toCSV(entries);
  const handler = new FSHandler('output', console);
  await handler.put('explorer_results.csv', csv);
}
In this example, the WPContentPager extends the PagingExplorer which implements the 2 methods:
- fetch: defines how to fetch one page of results
- explore: extracts the list of urls present on that page
The final result is a list of the urls found in the paged results given by the WordPress API /page/${page_number}.
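The paging pattern described above can be sketched as follows. This is a minimal, self-contained illustration and not the library's actual PagingExplorer code: the class name SketchPagingExplorer, the in-memory FAKE_PAGES data and the rule that an empty page ends the crawl are all assumptions made for the example. For brevity, explore here both loops over the pages and collects the urls.

```javascript
// Hypothetical stand-in for a paged listing: page number -> urls on that page.
const FAKE_PAGES = {
  1: ['https://example.com/a', 'https://example.com/b'],
  2: ['https://example.com/c'],
};

// Sketch only: not the real PagingExplorer from helix-importer.
class SketchPagingExplorer {
  constructor({ nbMaxPages }) {
    this.nbMaxPages = nbMaxPages;
  }

  // Defines how to fetch one page of results (here: a lookup in FAKE_PAGES).
  async fetch(page) {
    return FAKE_PAGES[page] || [];
  }

  // Walks the pages in order and collects the urls found on each one.
  async explore() {
    const entries = [];
    for (let page = 1; page <= this.nbMaxPages; page += 1) {
      const urls = await this.fetch(page);
      if (urls.length === 0) break; // assumption: an empty page ends the listing
      urls.forEach((url) => entries.push({ url, page }));
    }
    return entries;
  }
}

// Usage: collect every url from the fake listing and print the entries.
new SketchPagingExplorer({ nbMaxPages: 10 })
  .explore()
  .then((entries) => console.log(entries));
```

Keeping fetch separate from the page loop means a subclass only has to describe how one page is retrieved; the stop condition and the nbMaxPages limit stay in one place.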
An importer must extend PageImporter and implement the fetch and process methods. The general idea is that fetch receives the url to import and is responsible for returning the HTML. process receives the corresponding Document in order to filter / rearrange / reshuffle the DOM before it gets processed by the Markdown transformer. process computes and defines the list of PageImporterResource (there can be more than one), each resource being transformed into a Markdown document.
The goal of the importer is to get rid of the generic DOM elements (the header, the footer, the nav and all other elements that are common to all pages) in order to keep only the unique piece(s) of content per page.
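The fetch / process split can be illustrated with the sketch below. It does not use the real PageImporter or PageImporterResource classes: SketchImporter, its run method, the FAKE_SITE data and the flat { tag, text } node model (standing in for a real DOM Document) are hypothetical names chosen only to keep the example self-contained.

```javascript
// Hypothetical, simplified page model standing in for a real DOM Document.
const FAKE_SITE = {
  'https://example.com/post': [
    { tag: 'header', text: 'Site title' },
    { tag: 'nav', text: 'Home | Blog' },
    { tag: 'article', text: 'The unique content of this page.' },
    { tag: 'footer', text: 'Copyright' },
  ],
};

// Sketch only: not the real PageImporter from helix-importer.
class SketchImporter {
  // Receives the url to import and returns the page content.
  async fetch(url) {
    return FAKE_SITE[url] || [];
  }

  // Drops the elements shared by all pages and returns the resources to convert.
  process(nodes, url) {
    const generic = new Set(['header', 'footer', 'nav']);
    const unique = nodes.filter((node) => !generic.has(node.tag));
    const name = new URL(url).pathname.replace(/^\//, '');
    return [{ name, content: unique.map((node) => node.text).join('\n\n') }];
  }

  // Chains the two steps: fetch the page, then process it into resources.
  async run(url) {
    const nodes = await this.fetch(url);
    return this.process(nodes, url);
  }
}

// Usage: import one page of the fake site and print the resulting resources.
new SketchImporter()
  .run('https://example.com/post')
  .then((resources) => console.log(resources));
```

Because process returns an array, the same shape covers pages that should be split into several output documents.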
HTML2x methods (HTML2md and HTML2docx) are convenience methods to run an import. As input, they take:
- URL: URL of the page to import
- document: the DOM element to import
- transformerCfg: object with the transformation "rules". The object can be either:
  - { transformDOM: ({ url, document, html, params }) => { ... return element-to-convert }, generateDocumentPath: ({ url, document, html, params }) => { ... return path-to-target; }} for a single mapping between one input document / one output file
  - { transform: ({ url, document, html, params }) => { ... return [{ element: first-element-to-convert, path: first-path-to-target }, ...] }} for a mapping between one input document / multiple output files (useful to generate multiple docx from a single web page)
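As an illustration of the two transformerCfg shapes, here is a small sketch. Only the property names (transformDOM, generateDocumentPath, transform) and the argument shape come from the description above; fakeDocument, the path rules and the part-N naming are assumptions for the example, and the functions are called directly rather than through the HTML2x methods.

```javascript
// Minimal stand-in for a DOM document, used only to make the sketch runnable.
const fakeDocument = {
  body: 'BODY',
  querySelectorAll: (selector) => (selector === 'article' ? ['ARTICLE-1', 'ARTICLE-2'] : []),
};

// Shape 1: one input document -> one output file.
const singleCfg = {
  transformDOM: ({ document }) => document.body,
  // Assumption: the target path is the url pathname without a trailing slash.
  generateDocumentPath: ({ url }) => new URL(url).pathname.replace(/\/$/, ''),
};

// Shape 2: one input document -> multiple output files.
const multiCfg = {
  transform: ({ url, document }) => {
    const base = new URL(url).pathname.replace(/\/$/, '');
    // Spread into an array so the same code also works with a real NodeList.
    return [...document.querySelectorAll('article')].map((element, index) => ({
      element,
      path: `${base}/part-${index + 1}`,
    }));
  },
};

const args = {
  url: 'https://example.com/blog/post/',
  document: fakeDocument,
  html: '',
  params: {},
};

console.log(singleCfg.generateDocumentPath(args)); // /blog/post
console.log(multiCfg.transform(args).map((r) => r.path)); // [ '/blog/post/part-1', '/blog/post/part-2' ]
```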
The Helix Importer has a dedicated browser UI: see https://github.com/adobe/helix-importer-ui
npm i https://github.com/adobe/helix-importer
TODO: publish npm module
import { ... } from '@adobe/helix-importer';