Research Workflow

The workflow, end to end.

Every research run is a 12-phase pipeline, designed to be deterministic in shape and recoverable across failures. Here is what each phase does and what it emits: the same view the run UI shows you, in document form.

§ 01 · Discovery phases

The first half of the pipeline finds candidate material. The model never reads anything until the discovery phases have done their job.

1 · Analyzing query

The pipeline starts by parsing the user's question. Intent is classified (factual, comparative, exploratory, etc.) and entities are extracted. This phase emits a structured representation of the question that downstream phases plan against — not a paraphrase, an inventory.
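
A minimal sketch of what that inventory might look like as a data structure. The field names and the recency flag are illustrative assumptions, not the engine's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class Intent(Enum):
    FACTUAL = "factual"
    COMPARATIVE = "comparative"
    EXPLORATORY = "exploratory"

@dataclass
class ParsedQuery:
    """Structured output of phase 1 (illustrative shape)."""
    raw: str                                           # the user's original question
    intent: Intent                                     # classified intent
    entities: list[str] = field(default_factory=list)  # extracted entities
    recency_sensitive: bool = False                    # assumed hint for later recency weighting
```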

2 · Planning

The planner produces between one and three sub-queries. For a single-fact question, one is usually enough; for a comparative or exploratory question, two or three give the search phase enough breadth without inflating cost. Each sub-query targets a distinct angle — never a paraphrase of the same thing.
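
To make the distinct-angle rule concrete, here is a hypothetical fan-out for a comparative question (the sub-queries are invented for illustration):

```python
question = "How do Rust and Go compare for building network services?"

sub_queries = [
    "Rust network services performance benchmarks",     # angle: performance
    "Go concurrency model for network programming",     # angle: concurrency
    "Rust vs Go production adoption for web services",  # angle: adoption
]
```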

3 · Searching

Each sub-query goes to a web search provider (Tavily). The maximum number of results per sub-query is configurable. The output is a flat list of candidate URLs with snippets and provider-supplied metadata.
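
A sketch of the fan-out using Tavily's Python client. The candidate-dict shape is an assumption, and the per-query cap of 5 stands in for whatever the configuration says:

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-...")  # key handling elided

sub_queries = ["Rust network services performance benchmarks"]  # from the planner

candidates = []
for sub_query in sub_queries:
    response = client.search(query=sub_query, max_results=5)  # cap is configurable
    for result in response["results"]:
        candidates.append({
            "url": result["url"],
            "snippet": result.get("content", ""),  # provider-supplied snippet
            "title": result.get("title", ""),
        })
```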

4 · Deduping

Candidate URLs are normalized and deduplicated before any network cost is paid for scraping. Two URLs that resolve to the same canonical document are collapsed; tracking parameters are stripped; trivial variants are merged.
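
One plausible shape for the normalization step, using only the standard library; the exact canonicalization rules and the tracking-parameter list are assumptions:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "gclid", "fbclid"}

def normalize(url: str) -> str:
    """Collapse trivial variants of a URL into one canonical key."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k not in TRACKING_PARAMS]   # strip tracking parameters
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",      # merge trailing-slash variants
        "",                                 # drop path params
        urlencode(sorted(query)),           # stable parameter order
        "",                                 # drop fragments
    ))

urls = [
    "https://example.com/post?utm_source=newsletter",
    "https://example.com/post/",
]
assert normalize(urls[0]) == normalize(urls[1])  # both collapse to one key
```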

§ 02 · Extraction phases

The middle of the pipeline turns URLs into clean, screenable text.

5 · Scraping

Surviving URLs are scraped in parallel using Trafilatura. Concurrency is configurable; the default is tuned to be fast without overwhelming target sites. Scraping respects robots, retries on transient failures, and times out cleanly on dead pages.
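
A minimal sketch of the parallel fetch, assuming a thread pool around Trafilatura's fetcher; the retry and robots handling described above are elided here:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import trafilatura

MAX_WORKERS = 8  # illustrative; the real concurrency is configurable

def fetch(url: str):
    # fetch_url returns the decoded page, or None on timeout/failure
    return url, trafilatura.fetch_url(url)

urls = ["https://example.com/a", "https://example.com/b"]  # dedup survivors
pages = {}
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    for future in as_completed([pool.submit(fetch, u) for u in urls]):
        url, html = future.result()
        if html is not None:  # dead pages fail cleanly to None
            pages[url] = html
```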

6 · Fetching & text extraction

Scraped HTML is reduced to readable text. Boilerplate (nav, footers, comments) is stripped; the main content is extracted; metadata such as title, byline, and publication date is captured wherever the page exposes it.
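
This maps naturally onto Trafilatura's extraction options; a sketch for one page, with option names per Trafilatura's documented API and the rest assumed:

```python
import json

import trafilatura

html = trafilatura.fetch_url("https://example.com/article")

doc = None
if html:
    extracted = trafilatura.extract(
        html,
        url="https://example.com/article",
        output_format="json",    # main text plus metadata as one JSON object
        with_metadata=True,      # title, author, date when the page exposes them
        include_comments=False,  # strip comment sections with the boilerplate
    )
    if extracted:
        doc = json.loads(extracted)
```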

7 · Screening

Low-quality pages are dropped. The screen enforces a minimum word count, filters out near-empty extractions, and tags pages from domains the system has learned to distrust. What survives is fed into scoring.
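
A sketch of the screen, with an assumed word-count floor and a hypothetical distrusted-domain list. Note that distrusted domains are tagged rather than dropped, matching the description above:

```python
from urllib.parse import urlparse

MIN_WORDS = 150                                # assumed threshold
DISTRUSTED_DOMAINS = {"content-farm.example"}  # hypothetical learned list

def screen(doc: dict) -> dict | None:
    """Drop near-empty extractions; tag (not drop) distrusted domains."""
    text = doc.get("text") or ""
    if len(text.split()) < MIN_WORDS:
        return None  # fails the word-count floor
    doc["distrusted"] = urlparse(doc.get("url", "")).netloc in DISTRUSTED_DOMAINS
    return doc

docs = [{"url": "https://example.com/article", "text": "word " * 200}]
screened = [d for d in (screen(doc) for doc in docs) if d is not None]
```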

§ 03 · Evaluation phases

The selection phases choose which sources actually inform the report.

8 · Selecting top sources

Surviving sources are scored on three axes — credibility (signals about the publisher and authorship), relevance (how well the content matches the query), and recency (with the weight tunable based on the question). A composite overall score ranks them, and the top sources are passed forward. Each pick carries a short selection rationale explaining why it advanced.
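
A worked sketch of the composite: only the three axes and the tunable recency weight come from the description above; the 0-to-1 scales and the equal split of the remaining weight are assumptions:

```python
def overall_score(credibility: float, relevance: float, recency: float,
                  recency_weight: float = 0.2) -> float:
    """Composite ranking score over the three axes (illustrative weighting)."""
    remaining = (1.0 - recency_weight) / 2  # assumed even split
    return remaining * credibility + remaining * relevance + recency_weight * recency

# a time-sensitive question might raise recency_weight; an evergreen one lowers it
score = overall_score(0.8, 0.9, 0.3, recency_weight=0.4)
```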

9 · Deep analysis

For the surviving five to twelve sources, the engine performs structured extraction — claims, quotes, evidence, and counter-evidence. This is the only phase where the model reads source content directly. Every extracted item carries a back-reference to the source it came from.
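
One way the extraction output might be shaped so that every item carries its back-reference; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ExtractedItem:
    """One unit of phase-9 output (illustrative shape)."""
    kind: str        # "claim" | "quote" | "evidence" | "counter_evidence"
    text: str        # the extracted content itself
    source_url: str  # back-reference to the source it came from

item = ExtractedItem(
    kind="claim",
    text="Go's scheduler multiplexes goroutines over OS threads.",
    source_url="https://example.com/article",
)
```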

§ 04 · Output phases

The final phases turn analysis into a finished, persisted deliverable.

10 · Citations

Citations are built for every selected source in every supported format — MLA, APA, Chicago Notes, Chicago Author-Date, IEEE, Harvard, BibTeX. Metadata gaps are filled with conventional fallbacks (n.d. for missing year, domain as publisher when no publisher field is available, etc.). See Citations for the full builder spec.
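
A sketch of the fallback behavior for one format (MLA-style); this is illustrative only, and the real rules live in the Citations builder spec:

```python
def mla_citation(meta: dict) -> str:
    """MLA-style string with conventional fallbacks (illustrative sketch)."""
    author = meta.get("author")
    title = meta.get("title") or "Untitled"
    publisher = meta.get("publisher") or meta.get("domain", "")  # domain fallback
    year = meta.get("year") or "n.d."                            # missing year
    pieces = [author, f'"{title}."', publisher, str(year)]
    return " ".join(p for p in pieces if p)

mla_citation({"title": "On Goroutines", "domain": "example.com"})
# -> '"On Goroutines." example.com n.d.'
```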

11 · Synthesis

The synthesizer writes the executive summary first and the full report second. Both reference the analysis output, not raw source text — so claims are grounded in the structured extractions, not in whatever the model happens to recall. References are inlined; the bibliography is the citations block from phase 10.

12 · Complete

The run is finalized. Final report, sources, citations, scores, and the full reasoning timeline are persisted to the database. Sources are saved idempotently to the user's library — re-running with overlapping sources is safe.
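
Idempotent saving typically hinges on a uniqueness constraint. A minimal sketch with SQLite and hypothetical table and column names:

```python
import sqlite3

def save_source(conn: sqlite3.Connection, user_id: str, url: str, title: str) -> None:
    # UNIQUE(user_id, url) makes re-runs with overlapping sources a no-op
    conn.execute(
        "INSERT INTO library_sources (user_id, url, title) VALUES (?, ?, ?) "
        "ON CONFLICT (user_id, url) DO NOTHING",
        (user_id, url, title),
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE library_sources "
             "(user_id TEXT, url TEXT, title TEXT, UNIQUE(user_id, url))")
save_source(conn, "u1", "https://example.com/article", "On Goroutines")
save_source(conn, "u1", "https://example.com/article", "On Goroutines")  # no-op
```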

§ 05 · What you get

A completed run produces, retrievable through the UI or the API:

- the final report, led by its executive summary
- the selected sources, each with its scores and selection rationale
- citations in every supported format
- the full reasoning timeline

§ 06 · Checkpoint recovery

The pipeline checkpoints between phases. If a phase fails — a transient network error, a flaky scraping target, an ill-formed search response — the engine recovers from the last successful checkpoint rather than restarting from the top. This makes long-running queries durable: a single bad URL does not collapse the run.

For phases that produce partial output (scraping, deep analysis), checkpoints capture incremental progress, so a resumed run starts from where it stopped, not where it began.
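
A file-based sketch of the checkpoint-and-resume contract; the names and the storage choice here are hypothetical:

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")  # illustrative storage location

def save_checkpoint(run_id: str, phase: int, state: dict) -> None:
    # called after each phase; partial phases can call repeatedly with more state
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    (CHECKPOINT_DIR / f"{run_id}.{phase:02d}.json").write_text(json.dumps(state))

def resume_point(run_id: str) -> tuple[int, dict]:
    # resume after the last phase that checkpointed, with its state reloaded
    done = sorted(CHECKPOINT_DIR.glob(f"{run_id}.*.json"))
    if not done:
        return 1, {}  # fresh run: start at phase 1
    last = done[-1]
    phase = int(last.stem.split(".")[-1])
    return phase + 1, json.loads(last.read_text())
```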

Tip: The phases are visible to you in real time through the run timeline UI and through the WebSocket event stream documented in the API reference. There is no opaque step.