The workflow, end to end.
Every research run is a 12-phase pipeline. Here is what each phase does and what it emits — the same view the run UI shows you, in document form. The pipeline is designed to be deterministic in shape and recoverable across failures.
§ 01 · Discovery phases
The first half of the pipeline finds candidate material. The model never reads anything until the discovery phases have done their job.
1 · Analyzing query
The pipeline starts by parsing the user's question. Intent is classified (factual, comparative, exploratory, etc.) and entities are extracted. This phase emits a structured representation of the question that downstream phases plan against — not a paraphrase, an inventory.
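A minimal sketch of what that inventory could look like; the type and field names here are illustrative, not the engine's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class Intent(Enum):
    FACTUAL = "factual"
    COMPARATIVE = "comparative"
    EXPLORATORY = "exploratory"

@dataclass
class QueryAnalysis:
    """Structured inventory of the user's question (illustrative schema)."""
    raw_query: str
    intent: Intent
    entities: list[str] = field(default_factory=list)     # named things the query mentions
    constraints: list[str] = field(default_factory=list)  # time ranges, regions, versions, ...
```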
2 · Planning
The planner produces between one and three sub-queries. For a single-fact question, one is usually enough; for a comparative or exploratory question, two or three give the search phase enough breadth without inflating cost. Each sub-query targets a distinct angle — never a paraphrase of the same thing.
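As a rough illustration of the planner's contract — the function and field names below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SubQuery:
    text: str    # the string sent to the search provider
    angle: str   # which distinct aspect of the question it covers

def plan(raw_query: str, intent: str) -> list[SubQuery]:
    """Emit one to three sub-queries, each covering a different angle (sketch only)."""
    if intent == "factual":
        return [SubQuery(raw_query, angle="direct lookup")]
    # comparative / exploratory questions get broader coverage
    return [
        SubQuery(f"{raw_query} overview", angle="background"),
        SubQuery(f"{raw_query} trade-offs", angle="contrast"),
        SubQuery(f"{raw_query} recent developments", angle="recency"),
    ]
```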
3 · Searching
Each sub-query goes to a web search provider (Tavily). The maximum number of results per sub-query is configurable. The output is a flat list of candidate URLs with snippets and provider-supplied metadata.
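Assuming the Tavily Python client, a per-sub-query search looks roughly like this, with `max_results` standing in for the configurable per-sub-query cap:

```python
import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

def search_subquery(sub_query: str, max_results: int = 5) -> list[dict]:
    """Return candidate URLs with snippets and provider metadata (sketch)."""
    response = client.search(query=sub_query, max_results=max_results)
    return [
        {"url": r["url"], "title": r["title"], "snippet": r["content"]}
        for r in response["results"]
    ]
```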
4 · Deduping
Candidate URLs are normalized and deduplicated before any network cost is paid for scraping. Two URLs that resolve to the same canonical document are collapsed; tracking parameters are stripped; trivial variants are merged.
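A simplified sketch of the normalization step using Python's standard library; the tracking-parameter list is illustrative, not exhaustive:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "gclid", "fbclid", "ref"}

def canonicalize(url: str) -> str:
    """Normalize a URL so trivial variants collapse to one key (sketch)."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower().removeprefix("www.")
    path = path.rstrip("/") or "/"
    query = urlencode([(k, v) for k, v in parse_qsl(query)
                       if k not in TRACKING_PARAMS])
    return urlunsplit((scheme.lower(), netloc, path, query, ""))

def dedupe(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each canonical form."""
    seen: dict[str, str] = {}
    for url in urls:
        seen.setdefault(canonicalize(url), url)
    return list(seen.values())
```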
§ 02 · Extraction phases
The middle of the pipeline turns URLs into clean, screenable text.
5 · Scraping
Surviving URLs are scraped in parallel using Trafilatura. Concurrency is configurable; the default is tuned to be fast without overwhelming target sites. Scraping respects robots.txt, retries on transient failures, and times out cleanly on dead pages.
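A stripped-down sketch of the parallel fetch, assuming Trafilatura's `fetch_url`; the robots, retry, and timeout handling described above is omitted here for brevity:

```python
from concurrent.futures import ThreadPoolExecutor
import trafilatura

def fetch_one(url: str) -> tuple[str, str | None]:
    """Download a single page; returns (url, html) or (url, None) on failure."""
    return url, trafilatura.fetch_url(url)

def scrape_all(urls: list[str], concurrency: int = 8) -> dict[str, str]:
    """Fetch surviving URLs in parallel; dead pages simply drop out (sketch)."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = pool.map(fetch_one, urls)
    return {url: html for url, html in results if html}
```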
6 · Fetching & text extraction
Scraped HTML is reduced to readable text. Boilerplate (nav, footers, comments) is stripped; the main content is extracted; metadata such as title, byline, and publication date is captured wherever the page exposes it.
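Roughly, using Trafilatura's `extract` and `extract_metadata` helpers (return shapes simplified for illustration):

```python
import trafilatura

def extract_page(html: str, url: str) -> dict | None:
    """Reduce raw HTML to readable text plus whatever metadata the page exposes (sketch)."""
    text = trafilatura.extract(html, url=url, include_comments=False)
    if not text:
        return None
    meta = trafilatura.extract_metadata(html)
    return {
        "url": url,
        "text": text,
        "title": meta.title if meta else None,
        "author": meta.author if meta else None,
        "date": meta.date if meta else None,
    }
```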
7 · Screening
Low-quality pages are dropped. The screen enforces a minimum word count, filters out near-empty extractions, and tags pages from domains the system has learned to distrust. What survives is fed into scoring.
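A minimal sketch of the screen; the word-count threshold and domain blocklist are illustrative placeholders:

```python
MIN_WORDS = 150                                      # illustrative threshold, configurable in practice
DISTRUSTED_DOMAINS = {"example-content-farm.com"}    # hypothetical blocklist

def screen(pages: list[dict]) -> list[dict]:
    """Drop near-empty extractions and tag pages from distrusted domains (sketch)."""
    survivors = []
    for page in pages:
        if len(page["text"].split()) < MIN_WORDS:
            continue
        page["distrusted"] = any(d in page["url"] for d in DISTRUSTED_DOMAINS)
        survivors.append(page)
    return survivors
```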
§ 03 · Evaluation phases
The selection phases choose which sources actually inform the report.
8 · Selecting top sources
Surviving sources are scored on three axes — credibility (signals about the publisher and authorship), relevance (how well the content matches the query), and recency (with the weight tunable based on the question). A composite overall score ranks them, and the top sources are passed forward. Each pick carries a short selection rationale explaining why it advanced.
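For illustration only, a composite score could be computed like this; the weights shown are placeholders, not the engine's actual tuning:

```python
def overall_score(credibility: float, relevance: float, recency: float,
                  recency_weight: float = 0.2) -> float:
    """Weighted composite of the three axes (illustrative weights).

    The recency weight is tunable: a news-style question raises it,
    a timeless factual question lowers it.
    """
    remaining = 1.0 - recency_weight
    return (0.55 * remaining * credibility
            + 0.45 * remaining * relevance
            + recency_weight * recency)

def select_top(sources: list[dict], k: int = 8) -> list[dict]:
    """Rank by composite score and keep the top k (sketch)."""
    ranked = sorted(sources, key=lambda s: overall_score(
        s["credibility"], s["relevance"], s["recency"]), reverse=True)
    return ranked[:k]
```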
9 · Deep analysis
For the surviving five to twelve sources, the engine performs structured extraction — claims, quotes, evidence, and counter-evidence. This is the only phase where the model reads source content directly. Every extracted item carries a back-reference to the source it came from.
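An illustrative shape for the extraction records, showing the per-item back-reference:

```python
from dataclasses import dataclass, field

@dataclass
class Extraction:
    """One structured item pulled from a source during deep analysis (illustrative)."""
    source_url: str      # back-reference to the source it came from
    kind: str            # "claim" | "quote" | "evidence" | "counter-evidence"
    text: str
    context: str = ""    # surrounding passage, kept for verification

@dataclass
class SourceAnalysis:
    source_url: str
    items: list[Extraction] = field(default_factory=list)
```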
§ 04 · Output phases
The final phases turn analysis into a finished, persisted deliverable.
10 · Citations
Citations are built for every selected source in every supported format — MLA, APA, Chicago Notes, Chicago Author-Date, IEEE, Harvard, BibTeX. Metadata gaps are filled with conventional fallbacks (n.d. for missing year, domain as publisher when no publisher field is available, etc.). See Citations for the full builder spec.
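A sketch of the fallback logic; the field names are illustrative:

```python
from urllib.parse import urlsplit

def with_fallbacks(meta: dict, url: str) -> dict:
    """Fill citation metadata gaps with conventional fallbacks (sketch)."""
    domain = urlsplit(url).hostname or url
    return {
        "title": meta.get("title") or "[Untitled]",
        "author": meta.get("author"),                  # left empty if genuinely unknown
        "publisher": meta.get("publisher") or domain,  # domain stands in for a missing publisher
        "year": meta.get("year") or "n.d.",            # "no date" convention
        "url": url,
    }
```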
11 · Synthesis
The synthesizer writes the executive summary first and the full report second. Both reference the analysis output, not raw source text — so claims are grounded in the structured extractions, not in whatever the model happens to recall. References are inlined; the bibliography is the citations block from phase 10.
12 · Complete
The run is finalized. Final report, sources, citations, scores, and the full reasoning timeline are persisted to the database. Sources are saved idempotently to the user's library — re-running with overlapping sources is safe.
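Idempotent saving can be sketched as an insert that ignores conflicts — here with SQLite and a hypothetical `library_sources` table that has a unique `(user_id, url)` constraint:

```python
import sqlite3

def save_source(conn: sqlite3.Connection, user_id: str, url: str, title: str) -> None:
    """Idempotent library save: re-running with overlapping sources is a no-op (sketch)."""
    conn.execute(
        """
        INSERT INTO library_sources (user_id, url, title)
        VALUES (?, ?, ?)
        ON CONFLICT (user_id, url) DO NOTHING
        """,
        (user_id, url, title),
    )
    conn.commit()
```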
§ 05 · What you get
A completed run produces the following, retrievable through the UI or the API:
- A final report — executive summary plus full report.
- A list of sources with credibility, relevance, recency, and overall scores, and a selection rationale for each.
- A citations bundle with all formats prebuilt.
- A reasoning chunk per phase — what the engine did, what it decided, and which inputs and outputs it used.
§ 06 · Checkpoint recovery
The pipeline checkpoints between phases. If a phase fails — a transient network error, a flaky scraping target, an ill-formed search response — the engine recovers from the last successful checkpoint rather than restarting from the top. This makes long-running queries durable: a single bad URL does not collapse the run.
For phases that produce partial output (scraping, deep analysis), checkpoints capture incremental progress, so a resumed run starts from where it stopped, not where it began.
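A minimal sketch of phase-level checkpointing; the phase names and the on-disk JSON format here are hypothetical:

```python
import json
from pathlib import Path
from typing import Callable

PHASES = ["analyze", "plan", "search", "dedupe", "scrape", "extract",
          "screen", "select", "deep_analysis", "cite", "synthesize", "complete"]

def run_with_checkpoints(run_id: str, phase_fns: dict[str, Callable[[dict], dict]],
                         checkpoint_dir: Path) -> dict:
    """Resume from the last successful phase instead of restarting from the top (sketch)."""
    state: dict = {}
    for phase in PHASES:
        ckpt = checkpoint_dir / f"{run_id}.{phase}.json"
        if ckpt.exists():                        # phase already completed in an earlier attempt
            state = json.loads(ckpt.read_text())
            continue
        state = phase_fns[phase](state)          # may raise; earlier checkpoints survive
        ckpt.write_text(json.dumps(state))
    return state
```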