Sources & the library.
Source discovery, scoring, dedup, and how completed runs save into your library. The mechanics that decide which pages inform a report and which ones are discarded.
§ 01 Discovery
Every run begins with discovery — converting the user's question into a set of candidate URLs.
Search provider
The engine uses Tavily as its open-web search backend. Tavily is purpose-built for retrieval-style searches and returns results that are, on average, cleaner than what you'd get from scraping a consumer search engine.
Multi-query expansion
The planner phase produces between one and three sub-queries from the original prompt. Each sub-query is searched independently. Multi-query expansion is what lets ResearchAnything cover a comparative or exploratory question without inflating the cost of a single, overly-broad search.
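As a rough sketch, the fan-out looks like the code below: each sub-query hits the search backend independently and the results are pooled before dedup. The client wrapper and type names here are illustrative assumptions, not the engine's actual API.

```typescript
// Hypothetical fan-out over planner sub-queries. `searchTavily` stands in
// for whatever Tavily client the engine uses; its shape is an assumption.
interface Candidate {
  url: string;
  title: string;
  subQuery: string; // which sub-query surfaced this result
}

async function discover(
  subQueries: string[], // one to three, per the planner
  searchTavily: (q: string) => Promise<{ url: string; title: string }[]>,
): Promise<Candidate[]> {
  const perQuery = await Promise.all(
    subQueries.map(async (q) =>
      (await searchTavily(q)).map((r) => ({ ...r, subQuery: q })),
    ),
  );
  return perQuery.flat(); // pooled candidates, deduped in the next phase
}
```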
§ 02 Pre-scrape dedup
Before any URL is scraped, candidates from all sub-queries are merged and deduplicated. Dedup runs at three levels:
- URL normalization. Tracking parameters are stripped, fragments dropped, and the hostname case-folded.
- Canonical resolution. If two URLs resolve to the same canonical document (e.g. AMP variant vs main page), they're collapsed.
- Title fingerprinting. Different URLs with identical or near-identical titles from the same domain are treated as duplicates.
Dedup happens before network cost is paid. The savings are substantial when the planner produces overlapping sub-queries.
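A minimal sketch of the first and third levels, assuming an illustrative tracking-parameter list and a crude title slug; canonical resolution needs the page's own canonical link, so it's omitted here.

```typescript
// Illustrative dedup pass; the parameter list and fingerprint rule are
// assumptions, not the engine's exact rules.
const TRACKING_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"];

function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  u.hash = "";                           // drop fragments
  u.hostname = u.hostname.toLowerCase(); // fold host case
  TRACKING_PARAMS.forEach((p) => u.searchParams.delete(p));
  return u.toString();
}

function titleFingerprint(host: string, title: string): string {
  // Near-identical titles on the same domain collapse to one key.
  return host + ":" + title.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

function dedup(candidates: { url: string; title: string }[]) {
  const seen = new Set<string>();
  return candidates.filter((c) => {
    const url = normalizeUrl(c.url);
    const fp = titleFingerprint(new URL(url).hostname, c.title);
    if (seen.has(url) || seen.has(fp)) return false;
    seen.add(url);
    seen.add(fp);
    return true;
  });
}
```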
§ 03 Domain reputation
Some domains produce reliably good content; others produce reliably noisy content. The engine maintains a reputation signal per domain that it learns over runs.
Excluded by default
A handful of domain classes are excluded from selection by default — generic forum sites, free blog hosts, and self-publishing platforms (Blogspot-style sites, Medium-style sites, and similar) where the floor on quality is too low for academic-grade research. These can still surface in search, but they won't make it past screening unless they're the only signal available.
Reputation learning
Domains gain reputation through successful inclusion in completed runs and lose it through dropped, low-quality, or contradicted appearances. The signal is global to the platform, not per-user, so your run benefits from a domain's track record across everyone else's runs.
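One plausible shape for this signal is an exponential moving average per domain; the neutral prior, update rates, and outcome labels below are assumptions for illustration, not the platform's actual formula.

```typescript
// Hypothetical reputation update; the real rule isn't documented here.
const reputation = new Map<string, number>(); // domain -> score in [0, 1]

type Outcome = "included" | "dropped" | "contradicted";

function updateReputation(domain: string, outcome: Outcome): void {
  const prev = reputation.get(domain) ?? 0.5;          // neutral prior
  const target = outcome === "included" ? 1 : 0;
  const rate = outcome === "contradicted" ? 0.3 : 0.1; // contradiction moves the score harder
  reputation.set(domain, prev + rate * (target - prev));
}
```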
§ 04 Scoring components
Surviving sources are scored on three axes plus a composite.
Credibility
Signals about the source's trustworthiness — domain reputation, presence of an author byline, presence of citations of its own, structured publication metadata, and other markers of editorial care.
Relevance
How closely the page's content matches the original question and the planner's sub-queries. Computed against the extracted text, not the title alone.
Recency
How recent the publication date is. The weight on recency is tunable based on the question type — a query about current events weights it heavily; a query about a historical fact weights it lightly.
Overall composite
The three axes combine into a single composite score that drives ranking. The composite is what the run sorts by; the individual components are visible to you so you can audit a source's standing.
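A sketch of one way the composite could combine the axes: a weighted sum, with the recency weight tuned by question type. The coefficients below are illustrative assumptions, not the engine's actual weights.

```typescript
// Weighted-sum composite; all weights are illustrative.
interface AxisScores {
  credibilityScore: number; // 0..1
  relevanceScore: number;   // 0..1
  recencyScore: number;     // 0..1
}

type QuestionType = "current-events" | "historical" | "general";

function overallScore(s: AxisScores, q: QuestionType): number {
  const wRecency = q === "current-events" ? 0.3 : q === "historical" ? 0.05 : 0.15;
  const wCredibility = 0.4;
  const wRelevance = 1 - wCredibility - wRecency; // remainder goes to relevance
  return (
    wCredibility * s.credibilityScore +
    wRelevance * s.relevanceScore +
    wRecency * s.recencyScore
  );
}
```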
§ 05 Top-N selection
The selection phase keeps the top five to twelve sources by composite score. The exact N depends on the question's breadth — a narrow factual question uses fewer sources, an exploratory question uses more. Selection is deterministic for a given run; the same survivors will be picked if the run is recovered from a checkpoint.
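Determinism falls out of a stable sort with an explicit tie-break, roughly as below; the tie-break on URL is an assumption about how the engine keeps checkpoint recovery stable.

```typescript
// Deterministic top-N: sort by composite, break ties on URL so a run
// recovered from a checkpoint picks the same survivors.
function selectTopN<T extends { overallScore: number; url: string }>(
  scored: T[],
  n: number, // 5..12 depending on question breadth
): T[] {
  return [...scored]
    .sort((a, b) => b.overallScore - a.overallScore || a.url.localeCompare(b.url))
    .slice(0, n);
}
```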
§ 06 Selection rationale
Every selected source carries a selectionWhy field — a short, plain-language explanation of why it advanced. Typical rationales mention a strong relevance match, a credible publisher, recency for a fast-moving topic, or unique evidence not present in the other survivors.
Rationales matter for review: an editor or analyst reading a finished report can ask "why is this source here?" and get an answer that doesn't require re-running the pipeline.
§ 07 Idempotent save to library
When a run completes, every selected source is saved to the user's source library. The save is idempotent — saving a source you already have updates the existing record rather than creating a duplicate. Re-running with overlapping sources is safe.
You can also manually save sources from the run UI. The save endpoint dedupes by URL, so partial saves followed by a full-run save converge to a single library record per source.
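An in-memory sketch of the idempotent save: the library is keyed by normalized URL, so saving an existing source updates the record rather than duplicating it. A real store would be a database table with a unique constraint on the URL; the Map here is a stand-in.

```typescript
// Idempotent save keyed on normalized URL.
interface LibraryRecord {
  url: string;
  title: string;
  savedAt: Date;
}

const library = new Map<string, LibraryRecord>();

function saveToLibrary(source: { url: string; title: string }): void {
  const key = source.url; // assume already normalized, as in the dedup step
  const existing = library.get(key);
  // Upsert: merge over the existing record if present, insert otherwise.
  library.set(key, { ...existing, ...source, savedAt: new Date() });
}
```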
§ 08 Re-using saved sources
Library sources can flow back into future runs and into projects. When a future run discovers a source that already lives in your library, the engine can re-use the cached extraction and metadata rather than scraping again — faster, and stable across runs.
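The re-use step amounts to a cache-first lookup before scraping, roughly as sketched here; `fetchAndExtract` is a hypothetical scraper standing in for the engine's actual one.

```typescript
// Cache-first extraction: re-use the library's stored extraction when
// the URL is already saved, otherwise fall back to a fresh scrape.
async function getExtraction(
  url: string,
  library: Map<string, { extractedText: string }>,
  fetchAndExtract: (u: string) => Promise<{ extractedText: string }>,
): Promise<{ extractedText: string }> {
  const cached = library.get(url);
  if (cached) return cached;   // stable across runs, no network cost
  return fetchAndExtract(url); // fresh scrape for new sources
}
```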
Inside a project, library sources can be added directly without going through a research run at all. See Projects & documents.
§ 09 Source metadata fields
The full source record includes the citation metadata fields plus a number of run- and scoring-specific fields; a typed sketch follows the list.
- url — canonical URL.
- title — extracted title.
- author — byline or author list.
- publisher — publisher or site name.
- publishedAt — publication date when available.
- accessDate — scrape time.
- credibilityScore, relevanceScore, recencyScore, overallScore.
- selectionWhy — short selection rationale.
- citations — pre-built citations in every supported format.
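Read as a type, the record might look like the sketch below. Field names follow the list above; the optionality of each field and the shape of `citations` are assumptions, since the supported citation formats aren't enumerated here.

```typescript
// Typed sketch of the source record; shapes are illustrative.
interface SourceRecord {
  url: string;           // canonical URL
  title: string;         // extracted title
  author?: string;       // byline or author list
  publisher?: string;    // publisher or site name
  publishedAt?: string;  // ISO date, when available
  accessDate: string;    // scrape time
  credibilityScore: number;
  relevanceScore: number;
  recencyScore: number;
  overallScore: number;
  selectionWhy: string;  // short selection rationale
  citations: Record<string, string>; // format name -> pre-built citation
}
```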
For the API endpoints that return source records, see API reference.