Help · Scientific Contribution Graph Explorer

This is a browser-based explorer for the Scientific Contribution Graph: a large knowledge graph that maps how each scientific contribution builds on prior contributions, and how later contributions in turn build on it.

What is this?

The Scientific Contribution Graph extracts the scientific contributions from each paper in a large corpus and links them together by prerequisite relationships. For example, if a recent paper develops some contribution X, and X was built in part from contributions A, B, and C in earlier papers, the graph records each of those A→X, B→X, C→X links — so you can trace how later technologies were built from earlier ones. The current release covers ~230k papers, ~2M contributions, and ~12.5M prerequisite edges, drawn mostly from open-access NLP and AI literature.

Each contribution has a name, a description, a list of types (e.g. dataset, method, analysis), and a list of prerequisite contributions it builds on. Each prerequisite has a strength (strong if the prerequisite is directly required; weak if it's only loosely related).
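As a rough illustration of the record described above, the fields could be modelled as follows. This is a minimal sketch, not the release's actual schema: the class names, field names, and the strong_prerequisites helper are hypothetical, and only the listed fields (name, description, types, prerequisites, strength) come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Prerequisite:
    # Hypothetical field names; the release's real keys may differ.
    target_name: str  # name of the earlier contribution being built on
    strength: str     # "strong" (directly required) or "weak" (loosely related)


@dataclass
class Contribution:
    name: str
    description: str
    types: List[str] = field(default_factory=list)  # e.g. ["dataset", "method"]
    prerequisites: List[Prerequisite] = field(default_factory=list)

    def strong_prerequisites(self) -> List[Prerequisite]:
        """Return only the prerequisites marked as directly required."""
        return [p for p in self.prerequisites if p.strength == "strong"]


# Toy example mirroring the A -> X, B -> X, C -> X links described earlier.
x = Contribution(
    name="X",
    description="A hypothetical method built on A, B, and C.",
    types=["method"],
    prerequisites=[
        Prerequisite("A", "strong"),
        Prerequisite("B", "strong"),
        Prerequisite("C", "weak"),
    ],
)
strong_names = [p.target_name for p in x.strong_prerequisites()]
print(strong_names)
```

For the real per-contribution payload, the Python API and the JSON files in the release are the authoritative source.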

This explorer lets you:

Search any paper by title.
See every contribution in that paper, ranked by downstream impact (how many later contributions build on it, directly or indirectly).
Render a forward or backward citation graph for any contribution, with live-adjustable knobs.

For background on the data, methodology, and impact metric, see the paper, or the GitHub repo (which also exposes the full Python API, beyond what this demo surfaces).

How to use it

Search for a paper. Type at least two characters in the Search papers by title… box on the left. Matches appear live; click a row to select it.

Pick a contribution. The top of the main panel lists every contribution in the selected paper. The Impact column shows the total number of later contributions that build on it, directly or indirectly. The Dampened column applies a reciprocal-rank-by-depth weighting to the same tally: a contribution at depth d counts as 1/d, so direct dependents count as 1, depth-2 contributions as 0.5, depth-3 as 0.33, and so on. Click any row to render its citation graph.
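To make the two columns concrete, here is a small sketch of how such tallies could be computed on a toy graph. It is an illustration under stated assumptions, not the explorer's implementation: the impact_scores function and the toy graph are hypothetical, and the sketch assumes each downstream contribution is counted once, at its shortest depth.

```python
from collections import deque
from typing import Dict, List, Tuple


def impact_scores(dependents: Dict[str, List[str]], start: str) -> Tuple[int, float]:
    """Return (impact, dampened) for `start`.

    `dependents` maps a contribution to the contributions that directly
    build on it. Assumption: each downstream contribution is counted once,
    at the shortest depth at which a breadth-first crawl reaches it.
    """
    depth = {start: 0}
    queue = deque([start])
    impact = 0
    dampened = 0.0
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                impact += 1                      # plain count
                dampened += 1.0 / depth[child]   # 1, 0.5, 0.33, ...
                queue.append(child)
    return impact, dampened


# Toy graph: X and Y build on A, Z builds on X and Y, W builds on Z.
toy = {"A": ["X", "Y"], "X": ["Z"], "Y": ["Z"], "Z": ["W"]}
impact, dampened = impact_scores(toy, "A")
print(impact, round(dampened, 2))
```

In this toy graph, A has two direct dependents, one at depth 2, and one at depth 3, so the plain count is 4 while the dampened score is 1 + 1 + 0.5 + 0.33.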

Tune the visualization. The sidebar on the right has live knobs — changes re-render automatically.

Direction: Backward shows what this contribution was built on; Forward shows what was built on it.
Layout: Tree (Graphviz; the fastest), Tree (with edge labels), which annotates each edge with a short summary of the prerequisite relation, or Radial (force-directed; better suited to medium-sized graphs).
Depth: how many hops to traverse. Graph size can grow roughly exponentially with depth, so small increases can produce much larger graphs.
Min children: drop nodes whose subtree has fewer than this many children. Useful for trimming peripheral leaves in big graphs. The first render of each contribution automatically raises this value to keep the initial view readable; once you touch the slider, your value is honored exactly.
Strong connections only: restrict the graph to prerequisites that are directly required, excluding weak (loosely related) links.
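The interaction between Depth, Min children, and Strong connections only can be sketched as a depth-limited crawl followed by a pruning pass. This is a hypothetical illustration, not the explorer's code: the crawl function, its parameters, and the toy graph are invented here, and "children in a subtree" is read as the number of descendants in the crawled tree, which is only one possible interpretation of the knob.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Toy adjacency: node -> list of (neighbor, strength). In the Backward
# direction the neighbors would be prerequisites.
Graph = Dict[str, List[Tuple[str, str]]]


def crawl(graph: Graph, root: str, depth: int,
          strong_only: bool = False, min_children: int = 0) -> List[Tuple[str, str]]:
    """Return (parent, child) edges of a depth-limited, pruned crawl tree."""
    parent: Dict[str, Optional[str]] = {root: None}
    level = {root: 0}
    order = [root]  # breadth-first discovery order
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if level[node] == depth:
            continue  # Depth knob: stop expanding at the hop limit
        for neighbor, strength in graph.get(node, []):
            if strong_only and strength != "strong":
                continue  # Strong connections only: skip weak links
            if neighbor in parent:
                continue  # already reached by a shorter or equal path
            parent[neighbor] = node
            level[neighbor] = level[node] + 1
            order.append(neighbor)
            queue.append(neighbor)

    # Count descendants bottom-up (children always follow parents in `order`).
    descendants = {n: 0 for n in order}
    for node in reversed(order):
        p = parent[node]
        if p is not None:
            descendants[p] += descendants[node] + 1

    # Min children knob (assumed reading): keep the root, drop sparse subtrees.
    kept = {n for n in order if n == root or descendants[n] >= min_children}
    return [(parent[n], n) for n in order if n != root and n in kept]


toy: Graph = {
    "X": [("A", "strong"), ("B", "strong"), ("C", "weak")],
    "A": [("D", "strong"), ("E", "strong")],
    "B": [("F", "weak")],
}
print(crawl(toy, "X", depth=2, strong_only=True))
print(crawl(toy, "X", depth=2, min_children=1))
```

Under this reading, a kept node's parent always has more descendants than the node itself, so pruning never disconnects the tree from the root.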

Pan, zoom, and download. Scroll-wheel to zoom, drag to pan; the + / − / ⤢ toolbar in the top-right of the viz area zooms in/out and fits the view. Download SVG in the sidebar saves a vector copy of the current render.

Common issues

"Impact too large to compute quickly"

For very highly cited contributions (e.g., BERT, Transformers), the impact-metric crawl can take much longer than the UI's configured timeout budget. When that happens, the contributions table shows "large" in the impact columns rather than freezing the page. You can still click any contribution to render its citation graph; the timeout only affects the impact tally. For exact numbers, use the Python API directly, which has no timeout.

"Timed out — graph too large for this web demo"

The render has a configurable timeout budget. If the crawl plus layout exceeds it, the underlying worker is killed and you see this message. To fit within the budget, try one or more of:

Lower Depth (depth 1 or 2 is usually enough).
Raise Min children (trims peripheral nodes).
Enable Strong connections only.
Switch to the Tree layout (it's the fastest of the three).

Again, for very large graphs the Python API has no timeout — you can render them locally with full control over layout parameters.
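The general pattern of running work under a time budget and killing the worker when the budget is exceeded can be sketched with Python's standard library. This illustrates the idea only; the demo's actual worker mechanism is not documented here, and run_with_budget and its return values are hypothetical.

```python
import subprocess
import sys


def run_with_budget(code: str, budget_seconds: float) -> str:
    """Run `code` in a child Python process and enforce a time budget.

    subprocess.run kills the child and raises TimeoutExpired when the
    timeout elapses, so a runaway job cannot block the caller.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=budget_seconds,
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    return result.stdout.strip()


# A quick job finishes within its budget; a slow one is killed.
fast = run_with_budget("print('rendered')", budget_seconds=30.0)
slow = run_with_budget("import time; time.sleep(30)", budget_seconds=1.0)
print(fast, "|", slow)
```

When you run the Python API locally, no such budget applies, so the equivalent crawl simply runs to completion.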

Why can't I find a specific paper?

The current corpus contains about 230k open-access papers, centred on natural language processing, so if the paper you're looking for is closed-access, sits in a different subfield, or is very new, it may simply not be in the release yet. The title search is also a token-coverage match, not a semantic search: try just a few distinctive words from the title. If you have the paper's Semantic Scholar corpus_id, you can also call GET /api/paper/{corpus_id} directly.
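To show why a few distinctive words work better than a paraphrase, here is one plausible form of token-coverage matching. It is a sketch under assumptions: the tokenizer, the scoring rule (fraction of query tokens found in the title), and the threshold are all hypothetical, and the demo's real matcher may differ. The paper_endpoint helper only builds the path quoted above; the host is not given here, and the corpus_id in the example is a placeholder.

```python
import re
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def coverage(query: str, title: str) -> float:
    """Fraction of query tokens that also appear in the title."""
    query_tokens = tokens(query)
    if not query_tokens:
        return 0.0
    title_tokens = set(tokens(title))
    return sum(1 for t in query_tokens if t in title_tokens) / len(query_tokens)


def search(query: str, titles: List[str], threshold: float = 1.0) -> List[str]:
    """Return titles whose coverage meets the threshold, best first."""
    scored = [(coverage(query, title), title) for title in titles]
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [title for score, title in scored if score >= threshold]


def paper_endpoint(corpus_id: int) -> str:
    """Build the path for the direct lookup route mentioned above."""
    return f"/api/paper/{corpus_id}"


titles = [
    "Attention Is All You Need",
    "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
]
print(search("bert transformers", titles))
print(coverage("neural attention", titles[0]))  # a paraphrased word lowers coverage
print(paper_endpoint(12345))  # 12345 is a placeholder corpus_id
```

A query word that is a synonym or paraphrase of a title word contributes nothing under this kind of matching, which is why exact, distinctive title words are the safer choice.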

Why does the impact score look lower than I'd expect?

The downstream-impact tally only counts contributions in the current corpus that build on the target. Papers outside the open-access NLP focus, or that haven't been crawled yet, won't be represented — so a foundational paper that is heavily cited in closed-access venues or in adjacent subfields will look smaller here than it really is. Coverage will continue to fill in as the crawl expands.

Why doesn't this paper show contributions I know it built on (or its full impact)?

Three factors are usually at play:

Open-access focus. The crawl is over open-access papers, so coverage of the ACL Anthology is good, but coverage of the broader AI / ML literature is thinner. Prerequisites that point to closed-access work have no matched contribution in the graph to link to.
Backward-first crawl. The original crawl ran in the backward (citation) direction: for each paper, it identifies the prior contributions that paper was built on. This is optimised for the "what does this build on?" question (the technological roadmapping direction) rather than for forward-impact measurement, so forward impact may be systematically undercounted in this initial release.
Mixed-direction continuing crawl. Subsequent expansion passes interleave forward and backward crawls to balance technological roadmapping and impact assessment, so coverage in the forward direction will keep improving with each release.

Where is all the data for each contribution?

Each contribution has a rich schema — name, description, types, sections, prerequisites with their own descriptions, explanations, strengths, matched references, and more — and only a small slice of that is shown in the visualizations and tables here. For the full per-contribution payload, use the Python API on GitHub, or download the release and read the source files directly — they're stored as easily-readable JSON.

Is the crawl ongoing?

Yes — the corpus is still being actively crawled and expanded, and we expect to update the public release at regular milestones. If a paper or contribution is missing today, it may well be present in a future release; check back periodically, or watch the GitHub repo for release announcements.

The graph looks empty or has only one node

Some contributions have no downstream or upstream contributions within the current depth, or all of their links are weak and you've enabled Strong connections only. Try increasing Depth, lowering Min children, or turning off Strong connections only.

The first render after picking a contribution is slow

The first render of a fresh contribution performs the actual graph traversal; subsequent re-renders only redo the layout. If you plan to explore a contribution thoroughly, expect the first render to take a few seconds and later knob changes to be much faster.