living_review package

Submodules

living_review.classifier module

classifier.py

This module provides semantic filtering and classification utilities for scientific papers in the context of machine-learning applications to particle accelerator physics.

It uses a pre-trained sentence-transformers model (MiniLM-L6-v2) to compute embeddings for papers, accelerator/ML reference queries, and category descriptions. Papers are first filtered for relevance (accelerators ∧ ML, excluding domain noise), and then assigned to categories using semantic similarity and keyword heuristics.

Key Features

  • Device selection (CPU/GPU/MPS) for embedding computation

  • On-demand lazy loading of the semantic model

  • Semantic relevance filtering using accelerator/ML/noise queries

  • Category classification with thresholds, keyword overrides, and deduplication

Typical Usage

>>> from living_review.data_model import Paper
>>> from living_review.classifier import filter_relevant_papers, classify_papers
>>> papers = load_some_papers()
>>> relevant = filter_relevant_papers(papers)
>>> classify_papers(relevant)
living_review.classifier.classify_papers(papers, threshold=0.25, max_cats=2)

Assign semantic categories to each paper.

Uses a combination of:

  • semantic similarity with predefined category descriptions,

  • special handling for review papers,

  • keyword overrides (e.g. “surrogate model” → Surrogate Models).

Should be applied after filter_relevant_papers().

Parameters:
  • papers (list of Paper) – Papers to classify in-place (field categories updated).

  • threshold (float, optional) – Minimum similarity required to assign a category (default=0.25).

  • max_cats (int, optional) – Maximum number of categories to keep per paper (default=2).

Returns:

Papers are modified in place. Each .categories becomes a list of dicts of the form {"label": str, "score": float}.

Return type:

None

Notes

  • If no category passes the thresholds, a default Others category with score 0.0 is assigned.

  • Deduplication ensures the highest score per label is kept.
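The selection rules above (threshold, per-label deduplication, max_cats, and the Others fallback) can be sketched as follows. This is a minimal illustration rather than the module's implementation: select_categories is a hypothetical helper, the labels other than Surrogate Models and Others are invented for the example, and the similarity scores are assumed to be already computed as (label, score) pairs.

```python
def select_categories(scored, threshold=0.25, max_cats=2):
    """Hypothetical helper: pick categories from (label, score) pairs."""
    best = {}
    for label, score in scored:
        # Deduplicate: keep only the highest score seen for each label.
        if score >= threshold and score > best.get(label, float("-inf")):
            best[label] = score
    if not best:
        # Nothing passed the threshold: fall back to the default category.
        return [{"label": "Others", "score": 0.0}]
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [{"label": label, "score": score} for label, score in ranked[:max_cats]]


# Illustrative scores; "Optimization" and "Diagnostics" are made-up labels.
scored = [
    ("Surrogate Models", 0.41),
    ("Optimization", 0.30),
    ("Surrogate Models", 0.35),  # duplicate label with a lower score
    ("Diagnostics", 0.10),       # below the threshold
]
picked = select_categories(scored)
```

Here picked keeps Surrogate Models (0.41) and Optimization (0.30); the duplicate and the sub-threshold entry are dropped.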

living_review.classifier.device_str()

Select the most appropriate device for embedding computation.

Returns:

“mps” if Apple Metal backend is available, “cuda” if NVIDIA GPU CUDA backend is available, otherwise “cpu”.

Return type:

str
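The selection order described above can be sketched with PyTorch's availability checks. This is an illustrative sketch, not the module's code: pick_device is a hypothetical name, and the fallback to "cpu" when PyTorch is not installed is an assumption made so the example runs anywhere.

```python
def pick_device():
    """Sketch of the documented order: "mps", then "cuda", then "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch, use the CPU.
        return "cpu"
    # torch.backends.mps is absent in older PyTorch releases, hence getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
```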

living_review.classifier.dual_semantic_scores(texts)

Compute semantic relevance scores of input texts with respect to accelerator physics, machine learning, and noise queries.

Parameters:

texts (list of str) – List of textual inputs (title + abstract concatenated).

Returns:

(scores_accel, scores_ml, scores_noise), each a list of floats aligned with the input order.

Return type:

tuple of lists
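To illustrate the shape of the result, the sketch below scores toy embedding vectors against three reference vectors using cosine similarity. It is a simplified stand-in: the real function embeds text with the sentence-transformers model, whereas here the vectors are hand-written, and cosine and score_against_queries are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_against_queries(text_vecs, accel_vec, ml_vec, noise_vec):
    """Return three score lists aligned with the input order."""
    scores_accel = [cosine(v, accel_vec) for v in text_vecs]
    scores_ml = [cosine(v, ml_vec) for v in text_vecs]
    scores_noise = [cosine(v, noise_vec) for v in text_vecs]
    return scores_accel, scores_ml, scores_noise


# Toy 3-dimensional "embeddings"; real MiniLM embeddings are much larger.
text_vecs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
accel, ml, noise = score_against_queries(
    text_vecs, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
)
```

The second text points halfway between the accelerator and ML directions, so it scores about 0.71 on both and 0.0 on noise.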

living_review.classifier.filter_relevant_papers(papers, accel_threshold=0.13, ml_threshold=0.18)

Filter a list of papers to retain only those relevant to both accelerator physics and machine learning, while excluding noisy domains (detectors, spectroscopy, HEP analysis, etc.).

Parameters:
  • papers (list of Paper) – Papers to filter. Each must expose .title and .abstract.

  • accel_threshold (float, optional) – Minimum cosine similarity with the accelerator query (default=0.13).

  • ml_threshold (float, optional) – Minimum cosine similarity with the ML query (default=0.18).

Returns:

Subset of input papers deemed relevant.

Return type:

list of Paper
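The thresholding step can be sketched as below, assuming the three scores per paper are already available. The accelerator and ML thresholds follow the documented defaults; how noise scores are used is not specified here, so the rule "reject when the noise score exceeds both topic scores" is purely an assumption for illustration, and is_relevant is a hypothetical name.

```python
def is_relevant(accel, ml, noise, accel_threshold=0.13, ml_threshold=0.18):
    """Hypothetical relevance test on precomputed similarity scores."""
    # Both topic scores must clear their thresholds (accelerators AND ML).
    on_topic = accel >= accel_threshold and ml >= ml_threshold
    # ASSUMPTION: a paper counts as noise when the noise score dominates
    # both topic scores. The actual exclusion rule may differ.
    dominated_by_noise = noise > max(accel, ml)
    return on_topic and not dominated_by_noise


# (accel, ml, noise) triples for three imaginary papers.
scores = [
    (0.30, 0.25, 0.05),  # on topic
    (0.30, 0.10, 0.05),  # ML score below threshold
    (0.20, 0.22, 0.40),  # on topic, but noise dominates
]
kept = [i for i, s in enumerate(scores) if is_relevant(*s)]
```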

living_review.classifier.load_sem_model()

Lazy-load the sentence transformer model used for semantic similarity.

Loads all-MiniLM-L6-v2 from HuggingFace Hub on the first call and caches it globally. Subsequent calls return the cached model.

Returns:

The loaded MiniLM model, bound to the appropriate device.

Return type:

SentenceTransformer
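The load-once-then-cache pattern described above can be sketched with a module-level variable. The loader here is a stand-in that returns a plain object rather than a SentenceTransformer, and the names _MODEL, _load_expensive_model, and get_model are hypothetical.

```python
_MODEL = None  # module-level cache, empty until first use


def _load_expensive_model():
    # Stand-in for the real download and initialisation of the model.
    return object()


def get_model():
    """Load the model on the first call; return the cached one afterwards."""
    global _MODEL
    if _MODEL is None:
        _MODEL = _load_expensive_model()
    return _MODEL


first = get_model()
second = get_model()  # no second load: the cached object is returned
```

Deferring the load this way keeps imports of the module cheap; the cost is paid only when an embedding is first needed.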

living_review.cli module

cli.py

Command-line interface (CLI) for the Living Review pipeline.

This script allows users to run the Living Review: ML/AI for Accelerator Physics pipeline directly from the terminal. It provides options to configure the scan window, data sources, classification thresholds, and output location.

Main Features

  • Configure the date range (number of days back from today).

  • Select which data sources to query (arXiv, Inspire, HAL, OpenAlex, Crossref).

  • Override default accelerator/ML relevance thresholds.

  • Control output directory and chunk size for large runs.

  • Optionally enable incremental mode (not yet fully implemented).

  • Enable/disable PDF and BibTeX exports.

Typical Usage

Run a full scan of the last 30 days from all sources with default thresholds:

$ python -m living_review.cli

Run a 60-day scan only from arXiv and Inspire with custom thresholds:

$ python -m living_review.cli --days 60 --sources arxiv,inspire \
    --accel-threshold 0.15 --ml-threshold 0.20 --output results

Disable PDF export but keep BibTeX:

$ python -m living_review.cli --no-pdf

living_review.cli.main()

Entry point for the Living Review CLI.

Parses command-line arguments, builds the date range and configuration, initializes the LivingReviewPipeline, and runs it.
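A minimal argparse sketch of the options shown in the usage examples is given below. The flag names come from those examples and the threshold defaults from the documented values; the default source list, the default output directory, and the name build_parser are assumptions, and the real CLI exposes further options (chunk size, incremental mode, BibTeX export) not reproduced here.

```python
import argparse


def build_parser():
    """Hypothetical parser covering the flags used in the examples above."""
    parser = argparse.ArgumentParser(prog="living_review.cli")
    parser.add_argument("--days", type=int, default=30,
                        help="number of days back from today")
    # ASSUMPTION: sources are passed as one comma-separated string.
    parser.add_argument("--sources", default="arxiv,inspire,hal,openalex,crossref")
    parser.add_argument("--accel-threshold", type=float, default=0.13)
    parser.add_argument("--ml-threshold", type=float, default=0.18)
    parser.add_argument("--output", default=".")  # assumed default
    parser.add_argument("--no-pdf", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--days", "60", "--sources", "arxiv,inspire",
     "--accel-threshold", "0.15", "--no-pdf"]
)
sources = args.sources.split(",")
```

argparse converts dashes in option names to underscores, so the values are read as args.accel_threshold and args.no_pdf.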

living_review.config module

config.py

Central configuration module for the Living Review project.

This file collects all constants, keywords, category descriptions, semantic queries, and thresholds used across the pipeline. Keeping them centralized ensures consistency between different modules (fetchers, classifier, pipeline, etc.).

Contents

  • Accelerator / ML keywords

  • Negative keywords (to filter out noise domains)

  • Reference semantic queries (used for similarity scoring)

  • Category descriptions (used for classification)

  • Default thresholds and constants (date window, API page sizes)

Typical Usage

>>> from living_review import config
>>> config.ACCEL_KEYWORDS[:5]
['accelerator', 'linac', 'synchrotron', 'collider', 'storage ring']
living_review.config.ARXIV_PAGE_SIZE = 100

Maximum number of results per page in arXiv API queries.

Type:

int

living_review.config.DATE_WINDOW_DAYS = 7

Default sliding window (in days) for fetching new papers.

Type:

int

living_review.config.DEFAULT_THRESHOLDS = {'accel': 0.13, 'ml': 0.18}

Default semantic similarity thresholds for relevance filtering.

Type:

dict

living_review.data_model module

data_model.py

Data model definitions for the Living Review project.

This module defines the Paper dataclass, the central representation of a scientific paper throughout the pipeline. It ensures consistent handling of metadata, provenance, and status progression, and provides helpers for deduplication and serialization.

Contents

  • Paper: dataclass representing a paper with metadata, provenance, and audit trail.

  • status_rank: helper to order publication statuses.

  • _canonical_key: helper to generate fallback IDs for deduplication.

Canonical use

  • Every paper is represented internally as a Paper object.

  • Papers are serialized into the canonical JSON DB (site/data/livingreview.json) via Paper.to_dict().

  • Papers can be reconstructed from the DB via Paper.from_dict(), guaranteeing a stable round-trip between memory and storage.

Typical Usage

>>> from living_review.data_model import Paper
>>> raw = {"title": "AI for Beam Dynamics", "authors": ["A. Researcher"],
...        "arxiv_id": "1234.5678", "source": "arxiv"}
>>> p = Paper.from_source(raw)
>>> p.key_for_dedup()
('1234.5678', '', 'ai for beam dynamics')
>>> d = p.to_dict()
>>> Paper.from_dict(d).id
'arxiv:1234.5678'
class living_review.data_model.Paper(id, doi=None, arxiv_id=None, inspire_id=None, title='', authors=<factory>, abstract=None, date=None, year=None, venue=None, status=None, categories=<factory>, keywords=<factory>, curated=False, notes=None, links=<factory>, sources=<factory>, history=<factory>, last_updated=None)

Bases: object

Representation of a scientific paper.

id

Canonical identifier (e.g. “doi:…”, “arxiv:…”, or “hash:…”).

Type:

str

doi

Digital Object Identifier if available.

Type:

str, optional

arxiv_id

arXiv identifier if available.

Type:

str, optional

inspire_id

INSPIRE identifier if available.

Type:

str, optional

title

Title of the paper.

Type:

str

authors

List of author names.

Type:

list of str

abstract

Abstract or summary of the paper.

Type:

str, optional

date

ISO date string (YYYY-MM-DD).

Type:

str, optional

year

Publication year.

Type:

int, optional

venue

Journal or conference venue.

Type:

str, optional

status

Publication status (pending, preprint, published…).

Type:

str, optional

categories

Classification categories assigned to the paper.

Type:

list of str

keywords

List of keywords associated with the paper.

Type:

list of str

curated

Whether this entry has been manually curated (protected from overwrite).

Type:

bool

notes

Free-text notes by curators.

Type:

str, optional

links

Dictionary of useful links (arXiv, DOI, PDF, publisher).

Type:

dict

sources

Provenance info (which fetcher, when).

Type:

list of dict

history

Change history (merges, status updates).

Type:

list of dict

last_updated

Timestamp of last update in ISO format.

Type:

str, optional

abstract: Optional[str] = None
arxiv_id: Optional[str] = None
authors: List[str]
categories: List[str]
curated: bool = False
date: Optional[str] = None
doi: Optional[str] = None
static from_dict(d)

Reconstruct a Paper object from its dictionary representation.

Used when loading from the canonical DB (site/data/livingreview.json).

Parameters:

d (dict) – Dictionary with Paper fields, as produced by to_dict().

Returns:

A Paper instance with all fields populated.

Return type:

Paper

static from_source(raw)

Build a Paper from raw metadata (dict).

Normalizes identifiers, title, and authors, and ensures provenance. Assigns a canonical id of the form:

  • doi:… (if a DOI is available),

  • arxiv:… (if an arXiv ID is available),

  • hash:… (fallback hash if neither is available).

Parameters:

raw (dict) – Raw metadata from a fetcher.

Returns:

A new Paper object.

Return type:

Paper
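The identifier priority used by from_source (DOI first, then arXiv ID, then a hash) can be sketched as follows. The function name canonical_id is hypothetical, and the fallback shown here, a truncated SHA-1 of the lowercased title, is an assumption: the input and format of the real fallback hash are not documented in this section.

```python
import hashlib


def canonical_id(doi=None, arxiv_id=None, title=""):
    """Hypothetical sketch of the doi > arxiv > hash priority."""
    if doi:
        return f"doi:{doi}"
    if arxiv_id:
        return f"arxiv:{arxiv_id}"
    # ASSUMPTION: hash the lowercased title; the real fallback may differ.
    digest = hashlib.sha1(title.strip().lower().encode("utf-8")).hexdigest()[:12]
    return f"hash:{digest}"


from_arxiv = canonical_id(arxiv_id="1234.5678")
from_both = canonical_id(doi="10.1103/example", arxiv_id="1234.5678")
from_title = canonical_id(title="AI for Beam Dynamics")
```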

history: List[Dict[str, str]]
id: str
inspire_id: Optional[str] = None
key_for_dedup()

Generate a key for deduplication.

Returns:

(arxiv_id, doi, simplified_title)

Return type:

tuple of str

keywords: List[str]
last_updated: Optional[str] = None
links: Dict[str, str]
notes: Optional[str] = None
sources: List[Dict[str, str]]
status: Optional[str] = None
title: str = ''
to_dict()

Serialize Paper to a JSON-safe dict.

This is the method used when writing to the canonical DB (site/data/livingreview.json). It ensures:

  • normalized identifiers,

  • categories and keywords are always included as lists,

  • timestamps in ISO format.

Returns:

Dictionary representation of the Paper.

Return type:

dict

venue: Optional[str] = None
year: Optional[int] = None
living_review.data_model.status_rank(status)

Return integer rank of a status (higher = more advanced).

Parameters:

status (str or None) – Status string (pending, preprint, published…).

Returns:

Position in STATUS_ORDER, or -1 if unknown.

Return type:

int
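A ranking of this kind reduces to a list lookup. In the sketch below the contents of STATUS_ORDER are an assumption based on the example statuses named above (the actual list may contain more entries), and rank is a hypothetical name.

```python
# ASSUMPTION: only the three statuses named in the docs, in this order.
STATUS_ORDER = ["pending", "preprint", "published"]


def rank(status):
    """Return the position of status in STATUS_ORDER, or -1 if unknown."""
    try:
        return STATUS_ORDER.index(status)
    except ValueError:
        # Covers both unknown strings and None.
        return -1


# When merging two records, keep the more advanced status.
merged_status = max("preprint", "published", key=rank)
```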

living_review.exporters module

exporters.py

Output/export utilities for the Living Review project.

This module provides functions to export processed papers and statistics into formats directly consumable by the Hugo site and citation managers.

Export Targets

  • JSON → site/data/livingreview.json + site/data/statistics.json

    The canonical JSON database, containing papers + statistics. Used by Hugo templates, Decap CMS, and downstream visualisations.

  • BibTeX → site/static/downloads/livingreview.bib

    A citation file containing all papers in BibTeX format.

  • PDF → site/static/downloads/livingreview.pdf

    A printable PDF summary of the review.

Typical Usage

>>> from living_review.exporters import export_json, export_bibtex, export_pdf
>>> export_json(papers, stats, outdir=".")
>>> export_bibtex(papers, outdir=".")
>>> export_pdf(papers, stats, outdir=".")
living_review.exporters.export_bibtex(papers, outdir)

Export papers into a BibTeX file for citation management.

Output file

  • site/static/downloads/livingreview.bib

living_review.exporters.export_json(papers, stats, outdir, chunking=None)

Export the canonical JSON database for Hugo and Decap CMS.

Output files

  • site/data/livingreview.json : Full DB (stats + papers)

  • site/data/statistics.json : Simplified global stats

Metadata

  • Adds last_updated as UTC ISO timestamp.

  • Adds next_update based on environment variable UPDATE_INTERVAL_HOURS (default: 24h).
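The two metadata fields can be derived as in the sketch below. The field names and the UPDATE_INTERVAL_HOURS variable with its 24-hour default come from the description above; the function name update_metadata and the optional now argument (added so the example is reproducible) are assumptions.

```python
import os
from datetime import datetime, timedelta, timezone


def update_metadata(now=None):
    """Hypothetical sketch: build last_updated / next_update timestamps."""
    now = now or datetime.now(timezone.utc)
    # Interval in hours, read from the environment with a 24h default.
    hours = float(os.environ.get("UPDATE_INTERVAL_HOURS", "24"))
    return {
        "last_updated": now.isoformat(),
        "next_update": (now + timedelta(hours=hours)).isoformat(),
    }


meta = update_metadata(datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc))
```

Using a timezone-aware datetime makes isoformat() include the +00:00 offset, so consumers of the JSON do not have to guess the timezone.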

living_review.exporters.export_pdf(papers, stats, outdir)

Export a printable PDF summary of the review.

Output file

  • site/static/downloads/livingreview.pdf

living_review.fetchers module

fetchers.py

Data-source fetchers for the Living Review project.

This module provides functions to query multiple bibliographic APIs and return lists of Paper objects (via Paper.from_source).

Supported sources:

  • arXiv: via the arxiv Python client.

  • InspireHEP: via REST API.

  • HAL (Hyper Articles en Ligne).

  • OpenAlex.

  • Crossref.

Each fetcher:

  • Retrieves results within a given date window.

  • Normalizes metadata into the canonical schema expected by Paper.

  • Populates links, status, and provenance (source).

A shared requests.Session with retry logic is used for robustness.

living_review.fetchers.arxiv_query_for_window()

Build arXiv queries targeting accelerator physics and ML categories.

Returns:

Query strings to be passed to the arxiv client.

Return type:

list of str

living_review.fetchers.fetch_arxiv(start, end)

Fetch papers from arXiv within the given date range.

Parameters:
  • start (datetime.date) – Start date.

  • end (datetime.date) – End date.

Returns:

Papers retrieved from arXiv.

Return type:

list of Paper

living_review.fetchers.fetch_crossref(start, end)

Fetch papers from Crossref API across PRAB, JACoW, and general accelerator+ML topics.

Combines three categories:

  • PRAB (prefix:10.1103 PhysRevAccelBeams)

  • JACoW / IPAC / ICALEPCS / LINAC conference papers

  • Generic ‘accelerator machine learning’ search

Parameters:
  • start (datetime.date)

  • end (datetime.date)

Return type:

list of Paper

living_review.fetchers.fetch_hal(start, end)

Fetch papers from HAL API (filtered to ML + accelerator physics).

Return type:

List[Paper]

living_review.fetchers.fetch_inspire(start, end, rows=50, max_pages=5)

Fetch papers from InspireHEP API (AI/ML applied to accelerators).

Return type:

List[Paper]

living_review.fetchers.fetch_openalex(start, end)

Fetch papers from the OpenAlex API (e.g. over 60-day windows), setting venue to the actual journal or conference rather than the source name.

Return type:

List[Paper]

living_review.fetchers.fetch_pubmed(start, end, rows=50)

Fetch papers from Europe PMC (PubMed interface).

Parameters:
  • start (datetime.date) – Start date.

  • end (datetime.date) – End date.

  • rows (int, optional) – Number of results per page (default=50).

Returns:

Papers retrieved from Europe PMC / PubMed.

Return type:

list of Paper

living_review.fetchers.fetch_semanticscholar(start, end, limit=100)

Fetch papers from the Semantic Scholar Graph API related to machine learning and accelerator physics.

Parameters:
  • start (datetime.date) – Start date.

  • end (datetime.date) – End date.

  • limit (int, optional) – Max number of results to fetch (default=100).

Returns:

Papers retrieved from Semantic Scholar.

Return type:

list of Paper

living_review.fetchers.fetch_springer(start, end, rows=20)

Fetch papers from Springer Nature API (PAM v2).

Parameters:
  • start (datetime.date) – Start date.

  • end (datetime.date) – End date.

  • rows (int, optional) – Number of results to retrieve (default=20).

Returns:

Papers retrieved from Springer.

Return type:

list of Paper

living_review.fetchers.make_session()

Create a requests.Session with retry strategy.

Retries on server errors (500, 502, 503, 504) up to 3 times with exponential backoff.

Returns:

Configured session with retry-enabled adapters.

Return type:

requests.Session
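A session with this retry behaviour is typically assembled from requests' HTTPAdapter and urllib3's Retry, as sketched below. The retry count and status codes follow the description above; the backoff_factor value of 0.5 and the name build_session are assumptions, and the sketch is not necessarily how make_session itself is written.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session():
    """Hypothetical sketch of a requests.Session with retries."""
    retry = Retry(
        total=3,                                 # up to 3 retries
        backoff_factor=0.5,                      # ASSUMED backoff base
        status_forcelist=[500, 502, 503, 504],   # retry on server errors
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Apply the retry policy to both HTTP and HTTPS requests.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = build_session()
```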

living_review.logs module

logs.py

Logging utilities for the Living Review project.

This module manages:

  • Persistent scan logs (scan_log.json) storing metadata about each run.

  • Error logs (errors.log) with stack traces.

  • Retrieval of the last scanned date range.

Contents

  • append_scan_log: record metadata about a scan (papers, chunks, status).

  • log_error: record exceptions and stack traces in a log file.

  • get_last_scan: retrieve the last recorded scan range.

Typical Usage

>>> from living_review import logs
>>> logs.append_scan_log("logs", start, end, npapers=42)
>>> logs.log_error("logs", Exception("Something went wrong"))
>>> last = logs.get_last_scan("logs")
living_review.logs.append_scan_log(logdir, start, end, npapers, nchunks=1, status='ok', error_msg=None)

Append an entry to the scan log (scan_log.json).

Parameters:
  • logdir (str or Path) – Directory where the log files are stored.

  • start (datetime.date or str) – Start date of the scan.

  • end (datetime.date or str) – End date of the scan.

  • npapers (int) – Number of papers processed.

  • nchunks (int, optional) – Number of chunks processed (default=1).

  • status (str, optional) – Status string for the run (default="ok").

  • error_msg (str, optional) – Error message if the scan encountered an issue.

Returns:

Updates scan_log.json with a new entry.

Return type:

None

living_review.logs.get_last_scan(logdir)

Retrieve the last scan range from scan_log.json.

Parameters:

logdir (str or Path) – Directory containing the scan log file.

Returns:

Dictionary with keys {"start": str, "end": str} if available, otherwise None.

Return type:

dict or None
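The append-then-read-last behaviour of the scan log can be sketched as below. This assumes scan_log.json holds a JSON list of entries, which is not confirmed by this section; append_entry and last_scan are hypothetical names, and the entry fields are a subset of the documented parameters.

```python
import json
import tempfile
from pathlib import Path


def append_entry(logdir, start, end, npapers, status="ok"):
    """Hypothetical sketch: append one run to scan_log.json."""
    path = Path(logdir) / "scan_log.json"
    # ASSUMPTION: the log file is a JSON list of entry dicts.
    entries = json.loads(path.read_text()) if path.exists() else []
    entries.append({"start": str(start), "end": str(end),
                    "npapers": npapers, "status": status})
    path.write_text(json.dumps(entries, indent=2))


def last_scan(logdir):
    """Return {"start", "end"} of the most recent entry, or None."""
    path = Path(logdir) / "scan_log.json"
    if not path.exists():
        return None
    entries = json.loads(path.read_text())
    if not entries:
        return None
    return {"start": entries[-1]["start"], "end": entries[-1]["end"]}


with tempfile.TemporaryDirectory() as tmp:
    before = last_scan(tmp)   # no log file yet
    append_entry(tmp, "2025-01-01", "2025-01-07", npapers=42)
    after = last_scan(tmp)
```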

living_review.logs.log_error(logdir, exc)

Append an error entry with stack trace to errors.log.

Parameters:
  • logdir (str or Path) – Directory where the error log is stored.

  • exc (Exception) – Exception object to log.

Returns:

Writes a timestamped error entry to errors.log.

Return type:

None

living_review.pipeline module

pipeline.py

Main orchestration pipeline for the Living Review project.

The LivingReviewPipeline class coordinates the entire workflow:

  1. Load existing DB (if present).

  2. Fetch papers from multiple bibliographic sources (arXiv, InspireHEP, HAL, OpenAlex, Crossref).

  3. Deduplicate and merge into canonical DB (preprint → journal, DOI, etc.).

  4. Optionally ingest CMS-approved manual submissions.

  5. Filter papers for semantic relevance (accelerators ∧ ML).

  6. Classify papers into categories.

  7. Compute statistics.

  8. Export results to Hugo site in multiple formats.

Canonical DB location

The canonical JSON database is stored in the Hugo site’s /data/ folder:

site/data/livingreview.json

This file is:

  • Updated by this pipeline when merging new papers,

  • Exported again at the end of the run (papers + stats),

  • Read by Hugo templates (.Site.Data.livingreview),

  • Editable via Decap CMS.

Other outputs

  • BibTeX → site/static/downloads/livingreview.bib

  • PDF → site/static/downloads/livingreview.pdf

(Note: HTML export has been removed — Hugo now builds pages directly from the JSON DB.)

class living_review.pipeline.LivingReviewPipeline(start, end, sources=None, thresholds=None, output_dir='.', chunking=None, db_path='site/data/livingreview.json', promote_manual=False)

Bases: object

Orchestrates the end-to-end Living Review workflow.

run()

Execute the pipeline end-to-end.

living_review.stats module

stats.py

Computation of summary statistics for the Living Review project.

This module aggregates counts of papers by year, category, venue, and keyword, together with monthly trends. These statistics are used for reporting and visualizations built from the exported JSON.

Contents

  • KEYWORDS: predefined list of relevant keywords to track.

  • compute_stats: aggregate statistics from a list of papers.

Typical Usage

>>> from living_review.stats import compute_stats
>>> stats = compute_stats(papers)
>>> stats["per_year"]
{'2024': 15, '2025': 7}
living_review.stats.compute_stats(papers)

Compute aggregated statistics from a list of papers.

Parameters:

papers (list of Paper) – Papers to analyze. Each must have attributes .year, .categories, .venue, .title, .abstract, and .date (string ISO or datetime).

Returns:

Dictionary with the following keys:

  • "per_year": counts of papers per publication year.

  • "per_category": counts of papers per semantic category.

  • "per_venue/journal": counts of papers per venue/journal.

  • "per_keyword": counts of predefined keywords matched in titles/abstracts.

  • "monthly_trends": counts of papers per month (YYYY-MM).

Return type:

dict
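Two of these aggregates, per_year and monthly_trends, can be sketched with collections.Counter. For brevity the example uses plain dicts in place of Paper objects and assumes dates are ISO strings; count_by_year_and_month is a hypothetical name, and the remaining keys are not reproduced.

```python
from collections import Counter


def count_by_year_and_month(papers):
    """Hypothetical sketch of the per_year and monthly_trends counts."""
    per_year = Counter()
    monthly = Counter()
    for p in papers:
        if p.get("year") is not None:
            per_year[str(p["year"])] += 1
        if p.get("date"):
            # "YYYY-MM-DD" -> "YYYY-MM"
            monthly[p["date"][:7]] += 1
    return {"per_year": dict(per_year), "monthly_trends": dict(monthly)}


# Dicts stand in for Paper objects in this sketch.
papers = [
    {"year": 2024, "date": "2024-11-03"},
    {"year": 2024, "date": "2024-11-20"},
    {"year": 2025, "date": "2025-01-15"},
]
stats = count_by_year_and_month(papers)
```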

living_review.utils module

utils.py

Utility functions for the Living Review project.

This module provides helper functions for:

  • Deduplicating papers based on unique keys.

  • Normalizing identifiers (DOI, arXiv ID).

  • Cleaning up LaTeX markup and titles for comparison.

  • Fuzzy similarity scoring between titles.

  • Checking if a date lies within a given range.

Contents

  • deduplicate: remove duplicate papers by (arxiv_id, doi, normalized_title).

  • within_range: test whether a date falls within [start, end].

  • norm_doi: normalize DOI strings to a canonical form.

  • norm_arxiv_id: normalize arXiv identifiers to a canonical form.

  • simplify_title: lowercase, strip LaTeX and punctuation for fuzzy matching.

  • first_author_key: heuristic to extract first author surname.

  • similar_title: fuzzy similarity score between two titles.

Typical Usage

>>> import datetime as dt
>>> from living_review.utils import norm_doi, simplify_title, within_range
>>> norm_doi("https://doi.org/10.1103/PhysRevLett.123.456")
'10.1103/physrevlett.123.456'
>>> simplify_title(r"A {LaTeX} Example: On $\alpha$-decay")
'a latex example on alpha decay'
>>> start, end = dt.date(2025, 1, 1), dt.date(2025, 1, 31)
>>> within_range(dt.date(2025, 1, 10), start, end)
True
living_review.utils.deduplicate(papers)

Remove duplicate papers based on their deduplication key.

Each Paper must implement .key_for_dedup() which returns a tuple (arxiv_id, doi, normalized_title). Duplicates are detected when this key repeats.

Parameters:

papers (list of Paper) – Papers to deduplicate.

Returns:

Deduplicated list of papers (order preserved: first occurrence kept).

Return type:

list of Paper
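Order-preserving deduplication by key can be sketched with a set of seen keys. The example below works on plain tuples shaped like the documented key rather than on Paper objects, and dedup_by_key is a hypothetical name.

```python
def dedup_by_key(items, key):
    """Keep the first item for each distinct key, preserving order."""
    seen = set()
    unique = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique


# Tuples shaped like (arxiv_id, doi, simplified_title).
records = [
    ("1234.5678", "", "ai for beam dynamics"),
    ("", "10.1/x", "another paper"),
    ("1234.5678", "", "ai for beam dynamics"),  # exact repeat of the first
]
unique = dedup_by_key(records, key=lambda r: r)
```

Because the match is on the exact tuple, two records for the same work that differ in any one field (for example a DOI present in only one of them) are not treated as duplicates by this sketch.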

living_review.utils.first_author_key(authors)

Heuristic key for first author: uses last token of first author’s name. Returns lowercase surname or None if unavailable.

Return type:

Optional[str]

living_review.utils.make_session()

Create a shared requests.Session with retry strategy. Retries on server errors (500, 502, 503, 504) up to 3 times with exponential backoff.

living_review.utils.norm_arxiv_id(ax)

Normalize arXiv identifiers (remove prefix and version).

Return type:

Optional[str]
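Prefix and version stripping can be sketched with two regular expressions. Which prefixes the real function recognises is not documented here; the sketch assumes "arXiv:" and arxiv.org/abs/ URLs, and strip_arxiv_id is a hypothetical name.

```python
import re


def strip_arxiv_id(ax):
    """Hypothetical sketch: remove a leading prefix and a trailing version."""
    if not ax:
        return None
    ax = ax.strip()
    # ASSUMED prefixes: "arXiv:" and abs-page URLs.
    ax = re.sub(r"^(https?://arxiv\.org/abs/|arxiv:)", "", ax, flags=re.IGNORECASE)
    # Drop a version suffix such as "v2".
    ax = re.sub(r"v\d+$", "", ax)
    return ax or None


plain = strip_arxiv_id("arXiv:2401.01234v2")
from_url = strip_arxiv_id("https://arxiv.org/abs/2401.01234v3")
```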

living_review.utils.norm_doi(doi)

Normalize DOI to lowercase without URL prefixes.

Return type:

Optional[str]
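A DOI normalizer of this kind can be sketched as lowercasing followed by prefix removal. The set of prefixes handled (doi.org and dx.doi.org URLs, and a "doi:" scheme) is an assumption, and strip_doi is a hypothetical name.

```python
import re


def strip_doi(doi):
    """Hypothetical sketch: lowercase a DOI and drop URL-style prefixes."""
    if not doi:
        return None
    doi = doi.strip().lower()
    # ASSUMED prefixes: https://doi.org/, http://dx.doi.org/, and "doi:".
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)
    return doi or None


bare = strip_doi("https://doi.org/10.1103/PhysRevLett.123.456")
```

Lowercasing is safe for comparison purposes because DOIs are case-insensitive.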

living_review.utils.norm_space(s)

Collapse multiple spaces and trim a string.

Return type:

Optional[str]

living_review.utils.similar_title(a, b)

Compute fuzzy similarity ratio between two titles.

Parameters:
  • a (str) – First title to compare.

  • b (str) – Second title to compare.

Returns:

Similarity ratio in [0, 1], where 1 = identical.

Return type:

float
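A ratio with these properties can be obtained from the standard library's difflib. The algorithm behind similar_title is not named in this section, so using SequenceMatcher here is an assumption, as is the case-insensitive comparison; title_ratio is a hypothetical name.

```python
from difflib import SequenceMatcher


def title_ratio(a, b):
    """Hypothetical sketch: similarity ratio in [0, 1] between two titles."""
    # ASSUMPTION: compare case-insensitively with difflib.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


same = title_ratio("AI for Beam Dynamics", "ai for beam dynamics")
different = title_ratio("AI for Beam Dynamics", "Quantum gravity")
```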

living_review.utils.simplify_title(t)

Lowercase, strip LaTeX, punctuation, and extra spaces from title.

Return type:

Optional[str]
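The simplification can be sketched with one substitution that turns every non-alphanumeric run into a space. This is an illustration that reproduces the documented example, not the actual implementation: strip_title is a hypothetical name, LaTeX commands are reduced to their bare names (so \alpha becomes alpha), and non-ASCII letters are dropped, which the real function may handle differently.

```python
import re


def strip_title(t):
    """Hypothetical sketch: lowercase, drop markup and punctuation."""
    if not t:
        return None
    # Backslashes, braces, dollar signs, and punctuation all become spaces;
    # this keeps the name of a LaTeX command such as \alpha as "alpha".
    t = re.sub(r"[^0-9a-zA-Z]+", " ", t)
    # Lowercase and collapse repeated whitespace.
    return " ".join(t.lower().split())


simple = strip_title(r"A {LaTeX} Example: On $\alpha$-decay")
```

The raw-string prefix matters in the call: without it, Python would read \a as a control character instead of a backslash followed by "a".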

living_review.utils.within_range(d, start, end)

Check whether a date lies within a given range [start, end].

Parameters:
  • d (datetime.date) – Date to test.

  • start (datetime.date) – Start of the range.

  • end (datetime.date) – End of the range.

Returns:

True if start <= d <= end, otherwise False.

Return type:

bool

Module contents

living_review

A Python package for managing and analyzing Living Reviews, with a focus on applications in particle accelerators and machine learning.

This package provides:

  • Data model (Paper class) to represent scientific papers.

  • Fetchers for multiple bibliographic APIs (arXiv, InspireHEP, HAL, OpenAlex, Crossref).

  • Semantic filtering and classification of papers using sentence-transformers.

  • Statistics computation for bibliometrics and trends.

  • Export utilities to JSON, BibTeX, and PDF.

  • Logging of scans and errors.

  • A pipeline (LivingReviewPipeline) to orchestrate the entire workflow.

  • A CLI (living_review.cli) for running scans from the terminal.

living_review.__version__

Current version of the package.

Type:

str