living_review package
Submodules
living_review.classifier module
classifier.py
This module provides semantic filtering and classification utilities for scientific papers in the context of machine-learning applications to particle accelerator physics.
It uses a pre-trained sentence-transformers model (all-MiniLM-L6-v2) to compute embeddings for papers, accelerator/ML reference queries, and category descriptions. Papers are first filtered for relevance (accelerators ∧ ML, excluding domain noise), and then assigned to categories using semantic similarity and keyword heuristics.
Key Features
Device selection (CPU/GPU/MPS) for embedding computation
On-demand lazy loading of the semantic model
Semantic relevance filtering using accelerator/ML/noise queries
Category classification with thresholds, keyword overrides, and deduplication
Typical Usage
>>> from living_review.data_model import Paper
>>> from living_review.classifier import filter_relevant_papers, classify_papers
>>> papers = load_some_papers()
>>> relevant = filter_relevant_papers(papers)
>>> classify_papers(relevant)
- living_review.classifier.classify_papers(papers, threshold=0.25, max_cats=2)
Assign semantic categories to each paper.
Uses a combination of:
- semantic similarity with predefined category descriptions,
- special handling for review papers,
- keyword overrides (e.g. “surrogate model” → Surrogate Models).
Should be applied after filter_relevant_papers().
- Parameters:
papers (list of Paper) – Papers to classify in-place (field categories updated).
threshold (float, optional) – Minimum similarity required to assign a category (default=0.25).
max_cats (int, optional) – Maximum number of categories to keep per paper (default=2).
- Returns:
Papers are modified in place. Each .categories becomes a list of dicts with fields: {“label”: str, “score”: float}.
- Return type:
None
Notes
If no category passes the thresholds, a default Others category with score 0.0 is assigned.
Deduplication ensures the highest score per label is kept.
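The selection rules in these notes (threshold, per-label deduplication, max_cats cap, Others fallback) can be sketched as follows. This is a minimal illustration, not the module's code: select_categories is a hypothetical helper, and the labels "Optimization" and "Diagnostics" are made up for the example.

```python
from typing import Dict, List, Tuple


def select_categories(
    scored: List[Tuple[str, float]],
    threshold: float = 0.25,
    max_cats: int = 2,
) -> List[Dict[str, object]]:
    """Hypothetical helper: reduce (label, score) pairs to final categories."""
    best: Dict[str, float] = {}
    for label, score in scored:
        # Keep only scores at or above the threshold, and for repeated
        # labels keep the highest score (deduplication).
        if score >= threshold and score > best.get(label, float("-inf")):
            best[label] = score
    if not best:
        # Nothing passed the threshold: fall back to the default category.
        return [{"label": "Others", "score": 0.0}]
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [{"label": label, "score": score} for label, score in ranked[:max_cats]]


example = select_categories(
    [
        ("Surrogate Models", 0.41),
        ("Optimization", 0.30),      # illustrative label
        ("Surrogate Models", 0.35),  # duplicate label, lower score
        ("Diagnostics", 0.10),       # illustrative label, below threshold
    ]
)
fallback = select_categories([("Diagnostics", 0.05)])
```

Sorting after deduplication means the max_cats cap always keeps the strongest distinct labels.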
- living_review.classifier.device_str()
Select the most appropriate device for embedding computation.
- Returns:
“mps” if Apple Metal backend is available, “cuda” if NVIDIA GPU CUDA backend is available, otherwise “cpu”.
- Return type:
str
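A device check of this kind might look like the sketch below. It assumes PyTorch is the backend and follows the order given in the return description (MPS, then CUDA, then CPU). pick_device is a hypothetical name, and the ImportError fallback is an addition so the snippet also runs without PyTorch installed.

```python
def pick_device() -> str:
    """Hypothetical stand-in for device_str()."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch, report the CPU.
        return "cpu"
    # torch.backends.mps is absent in older PyTorch builds, hence the getattr guard.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
```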
- living_review.classifier.dual_semantic_scores(texts)
Compute semantic relevance scores of input texts with respect to accelerator physics, machine learning, and noise queries.
- Parameters:
texts (list of str) – List of textual inputs (title + abstract concatenated).
- Returns:
(scores_accel, scores_ml, scores_noise), each a list of floats aligned with the input order.
- Return type:
tuple of lists
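The three score lists can be pictured as cosine similarities between each text embedding and one embedding per reference query. The sketch below uses tiny hand-written vectors in place of MiniLM embeddings; cosine and score_against_queries are hypothetical names, and the real module may batch this with tensor operations instead.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_against_queries(
    text_vecs: List[Sequence[float]],
    accel_vec: Sequence[float],
    ml_vec: Sequence[float],
    noise_vec: Sequence[float],
) -> Tuple[List[float], List[float], List[float]]:
    """Return (scores_accel, scores_ml, scores_noise), aligned with text_vecs."""
    return (
        [cosine(t, accel_vec) for t in text_vecs],
        [cosine(t, ml_vec) for t in text_vecs],
        [cosine(t, noise_vec) for t in text_vecs],
    )


# Toy 3-dimensional "embeddings", one axis per query, for illustration only.
toy_texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
scores_accel, scores_ml, scores_noise = score_against_queries(
    toy_texts, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
)
```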
- living_review.classifier.filter_relevant_papers(papers, accel_threshold=0.13, ml_threshold=0.18)
Filter a list of papers to retain only those relevant to both accelerator physics and machine learning, while excluding noisy domains (detectors, spectroscopy, HEP analysis, etc.).
- Parameters:
papers (list of Paper) – Papers to filter. Each must expose .title and .abstract.
accel_threshold (float, optional) – Minimum cosine similarity with the accelerator query (default=0.13).
ml_threshold (float, optional) – Minimum cosine similarity with the ML query (default=0.18).
- Returns:
Subset of input papers deemed relevant.
- Return type:
list of Paper
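The two-threshold rule can be sketched on precomputed scores as below. ScoredPaper and keep_relevant are hypothetical, and the noise rule in particular is an assumption: the documentation says noisy domains are excluded but does not state the exact criterion, so this sketch drops a paper when its noise score exceeds both relevance scores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredPaper:
    """Hypothetical container: a title plus its three similarity scores."""
    title: str
    accel: float
    ml: float
    noise: float


def keep_relevant(
    papers: List[ScoredPaper],
    accel_threshold: float = 0.13,
    ml_threshold: float = 0.18,
) -> List[ScoredPaper]:
    kept = []
    for p in papers:
        # Both conditions must hold: accelerators AND machine learning.
        passes = p.accel >= accel_threshold and p.ml >= ml_threshold
        # ASSUMPTION: treat a paper as noise when the noise query matches
        # better than both relevance queries. The real rule may differ.
        noisy = p.noise > max(p.accel, p.ml)
        if passes and not noisy:
            kept.append(p)
    return kept


candidates = [
    ScoredPaper("Neural-network tuning of a linac", 0.42, 0.51, 0.10),
    ScoredPaper("Jet tagging with deep learning", 0.15, 0.55, 0.60),
    ScoredPaper("Magnet lattice design notes", 0.50, 0.05, 0.02),
]
relevant = keep_relevant(candidates)
```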
- living_review.classifier.load_sem_model()
Lazy-load the sentence transformer model used for semantic similarity.
Loads all-MiniLM-L6-v2 from HuggingFace Hub on the first call and caches it globally. Subsequent calls return the cached model.
- Returns:
The loaded MiniLM model, bound to the appropriate device.
- Return type:
SentenceTransformer
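The load-once, cache-globally behaviour described here is a common lazy-initialization pattern. The sketch below shows it with a stand-in loader so that it runs without downloading anything; load_cached and fake_loader are hypothetical names.

```python
from typing import Callable, Optional

_MODEL: Optional[object] = None  # module-level cache, empty until first use


def load_cached(loader: Callable[[], object]) -> object:
    """Call loader() on the first request only, then reuse the result."""
    global _MODEL
    if _MODEL is None:
        _MODEL = loader()
    return _MODEL


calls = []


def fake_loader() -> object:
    # Stand-in for the expensive step. In the real module this would
    # presumably be something like
    # SentenceTransformer("all-MiniLM-L6-v2", device=device_str()).
    calls.append("load")
    return {"name": "all-MiniLM-L6-v2"}


first = load_cached(fake_loader)
second = load_cached(fake_loader)
```

Because the cache is filled on first use, importing the module stays cheap and the model is only loaded when a scoring function actually needs it.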
living_review.cli module
cli.py
Command-line interface (CLI) for the Living Review pipeline.
This script allows users to run the Living Review: ML/AI for Accelerator Physics pipeline directly from the terminal. It provides options to configure the scan window, data sources, classification thresholds, and output location.
Main Features
Configure the date range (number of days back from today).
Select which data sources to query (arXiv, Inspire, HAL, OpenAlex, Crossref).
Override default accelerator/ML relevance thresholds.
Control output directory and chunk size for large runs.
Optionally enable incremental mode (not yet fully implemented).
Enable/disable PDF and BibTeX exports.
Typical Usage
Run a full scan of the last 30 days from all sources with default thresholds:
$ python -m living_review.cli
Run a 60-day scan only from arXiv and Inspire with custom thresholds:
$ python -m living_review.cli --days 60 --sources arxiv,inspire \
    --accel-threshold 0.15 --ml-threshold 0.20 --output results
Disable PDF export but keep BibTeX:
$ python -m living_review.cli --no-pdf
- living_review.cli.main()
Entry point for the Living Review CLI.
Parses command-line arguments, builds the date range and configuration, initializes the LivingReviewPipeline, and runs it.
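Argument parsing for the flags shown under Typical Usage could be sketched with argparse as below. Only the flag names come from the examples above. The defaults for --sources and --output, the help strings, and the build_parser name are assumptions, and the end date is fixed so the result is reproducible (the real CLI counts back from today).

```python
import argparse
import datetime as dt


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering the flags used in the usage examples."""
    p = argparse.ArgumentParser(prog="living_review.cli")
    p.add_argument("--days", type=int, default=30, help="days back from today")
    p.add_argument("--sources", default="arxiv,inspire,hal,openalex,crossref",
                   help="comma-separated sources (default value is an assumption)")
    p.add_argument("--accel-threshold", type=float, default=0.13)
    p.add_argument("--ml-threshold", type=float, default=0.18)
    p.add_argument("--output", default="output",
                   help="output directory (default value is an assumption)")
    p.add_argument("--no-pdf", action="store_true", help="skip the PDF export")
    return p


args = build_parser().parse_args(
    ["--days", "60", "--sources", "arxiv,inspire", "--no-pdf"]
)
end = dt.date(2025, 1, 31)  # fixed for the example instead of dt.date.today()
start = end - dt.timedelta(days=args.days)
sources = args.sources.split(",")
```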
living_review.config module
config.py
Central configuration module for the Living Review project.
This file collects all constants, keywords, category descriptions, semantic queries, and thresholds used across the pipeline. Keeping them centralized ensures consistency between different modules (fetchers, classifier, pipeline, etc.).
Contents
Accelerator / ML keywords
Negative keywords (to filter out noise domains)
Reference semantic queries (used for similarity scoring)
Category descriptions (used for classification)
Default thresholds and constants (date window, API page sizes)
Typical Usage
>>> from living_review import config
>>> config.ACCEL_KEYWORDS[:5]
['accelerator', 'linac', 'synchrotron', 'collider', 'storage ring']
- living_review.config.ARXIV_PAGE_SIZE = 100
Maximum number of results per page in arXiv API queries.
- Type:
int
- living_review.config.DATE_WINDOW_DAYS = 7
Default sliding window (in days) for fetching new papers.
- Type:
int
- living_review.config.DEFAULT_THRESHOLDS = {'accel': 0.13, 'ml': 0.18}
Default semantic similarity thresholds for relevance filtering.
- Type:
dict
living_review.data_model module
data_model.py
Data model definitions for the Living Review project.
This module defines the Paper dataclass, the central representation of a scientific paper throughout the pipeline. It ensures consistent handling of metadata, provenance, and status progression, and provides helpers for deduplication and serialization.
Contents
Paper: dataclass representing a paper with metadata, provenance, and audit trail.
status_rank: helper to order publication statuses.
_canonical_key: helper to generate fallback IDs for deduplication.
Canonical use
Every paper is represented internally as a Paper object.
Papers are serialized into the canonical JSON DB (site/data/livingreview.json) via Paper.to_dict().
Papers can be reconstructed from the DB via Paper.from_dict(), guaranteeing a stable round-trip between memory and storage.
Typical Usage
>>> from living_review.data_model import Paper
>>> raw = {"title": "AI for Beam Dynamics", "authors": ["A. Researcher"],
... "arxiv_id": "1234.5678", "source": "arxiv"}
>>> p = Paper.from_source(raw)
>>> p.key_for_dedup()
('1234.5678', '', 'ai for beam dynamics')
>>> d = p.to_dict()
>>> Paper.from_dict(d).id
'arxiv:1234.5678'
- class living_review.data_model.Paper(id, doi=None, arxiv_id=None, inspire_id=None, title='', authors=<factory>, abstract=None, date=None, year=None, venue=None, status=None, categories=<factory>, keywords=<factory>, curated=False, notes=None, links=<factory>, sources=<factory>, history=<factory>, last_updated=None)
Bases: object
Representation of a scientific paper.
- id
Canonical identifier (e.g. “doi:…”, “arxiv:…”, or “hash:…”).
- Type:
str
- doi
Digital Object Identifier if available.
- Type:
str, optional
- arxiv_id
arXiv identifier if available.
- Type:
str, optional
- inspire_id
INSPIRE identifier if available.
- Type:
str, optional
- title
Title of the paper.
- Type:
str
- authors
List of author names.
- Type:
list of str
- abstract
Abstract or summary of the paper.
- Type:
str, optional
- date
ISO date string (YYYY-MM-DD).
- Type:
str, optional
- year
Publication year.
- Type:
int, optional
- venue
Journal or conference venue.
- Type:
str, optional
- status
Publication status (pending, preprint, published…).
- Type:
str, optional
- categories
Classification categories assigned to the paper.
- Type:
list of str
- keywords
List of keywords associated with the paper.
- Type:
list of str
- curated
Whether this entry has been manually curated (protected from overwrite).
- Type:
bool
- notes
Free-text notes by curators.
- Type:
str, optional
- links
Dictionary of useful links (arXiv, DOI, PDF, publisher).
- Type:
dict
- sources
Provenance info (which fetcher, when).
- Type:
list of dict
- history
Change history (merges, status updates).
- Type:
list of dict
- last_updated
Timestamp of last update in ISO format.
- Type:
str, optional
- abstract: Optional[str] = None
- arxiv_id: Optional[str] = None
- authors: List[str]
- categories: List[str]
- curated: bool = False
- date: Optional[str] = None
- doi: Optional[str] = None
- static from_dict(d)
Reconstruct a Paper object from its dictionary representation.
Used when loading from the canonical DB (site/data/livingreview.json).
- Parameters:
d (dict) – Dictionary with Paper fields, as produced by to_dict().
- Returns:
A Paper instance with all fields populated.
- Return type:
Paper
- static from_source(raw)
Build a Paper from raw metadata (dict).
Normalizes identifiers, title, and authors, and ensures provenance. Will assign a canonical id of form:
- doi:… (if DOI available),
- arxiv:… (if arXiv available),
- hash:… (fallback hash if no DOI/arXiv).
- Parameters:
raw (dict) – Raw metadata from a fetcher.
- Returns:
A new Paper object.
- Return type:
Paper
- history: List[Dict[str, str]]
- id: str
- inspire_id: Optional[str] = None
- key_for_dedup()
Generate a key for deduplication.
- Returns:
(arxiv_id, doi, simplified_title)
- Return type:
tuple of str
- keywords: List[str]
- last_updated: Optional[str] = None
- links: Dict[str, str]
- notes: Optional[str] = None
- sources: List[Dict[str, str]]
- status: Optional[str] = None
- title: str = ''
- to_dict()
Serialize Paper to a JSON-safe dict.
This is the method used when writing to the canonical DB (site/data/livingreview.json). It ensures:
- Normalized identifiers,
- Always includes categories/keywords as lists,
- Timestamps in ISO format.
- Returns:
Dictionary representation of the Paper.
- Return type:
dict
- venue: Optional[str] = None
- year: Optional[int] = None
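The canonical id priority described for from_source (DOI first, then arXiv, then a fallback hash) can be sketched as below. canonical_id is a hypothetical helper, and hashing the lowercased title with SHA-1 truncated to 12 characters is an assumption; the documentation only says that a fallback hash is used.

```python
import hashlib
from typing import Optional


def canonical_id(doi: Optional[str], arxiv_id: Optional[str], title: str) -> str:
    """Hypothetical helper: pick the strongest available identifier."""
    if doi:
        return f"doi:{doi}"
    if arxiv_id:
        return f"arxiv:{arxiv_id}"
    # ASSUMPTION: the fallback hashes the normalized title; the real
    # _canonical_key helper may combine other fields.
    digest = hashlib.sha1(title.strip().lower().encode("utf-8")).hexdigest()[:12]
    return f"hash:{digest}"


from_arxiv = canonical_id(None, "1234.5678", "AI for Beam Dynamics")
from_doi = canonical_id("10.1103/physrevlett.123.456", "1234.5678", "AI for Beam Dynamics")
from_title = canonical_id(None, None, "AI for Beam Dynamics")
```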
- living_review.data_model.status_rank(status)
Return integer rank of a status (higher = more advanced).
- Parameters:
status (str or None) – Status string (pending, preprint, published…).
- Returns:
Position in STATUS_ORDER, or -1 if unknown.
- Return type:
int
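A rank lookup of this kind is typically a list index with a fallback. In the sketch below, STATUS_ORDER contains only the three statuses named in this documentation, which is an assumption: the real list may contain more entries. rank is a hypothetical stand-in for status_rank.

```python
from typing import Optional

# ASSUMPTION: only the statuses named in the docs, from least to most advanced.
STATUS_ORDER = ["pending", "preprint", "published"]


def rank(status: Optional[str]) -> int:
    """Position of status in STATUS_ORDER, or -1 if unknown or None."""
    try:
        return STATUS_ORDER.index(status)
    except ValueError:
        return -1


# When merging two records of the same paper, keep the more advanced status.
merged_status = max(["preprint", "published"], key=rank)
```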
living_review.exporters module
exporters.py
Output/export utilities for the Living Review project.
This module provides functions to export processed papers and statistics into formats directly consumable by the Hugo site and citation managers.
Export Targets
- JSON → site/data/livingreview.json + site/data/statistics.json
The canonical JSON database, containing papers + statistics. Used by Hugo templates, Decap CMS, and downstream visualisations.
- BibTeX → site/static/downloads/livingreview.bib
A citation file containing all papers in BibTeX format.
- PDF → site/static/downloads/livingreview.pdf
A printable PDF summary of the review.
Typical Usage
>>> from living_review.exporters import export_json, export_bibtex, export_pdf
>>> export_json(papers, stats, outdir=".")
>>> export_bibtex(papers, outdir=".")
>>> export_pdf(papers, stats, outdir=".")
- living_review.exporters.export_bibtex(papers, outdir)
Export papers into a BibTeX file for citation management.
Output file
site/static/downloads/livingreview.bib
- living_review.exporters.export_json(papers, stats, outdir, chunking=None)
Export the canonical JSON database for Hugo and Decap CMS.
Output files
site/data/livingreview.json : Full DB (stats + papers)
site/data/statistics.json : Simplified global stats
Metadata
Adds last_updated as UTC ISO timestamp.
Adds next_update based on environment variable UPDATE_INTERVAL_HOURS (default: 24h).
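The two metadata fields can be derived as in the sketch below, which reads UPDATE_INTERVAL_HOURS with a 24-hour default as described. update_metadata is a hypothetical name, and the optional now argument exists only to make the example deterministic.

```python
import datetime as dt
import os
from typing import Dict, Optional


def update_metadata(now: Optional[dt.datetime] = None) -> Dict[str, str]:
    """Hypothetical helper: build last_updated / next_update timestamps."""
    now = now or dt.datetime.now(dt.timezone.utc)
    # Interval comes from the environment, defaulting to 24 hours.
    hours = float(os.environ.get("UPDATE_INTERVAL_HOURS", "24"))
    return {
        "last_updated": now.isoformat(),
        "next_update": (now + dt.timedelta(hours=hours)).isoformat(),
    }


fixed_now = dt.datetime(2025, 1, 1, 12, 0, tzinfo=dt.timezone.utc)
meta = update_metadata(fixed_now)
```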
living_review.fetchers module
fetchers.py
Data-source fetchers for the Living Review project.
This module provides functions to query multiple bibliographic APIs and return lists of Paper objects (via Paper.from_source).
Supported sources:
- arXiv: via arxiv Python client.
- InspireHEP: via REST API.
- HAL (Hyper Articles en Ligne).
- OpenAlex.
- Crossref.
Each fetcher:
Retrieves results within a given date window.
Normalizes metadata into the canonical schema expected by Paper.
Populates links, status, and provenance (source).
A shared requests.Session with retry logic is used for robustness.
- living_review.fetchers.arxiv_query_for_window()
Build arXiv queries targeting accelerator physics and ML categories.
- Returns:
Query strings to be passed to the arxiv client.
- Return type:
list of str
- living_review.fetchers.fetch_arxiv(start, end)
Fetch papers from arXiv within the given date range.
- Parameters:
start (datetime.date) – Start date.
end (datetime.date) – End date.
- Returns:
Papers retrieved from arXiv.
- Return type:
list of Paper
- living_review.fetchers.fetch_crossref(start, end)
Fetch papers from Crossref API across PRAB, JACoW, and general accelerator+ML topics.
Combines three categories:
- PRAB (prefix:10.1103 PhysRevAccelBeams)
- JACoW / IPAC / ICALEPCS / LINAC conference papers
- Generic ‘accelerator machine learning’ search
- Parameters:
start (datetime.date)
end (datetime.date)
- Return type:
list of Paper
- living_review.fetchers.fetch_hal(start, end)
Fetch papers from HAL API (filtered to ML + accelerator physics).
- Return type:
List[Paper]
- living_review.fetchers.fetch_inspire(start, end, rows=50, max_pages=5)
Fetch papers from InspireHEP API (AI/ML applied to accelerators).
- Return type:
List[Paper]
- living_review.fetchers.fetch_openalex(start, end)
Fetch papers from OpenAlex API (60-day windows etc.), and set venue to the actual journal/conference (not the source name).
- Return type:
List[Paper]
- living_review.fetchers.fetch_pubmed(start, end, rows=50)
Fetch papers from Europe PMC (PubMed interface).
- Parameters:
start (datetime.date) – Start date.
end (datetime.date) – End date.
rows (int, optional) – Number of results per page (default=50).
- Returns:
Papers retrieved from Europe PMC / PubMed.
- Return type:
list of Paper
- living_review.fetchers.fetch_semanticscholar(start, end, limit=100)
Fetch papers from the Semantic Scholar Graph API related to machine learning and accelerator physics.
- Parameters:
start (datetime.date) – Start date.
end (datetime.date) – End date.
limit (int, optional) – Max number of results to fetch (default=100).
- Returns:
Papers retrieved from Semantic Scholar.
- Return type:
list of Paper
- living_review.fetchers.fetch_springer(start, end, rows=20)
Fetch papers from Springer Nature API (PAM v2).
- Parameters:
start (datetime.date) – Start date.
end (datetime.date) – End date.
rows (int, optional) – Number of results to retrieve (default=20).
- Returns:
Papers retrieved from Springer.
- Return type:
list of Paper
- living_review.fetchers.make_session()
Create a requests.Session with retry strategy.
Retries on server errors (500, 502, 503, 504) up to 3 times with exponential backoff.
- Returns:
Configured session with retry-enabled adapters.
- Return type:
requests.Session
living_review.logs module
logs.py
Logging utilities for the Living Review project.
This module manages:
- Persistent scan logs (scan_log.json) storing metadata about each run.
- Error logs (errors.log) with stack traces.
- Retrieval of the last scanned date range.
Contents
append_scan_log: record metadata about a scan (papers, chunks, status).
log_error: record exceptions and stack traces in a log file.
get_last_scan: retrieve the last recorded scan range.
Typical Usage
>>> from living_review import logs
>>> logs.append_scan_log("logs", start, end, npapers=42)
>>> logs.log_error("logs", Exception("Something went wrong"))
>>> last = logs.get_last_scan("logs")
- living_review.logs.append_scan_log(logdir, start, end, npapers, nchunks=1, status='ok', error_msg=None)
Append an entry to the scan log (scan_log.json).
- Parameters:
logdir (str or Path) – Directory where the log files are stored.
start (datetime.date or str) – Start date of the scan.
end (datetime.date or str) – End date of the scan.
npapers (int) – Number of papers processed.
nchunks (int, optional) – Number of chunks processed (default=1).
status (str, optional) – Status string for the run (default=”ok”).
error_msg (str, optional) – Error message if the scan encountered an issue.
- Returns:
Updates scan_log.json with a new entry.
- Return type:
None
- living_review.logs.get_last_scan(logdir)
Retrieve the last scan range from scan_log.json.
- Parameters:
logdir (str or Path) – Directory containing the scan log file.
- Returns:
Dictionary with keys {“start”: str, “end”: str} if available, otherwise None.
- Return type:
dict or None
- living_review.logs.log_error(logdir, exc)
Append an error entry with stack trace to errors.log.
- Parameters:
logdir (str or Path) – Directory where the error log is stored.
exc (Exception) – Exception object to log.
- Returns:
Writes a timestamped error entry to errors.log.
- Return type:
None
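The scan-log round trip (append an entry, then read back the last range) can be sketched as below. It assumes scan_log.json holds a JSON list of entry dicts, which the documentation does not specify; append_entry and last_range are hypothetical stand-ins for append_scan_log and get_last_scan.

```python
import datetime as dt
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional


def append_entry(logdir, start, end, npapers: int, status: str = "ok") -> None:
    """Append one scan record to <logdir>/scan_log.json (assumed: a JSON list)."""
    path = Path(logdir) / "scan_log.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    entries = json.loads(path.read_text()) if path.exists() else []
    entries.append({
        "start": str(start),  # str() of a date gives YYYY-MM-DD
        "end": str(end),
        "npapers": npapers,
        "status": status,
        "timestamp": dt.datetime.now(dt.timezone.utc).isoformat(),
    })
    path.write_text(json.dumps(entries, indent=2))


def last_range(logdir) -> Optional[Dict[str, str]]:
    """Return {"start", "end"} of the most recent entry, or None if no log."""
    path = Path(logdir) / "scan_log.json"
    if not path.exists():
        return None
    entries = json.loads(path.read_text())
    if not entries:
        return None
    return {"start": entries[-1]["start"], "end": entries[-1]["end"]}


with tempfile.TemporaryDirectory() as tmp:
    empty = last_range(tmp)
    append_entry(tmp, dt.date(2025, 1, 1), dt.date(2025, 1, 7), npapers=42)
    append_entry(tmp, dt.date(2025, 1, 8), dt.date(2025, 1, 14), npapers=5)
    last = last_range(tmp)
```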
living_review.pipeline module
pipeline.py
Main orchestration pipeline for the Living Review project.
The LivingReviewPipeline class coordinates the entire workflow:
1. Load existing DB (if present).
2. Fetch papers from multiple bibliographic sources (arXiv, InspireHEP, HAL, OpenAlex, Crossref).
3. Deduplicate & merge into canonical DB (preprint → journal, DOI, etc.).
4. Optionally ingest CMS-approved manual submissions.
5. Filter papers for semantic relevance (accelerators ∧ ML).
6. Classify papers into categories.
7. Compute statistics.
8. Export results to Hugo site in multiple formats.
Canonical DB location
The canonical JSON database is stored in the Hugo site’s /data/ folder:
site/data/livingreview.json
This file is:
- Updated by this pipeline when merging new papers,
- Exported again at the end of the run (papers + stats),
- Read by Hugo templates (.Site.Data.livingreview),
- Editable via Decap CMS.
Other outputs
BibTeX → site/static/downloads/livingreview.bib
PDF → site/static/downloads/livingreview.pdf
(Note: HTML export has been removed — Hugo now builds pages directly from the JSON DB.)
living_review.stats module
stats.py
Computation of summary statistics for the Living Review project.
This module aggregates counts of papers by year, category, venue, keyword, and monthly trends. These statistics are used for reporting and visualizations in the exported JSON/HTML outputs.
Contents
KEYWORDS: predefined list of relevant keywords to track.
compute_stats: aggregate statistics from a list of papers.
Typical Usage
>>> from living_review.stats import compute_stats
>>> stats = compute_stats(papers)
>>> stats["per_year"]
{'2024': 15, '2025': 7}
- living_review.stats.compute_stats(papers)
Compute aggregated statistics from a list of papers.
- Parameters:
papers (list of Paper) – Papers to analyze. Each must have attributes .year, .categories, .venue, .title, .abstract, and .date (string ISO or datetime).
- Returns:
Dictionary with the following keys:
- “per_year”: counts of papers per publication year.
- “per_category”: counts of papers per semantic category.
- “per_venue/journal”: counts of papers per venue/journal.
- “per_keyword”: counts of predefined keywords matched in titles/abstracts.
- “monthly_trends”: counts of papers per month (YYYY-MM).
- Return type:
dict
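Three of these aggregates can be sketched with collections.Counter as below. count_basic is a hypothetical helper covering only per_year, per_category, and monthly_trends. It accepts categories either as plain strings or as {"label", "score"} dicts, since both shapes appear in this documentation, and it assumes dates are ISO strings.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Dict, Iterable


def count_basic(papers: Iterable) -> Dict[str, Dict[str, int]]:
    """Hypothetical subset of compute_stats: year, category, and month counts."""
    per_year: Counter = Counter()
    per_category: Counter = Counter()
    monthly: Counter = Counter()
    for p in papers:
        if p.year is not None:
            per_year[str(p.year)] += 1  # string keys, as in the usage example
        for cat in p.categories:
            label = cat["label"] if isinstance(cat, dict) else cat
            per_category[label] += 1
        if p.date:
            monthly[str(p.date)[:7]] += 1  # YYYY-MM prefix of an ISO date
    return {
        "per_year": dict(per_year),
        "per_category": dict(per_category),
        "monthly_trends": dict(monthly),
    }


# SimpleNamespace objects stand in for Paper instances.
sample = [
    SimpleNamespace(year=2024, date="2024-11-03",
                    categories=[{"label": "Surrogate Models", "score": 0.4}]),
    SimpleNamespace(year=2025, date="2025-01-15", categories=["Others"]),
    SimpleNamespace(year=2025, date="2025-01-20",
                    categories=[{"label": "Surrogate Models", "score": 0.3}]),
]
stats = count_basic(sample)
```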
living_review.utils module
utils.py
Utility functions for the Living Review project.
This module provides helper functions for:
- Deduplicating papers based on unique keys.
- Normalizing identifiers (DOI, arXiv ID).
- Cleaning up LaTeX markup and titles for comparison.
- Fuzzy similarity scoring between titles.
- Checking if a date lies within a given range.
Contents
deduplicate: remove duplicate papers by (arxiv_id, doi, normalized_title).
within_range: test whether a date falls within [start, end].
norm_doi: normalize DOI strings to a canonical form.
norm_arxiv_id: normalize arXiv identifiers to a canonical form.
simplify_title: lowercase, strip LaTeX and punctuation for fuzzy matching.
first_author_key: heuristic to extract first author surname.
similar_title: fuzzy similarity score between two titles.
Typical Usage
>>> import datetime as dt
>>> from living_review.utils import norm_doi, simplify_title, within_range
>>> norm_doi("https://doi.org/10.1103/PhysRevLett.123.456")
'10.1103/physrevlett.123.456'
>>> simplify_title(r"A {LaTeX} Example: On $\alpha$-decay")
'a latex example on alpha decay'
>>> start, end = dt.date(2025, 1, 1), dt.date(2025, 1, 31)  # example range
>>> within_range(dt.date(2025, 1, 10), start, end)
True
- living_review.utils.deduplicate(papers)
Remove duplicate papers based on their deduplication key.
Each Paper must implement .key_for_dedup() which returns a tuple (arxiv_id, doi, normalized_title). Duplicates are detected when this key repeats.
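Key-based deduplication of this kind is usually a single pass with a set of seen keys. The sketch below keeps the first occurrence of each key, which is an assumption: the documentation does not say which duplicate survives. dedupe_by_key is a hypothetical helper that takes the key function as an argument instead of calling .key_for_dedup().

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def dedupe_by_key(items: Iterable[T], key: Callable[[T], Hashable]) -> List[T]:
    """Keep the first item for every distinct key, preserving input order."""
    seen = set()
    unique: List[T] = []
    for item in items:
        k = key(item)
        if k in seen:
            continue  # a repeated key marks a duplicate
        seen.add(k)
        unique.append(item)
    return unique


# Tuples shaped like key_for_dedup() output: (arxiv_id, doi, simplified_title).
records = [
    ("1234.5678", "", "ai for beam dynamics"),
    ("2401.01234", "", "surrogate models for linacs"),
    ("1234.5678", "", "ai for beam dynamics"),
]
unique_records = dedupe_by_key(records, key=lambda r: r)
```

Note that an exact-key match does not catch the same paper arriving once with only an arXiv ID and once with only a DOI; merging those cases needs additional logic.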
- living_review.utils.first_author_key(authors)
Heuristic key for first author: uses last token of first author’s name. Returns lowercase surname or None if unavailable.
- Return type:
Optional[str]
- living_review.utils.make_session()
Create a shared requests.Session with retry strategy. Retries on server errors (500, 502, 503, 504) up to 3 times with exponential backoff.
- living_review.utils.norm_arxiv_id(ax)
Normalize arXiv identifiers (remove prefix and version).
- Return type:
Optional[str]
- living_review.utils.norm_doi(doi)
Normalize DOI to lowercase without URL prefixes.
- Return type:
Optional[str]
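The two identifier normalizers (norm_arxiv_id and norm_doi) can be approximated with a pair of regular expressions, as sketched below. The set of prefixes handled (doi.org URLs, a doi: prefix, arxiv.org/abs URLs, an arXiv: prefix) is an assumption, and normalize_doi and normalize_arxiv_id are hypothetical names.

```python
import re
from typing import Optional


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lowercase a DOI and strip URL or 'doi:' prefixes (assumed prefix set)."""
    if not doi:
        return None
    d = doi.strip().lower()
    d = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", d)
    return d or None


def normalize_arxiv_id(ax: Optional[str]) -> Optional[str]:
    """Strip an 'arXiv:' or URL prefix and a trailing version suffix like v2."""
    if not ax:
        return None
    a = ax.strip()
    a = re.sub(r"^(https?://arxiv\.org/abs/|arxiv:)", "", a, flags=re.IGNORECASE)
    a = re.sub(r"v\d+$", "", a)
    return a or None


doi_example = normalize_doi("https://doi.org/10.1103/PhysRevLett.123.456")
arxiv_example = normalize_arxiv_id("arXiv:2401.01234v2")
```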
- living_review.utils.norm_space(s)
Collapse multiple spaces and trim a string.
- Return type:
Optional[str]
- living_review.utils.similar_title(a, b)
Compute fuzzy similarity ratio between two titles.
- Parameters:
a (str) – First title to compare.
b (str) – Second title to compare.
- Returns:
Similarity ratio in [0, 1], where 1 = identical.
- Return type:
float
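A ratio in [0, 1] like this one can be obtained from the standard library's difflib.SequenceMatcher, as in the sketch below. Whether the module actually uses difflib is an assumption, and title_ratio is a hypothetical name.

```python
from difflib import SequenceMatcher


def title_ratio(a: str, b: str) -> float:
    """Fuzzy similarity of two titles after trimming and lowercasing."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


same = title_ratio("AI for Beam Dynamics", "ai for beam dynamics ")
close = title_ratio("AI for Beam Dynamics", "AI for Beam Dynamic")
different = title_ratio("abc", "xyz")
```

A pipeline would typically compare such a ratio against a cutoff to decide whether a preprint and a journal record describe the same paper; the cutoff value is not stated in this documentation.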
- living_review.utils.simplify_title(t)
Lowercase, strip LaTeX, punctuation, and extra spaces from title.
- Return type:
Optional[str]
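One way to get the behaviour shown in the module's usage example is the regex sketch below. simplify is a hypothetical stand-in: it unwraps LaTeX commands to their bare names (so \alpha becomes alpha), lowercases, and turns every run of non-alphanumeric characters into a single space. Formatting commands such as \textbf would leave their name behind, which the real function may handle differently.

```python
import re
from typing import Optional


def simplify(t: Optional[str]) -> Optional[str]:
    """Hypothetical title simplifier for fuzzy matching."""
    if not t:
        return None
    s = re.sub(r"\\([A-Za-z]+)", r"\1", t)  # \alpha -> alpha
    s = s.lower()
    s = re.sub(r"[^a-z0-9]+", " ", s)       # braces, $, punctuation -> space
    return s.strip() or None


simplified = simplify(r"A {LaTeX} Example: On $\alpha$-decay")
```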
- living_review.utils.within_range(d, start, end)
Check whether a date lies within a given range [start, end].
- Parameters:
d (datetime.date) – Date to test.
start (datetime.date) – Start of the range.
end (datetime.date) – End of the range.
- Returns:
True if start <= d <= end, otherwise False.
- Return type:
bool
Module contents
living_review
A Python package for managing and analyzing Living Reviews, with a focus on applications in particle accelerators and machine learning.
This package provides:
- Data model (Paper class) to represent scientific papers.
- Fetchers for multiple bibliographic APIs (arXiv, InspireHEP, HAL, OpenAlex, Crossref).
- Semantic filtering and classification of papers using sentence-transformers.
- Statistics computation for bibliometrics and trends.
- Export utilities to JSON, BibTeX, and PDF.
- Logging of scans and errors.
- A pipeline (LivingReviewPipeline) to orchestrate the entire workflow.
- A CLI (living_review.cli) for running scans from the terminal.
- living_review.__version__
Current version of the package.
- Type:
str