
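Academic Citation Management

Comprehensive citation management for academic research, with search across Google Scholar, PubMed, and other databases

Research · Public (community) · by Community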

Citation Management

Overview

Manage citations systematically throughout the research and writing process. This skill provides tools and strategies for searching academic databases (Google Scholar, PubMed), extracting accurate metadata from multiple sources (CrossRef, PubMed, arXiv), validating citation information, and generating properly formatted BibTeX entries.

This is critical for maintaining citation accuracy, avoiding reference errors, and ensuring reproducible research. The skill integrates with the literature-review skill for comprehensive research workflows.

When to Use This Skill

Use this skill when:

  • Searching for specific papers on Google Scholar or PubMed
  • Converting DOIs, PMIDs, or arXiv IDs to properly formatted BibTeX
  • Extracting complete metadata for citations (authors, title, journal, year, etc.)
  • Validating existing citations for accuracy
  • Cleaning and formatting BibTeX files
  • Finding highly cited papers in a specific field
  • Verifying that citation information matches the actual publication
  • Building a bibliography for a manuscript or thesis
  • Checking for duplicate citations
  • Ensuring consistent citation formatting

Visual Enhancement with Scientific Schematics

When creating documents with this skill, always consider adding scientific diagrams and schematics to enhance visual communication.

If your document does not already contain schematics or diagrams:

  • Use the scientific-schematics skill to generate AI-powered publication-quality diagrams
  • Simply describe your desired diagram in natural language
  • Nano Banana Pro will automatically generate, review, and refine the schematic

For new documents: Scientific schematics should be generated by default to visually represent key concepts, workflows, architectures, or relationships described in the text.

How to generate schematics:

python scripts/generate_schematic.py "your diagram description" -o figures/output.png

The AI will automatically:

  • Create publication-quality images with proper formatting
  • Review and refine through multiple iterations
  • Ensure accessibility (colorblind-friendly, high contrast)
  • Save outputs in the figures/ directory

When to add schematics:

  • Citation workflow diagrams
  • Literature search methodology flowcharts
  • Reference management system architectures
  • Citation style decision trees
  • Database integration diagrams
  • Any complex concept that benefits from visualization

For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.


Core Workflow

Citation management follows a systematic process:

Phase 1: Paper Discovery and Search

Goal: Find relevant papers using academic search engines.

Google Scholar Search

Google Scholar provides the most comprehensive coverage across disciplines.

Basic Search:

# Search for papers on a topic
python scripts/search_google_scholar.py "CRISPR gene editing" \
  --limit 50 \
  --output results.json

# Search with year filter
python scripts/search_google_scholar.py "machine learning protein folding" \
  --year-start 2020 \
  --year-end 2024 \
  --limit 100 \
  --output ml_proteins.json

Advanced Search Strategies (see references/google_scholar_search.md):

  • Use quotation marks for exact phrases: "deep learning"
  • Search by author: author:LeCun
  • Search in title: intitle:"neural networks"
  • Exclude terms: machine learning -survey
  • Find highly cited papers using sort options
  • Filter by date ranges to get recent work

Best Practices:

  • Use specific, targeted search terms
  • Include key technical terms and acronyms
  • Filter by recent years for fast-moving fields
  • Check “Cited by” to find seminal papers
  • Export top results for further analysis

PubMed Search

PubMed specializes in biomedical and life sciences literature (35+ million citations).

Basic Search:

# Search PubMed
python scripts/search_pubmed.py "Alzheimer's disease treatment" \
  --limit 100 \
  --output alzheimers.json

# Search with MeSH terms and filters
python scripts/search_pubmed.py \
  --query '"Alzheimer Disease"[MeSH] AND "Drug Therapy"[MeSH]' \
  --date-start 2020 \
  --date-end 2024 \
  --publication-types "Clinical Trial,Review" \
  --output alzheimers_trials.json

Advanced PubMed Queries (see references/pubmed_search.md):

  • Use MeSH terms: "Diabetes Mellitus"[MeSH]
  • Field tags: "cancer"[Title], "Smith J"[Author]
  • Boolean operators: AND, OR, NOT
  • Date filters: 2020:2024[Publication Date]
  • Publication types: "Review"[Publication Type]
  • Combine with E-utilities API for automation

Best Practices:

  • Use MeSH Browser to find correct controlled vocabulary
  • Construct complex queries in PubMed Advanced Search Builder first
  • Include multiple synonyms with OR
  • Retrieve PMIDs for easy metadata extraction
  • Export to JSON or directly to BibTeX

Phase 2: Metadata Extraction

Goal: Convert paper identifiers (DOI, PMID, arXiv ID) to complete, accurate metadata.

Quick DOI to BibTeX Conversion

For single DOIs, use the quick conversion tool:

# Convert single DOI
python scripts/doi_to_bibtex.py 10.1038/s41586-021-03819-2

# Convert multiple DOIs from a file
python scripts/doi_to_bibtex.py --input dois.txt --output references.bib

# Different output formats
python scripts/doi_to_bibtex.py 10.1038/nature12345 --format json
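
One common way to turn a DOI into BibTeX is DOI content negotiation: requesting `https://doi.org/<DOI>` with an `Accept: application/x-bibtex` header. The sketch below only builds that request. It is not necessarily how `doi_to_bibtex.py` works internally, and `build_bibtex_request` is a hypothetical helper name used for illustration.

```python
import urllib.parse
import urllib.request


def build_bibtex_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a content-negotiation request for one DOI."""
    # Percent-encode unusual characters but keep the "/" between prefix and suffix.
    url = "https://doi.org/" + urllib.parse.quote(doi.strip(), safe="/")
    return urllib.request.Request(url, headers={"Accept": "application/x-bibtex"})


if __name__ == "__main__":
    request = build_bibtex_request("10.1038/s41586-021-03819-2")
    print(request.full_url)
    # Sending the request needs network access, so it is left commented out:
    # with urllib.request.urlopen(request, timeout=30) as response:
    #     print(response.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the helper easy to test offline and to wrap with rate limiting later.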

Comprehensive Metadata Extraction

For DOIs, PMIDs, arXiv IDs, or URLs:

# Extract from DOI
python scripts/extract_metadata.py --doi 10.1038/s41586-021-03819-2

# Extract from PMID
python scripts/extract_metadata.py --pmid 34265844

# Extract from arXiv ID
python scripts/extract_metadata.py --arxiv 2103.14030

# Extract from URL
python scripts/extract_metadata.py --url "https://www.nature.com/articles/s41586-021-03819-2"

# Batch extraction from file (mixed identifiers)
python scripts/extract_metadata.py --input identifiers.txt --output citations.bib

Metadata Sources (see references/metadata_extraction.md):

  1. CrossRef API: Primary source for DOIs

    • Comprehensive metadata for journal articles
    • Publisher-provided information
    • Includes authors, title, journal, volume, pages, dates
    • Free, no API key required
  2. PubMed E-utilities: Biomedical literature

    • Official NCBI metadata
    • Includes MeSH terms, abstracts
    • PMID and PMCID identifiers
    • Free, API key recommended for high volume
  3. arXiv API: Preprints in physics, math, CS, q-bio

    • Complete metadata for preprints
    • Version tracking
    • Author affiliations
    • Free, open access
  4. DataCite API: Research datasets, software, other resources

    • Metadata for non-traditional scholarly outputs
    • DOIs for datasets and code
    • Free access
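
A batch file with mixed identifiers has to be routed to the right source from the list above. The following is a minimal sketch of that routing step, assuming simple regular expressions are good enough to tell the identifier types apart. The patterns are approximations rather than official grammars, and `classify_identifier` and `SOURCE_FOR` are hypothetical names, not part of the bundled scripts.

```python
import re

# Approximate patterns; real identifiers can be more varied than this.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
ARXIV_PATTERN = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
PMID_PATTERN = re.compile(r"^\d{1,8}$")

# Which metadata source would handle each identifier type.
SOURCE_FOR = {"doi": "CrossRef API", "pmid": "PubMed E-utilities", "arxiv": "arXiv API"}


def classify_identifier(raw: str) -> str:
    """Return 'doi', 'arxiv', 'pmid', 'url', or 'unknown' for one identifier."""
    value = raw.strip()
    lowered = value.lower()
    # Accept DOIs written as resolver URLs or with a "doi:" prefix.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if lowered.startswith(prefix):
            value = value[len(prefix):]
            break
    if DOI_PATTERN.match(value):
        return "doi"
    if lowered.startswith("arxiv:"):
        value = value[len("arxiv:"):]
    if ARXIV_PATTERN.match(value):
        return "arxiv"
    if PMID_PATTERN.match(value):
        return "pmid"
    if lowered.startswith(("http://", "https://")):
        return "url"
    return "unknown"


if __name__ == "__main__":
    for item in ["10.1038/s41586-021-03819-2", "34265844", "2103.14030"]:
        kind = classify_identifier(item)
        print(item, kind, SOURCE_FOR.get(kind, "manual lookup"))
```

The DOI check runs before the PMID check because a bare number can only be a PMID, while DOIs and arXiv IDs have more distinctive shapes.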

What Gets Extracted:

  • Required fields: author, title, year
  • Journal articles: journal, volume, number, pages, DOI
  • Books: publisher, ISBN, edition
  • Conference papers: booktitle, conference location, pages
  • Preprints: repository (arXiv, bioRxiv), preprint ID
  • Additional: abstract, keywords, URL

Phase 3: BibTeX Formatting

Goal: Generate clean, properly formatted BibTeX entries.

Understanding BibTeX Entry Types

See references/bibtex_formatting.md for complete guide.

Common Entry Types:

  • @article: Journal articles (most common)
  • @book: Books
  • @inproceedings: Conference papers
  • @incollection: Book chapters
  • @phdthesis: Dissertations
  • @misc: Preprints, software, datasets

Required Fields by Type:

@article{citationkey,
  author  = {Last1, First1 and Last2, First2},
  title   = {Article Title},
  journal = {Journal Name},
  year    = {2024},
  volume  = {10},
  number  = {3},
  pages   = {123--145},
  doi     = {10.1234/example}
}

@inproceedings{citationkey,
  author    = {Last, First},
  title     = {Paper Title},
  booktitle = {Conference Name},
  year      = {2024},
  pages     = {1--10}
}

@book{citationkey,
  author    = {Last, First},
  title     = {Book Title},
  publisher = {Publisher Name},
  year      = {2024}
}
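
To illustrate how extracted metadata maps onto the layout shown above, here is a minimal sketch that renders a dictionary of fields as one BibTeX entry with aligned `=` signs. `render_bibtex` is a hypothetical helper, and it assumes values are already escaped for LaTeX.

```python
def render_bibtex(entry_type: str, key: str, fields: dict[str, str]) -> str:
    """Render one entry with aligned '=' signs and brace-delimited values."""
    if not fields:
        raise ValueError("an entry needs at least one field")
    width = max(len(name) for name in fields)
    lines = [f"@{entry_type}{{{key},"]
    items = list(fields.items())
    for index, (name, value) in enumerate(items):
        # Every field except the last is followed by a comma.
        comma = "," if index < len(items) - 1 else ""
        lines.append(f"  {name.ljust(width)} = {{{value}}}{comma}")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_bibtex("book", "citationkey", {
        "author": "Last, First",
        "title": "Book Title",
        "publisher": "Publisher Name",
        "year": "2024",
    }))
```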

Formatting and Cleaning

Use the formatter to standardize BibTeX files:

# Format and clean BibTeX file
python scripts/format_bibtex.py references.bib \
  --output formatted_references.bib

# Sort entries by citation key
python scripts/format_bibtex.py references.bib \
  --sort key \
  --output sorted_references.bib

# Sort by year (newest first)
python scripts/format_bibtex.py references.bib \
  --sort year \
  --descending \
  --output sorted_references.bib

# Remove duplicates
python scripts/format_bibtex.py references.bib \
  --deduplicate \
  --output clean_references.bib

# Validate and report issues
python scripts/format_bibtex.py references.bib \
  --validate \
  --report validation_report.txt

Formatting Operations:

  • Standardize field order
  • Consistent indentation and spacing
  • Proper capitalization in titles (protected with {})
  • Standardized author name format
  • Consistent citation key format
  • Remove unnecessary fields
  • Fix common errors (missing commas, braces)
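
Protecting capitalization means wrapping words in braces so that a bibliography style cannot lowercase them. The sketch below applies one simple heuristic: any word with a capital letter after its first character (acronyms, CamelCase names) gets braces. `protect_capitals` is a hypothetical helper, and it assumes the title contains no braces yet; the real formatter may use different rules.

```python
import re

WORD = re.compile(r"[A-Za-z0-9]+")


def protect_capitals(title: str) -> str:
    """Wrap acronyms and CamelCase words in braces, leave other words alone."""
    def wrap(match: re.Match) -> str:
        word = match.group(0)
        # A capital anywhere after the first letter signals an acronym or name.
        if any(ch.isupper() for ch in word[1:]):
            return "{" + word + "}"
        return word

    return WORD.sub(wrap, title)


if __name__ == "__main__":
    print(protect_capitals("Highly Accurate Protein Structure Prediction with AlphaFold"))
```

Proper nouns with ordinary capitalization (for example a person's name) are not caught by this heuristic and still need manual review.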

Phase 4: Citation Validation

Goal: Verify all citations are accurate and complete.

Comprehensive Validation

# Validate BibTeX file
python scripts/validate_citations.py references.bib

# Validate and fix common issues
python scripts/validate_citations.py references.bib \
  --auto-fix \
  --output validated_references.bib

# Generate detailed validation report
python scripts/validate_citations.py references.bib \
  --report validation_report.json \
  --verbose

Validation Checks (see references/citation_validation.md):

  1. DOI Verification:

    • DOI resolves correctly via doi.org
    • Metadata matches between BibTeX and CrossRef
    • No broken or invalid DOIs
  2. Required Fields:

    • All required fields present for entry type
    • No empty or missing critical information
    • Author names properly formatted
  3. Data Consistency:

    • Year is valid (4 digits, reasonable range)
    • Volume/number are numeric
    • Pages formatted correctly (e.g., 123--145)
    • URLs are accessible
  4. Duplicate Detection:

    • Same DOI used multiple times
    • Similar titles (possible duplicates)
    • Same author/year/title combinations
  5. Format Compliance:

    • Valid BibTeX syntax
    • Proper bracing and quoting
    • Citation keys are unique
    • Special characters handled correctly
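
The offline part of these checks (required fields and data consistency) can be sketched as follows. This is an illustration under stated assumptions, not the logic of `validate_citations.py`: the year range of 1500 to next year, the strict page pattern, and the name `check_entry` are all choices made for this example. The required fields follow the BibTeX formatting reference bundled with this skill.

```python
import datetime
import re

REQUIRED_FIELDS = {
    "article": ("author", "title", "journal", "year"),
    "book": ("title", "publisher", "year"),  # plus author OR editor, checked below
    "inproceedings": ("author", "title", "booktitle", "year"),
}


def check_entry(entry_type: str, fields: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems found in one entry."""
    problems = []
    for name in REQUIRED_FIELDS.get(entry_type, ()):
        if not fields.get(name, "").strip():
            problems.append(f"missing field: {name}")
    if entry_type == "book" and not (fields.get("author") or fields.get("editor")):
        problems.append("missing field: author or editor")
    year = fields.get("year", "")
    if year:
        latest = datetime.date.today().year + 1  # allow in-press items
        if not (re.fullmatch(r"\d{4}", year) and 1500 <= int(year) <= latest):
            problems.append(f"suspicious year: {year}")
    pages = fields.get("pages", "")
    # Strict on purpose: article numbers such as "e12345" would be flagged too.
    if pages and not re.fullmatch(r"\d+(--\d+)?", pages):
        problems.append(f"pages not in N or N--M form: {pages}")
    doi = fields.get("doi", "")
    if doi and not re.fullmatch(r"10\.\d{4,9}/\S+", doi):
        problems.append(f"malformed DOI: {doi}")
    return problems


if __name__ == "__main__":
    print(check_entry("article", {"author": "Smith, J", "title": "T", "year": "2023", "pages": "123-145"}))
```

Checks that need the network, such as resolving the DOI or comparing against CrossRef, are deliberately left out of this sketch.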

Validation Output:

{
  "total_entries": 150,
  "valid_entries": 145,
  "errors": [
    {
      "citation_key": "Smith2023",
      "error_type": "missing_field",
      "field": "journal",
      "severity": "high"
    },
    {
      "citation_key": "Jones2022",
      "error_type": "invalid_doi",
      "doi": "10.1234/broken",
      "severity": "high"
    }
  ],
  "warnings": [
    {
      "citation_key": "Brown2021",
      "warning_type": "possible_duplicate",
      "duplicate_of": "Brown2021a",
      "severity": "medium"
    }
  ]
}
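
A report in this shape is easy to post-process. The sketch below assumes the report has exactly the keys shown above and groups the findings so the highest-priority fixes stand out; `summarize_report` is a hypothetical helper.

```python
import json
from collections import Counter


def summarize_report(report: dict) -> dict:
    """Condense a validation report into counts and a list of keys to fix."""
    errors = report.get("errors", [])
    warnings = report.get("warnings", [])
    return {
        "invalid_entries": report["total_entries"] - report["valid_entries"],
        "errors_by_type": dict(Counter(item["error_type"] for item in errors)),
        "warnings_by_type": dict(Counter(item["warning_type"] for item in warnings)),
        "keys_to_fix": sorted({item["citation_key"] for item in errors}),
    }


if __name__ == "__main__":
    # Small inline sample; in practice the report would be read from the JSON file.
    sample = json.loads(
        '{"total_entries": 2, "valid_entries": 1,'
        ' "errors": [{"citation_key": "Smith2023", "error_type": "missing_field"}],'
        ' "warnings": []}'
    )
    print(summarize_report(sample))
```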

Phase 5: Integration with Writing Workflow

Building References for Manuscripts

Complete workflow for creating a bibliography:

# 1. Search for papers on your topic
python scripts/search_pubmed.py \
  '"CRISPR-Cas Systems"[MeSH] AND "Gene Editing"[MeSH]' \
  --date-start 2020 \
  --limit 200 \
  --output crispr_papers.json

# 2. Extract DOIs from search results and convert to BibTeX
python scripts/extract_metadata.py \
  --input crispr_papers.json \
  --output crispr_refs.bib

# 3. Add specific papers by DOI
python scripts/doi_to_bibtex.py 10.1038/nature12345 >> crispr_refs.bib
python scripts/doi_to_bibtex.py 10.1126/science.abcd1234 >> crispr_refs.bib

# 4. Format and clean the BibTeX file
python scripts/format_bibtex.py crispr_refs.bib \
  --deduplicate \
  --sort year \
  --descending \
  --output references.bib

# 5. Validate all citations
python scripts/validate_citations.py references.bib \
  --auto-fix \
  --report validation.json \
  --output final_references.bib

# 6. Review validation report and fix any remaining issues
cat validation.json

# 7. Use in your LaTeX document
# \bibliography{final_references}

Integration with Literature Review Skill

This skill complements the literature-review skill:

Literature Review Skill → Systematic search and synthesis
Citation Management Skill → Technical citation handling

Combined Workflow:

  1. Use literature-review for comprehensive multi-database search
  2. Use citation-management to extract and validate all citations
  3. Use literature-review to synthesize findings thematically
  4. Use citation-management to verify final bibliography accuracy

# After completing literature review
# Verify all citations in the review document
python scripts/validate_citations.py my_review_references.bib --report review_validation.json

# Format for specific citation style if needed
python scripts/format_bibtex.py my_review_references.bib \
  --style nature \
  --output formatted_refs.bib

Search Strategies

Google Scholar Best Practices

Finding Seminal and High-Impact Papers (CRITICAL):

Always prioritize papers based on citation count, venue quality, and author reputation:

Citation Count Thresholds:

Paper Age    Citations    Classification
0-3 years    20+          Noteworthy
0-3 years    100+         Highly Influential
3-7 years    100+         Significant
3-7 years    500+         Landmark Paper
7+ years     500+         Seminal Work
7+ years     1000+        Foundational
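
The thresholds above can be applied mechanically when triaging search results. The sketch below is one possible reading of the table: the age bands overlap at 3 and 7 years, so it assumes a paper of exactly 3 years falls in the 3-7 band and one of exactly 7 years in the 7+ band. The `Unclassified` label and the function name are hypothetical.

```python
def classify_paper(age_years: float, citations: int) -> str:
    """Map paper age and citation count to a label from the thresholds table."""
    if age_years < 3:
        tiers = [(100, "Highly Influential"), (20, "Noteworthy")]
    elif age_years < 7:
        tiers = [(500, "Landmark Paper"), (100, "Significant")]
    else:
        tiers = [(1000, "Foundational"), (500, "Seminal Work")]
    # Tiers are ordered from the highest threshold down, so the first hit wins.
    for threshold, label in tiers:
        if citations >= threshold:
            return label
    return "Unclassified"


if __name__ == "__main__":
    print(classify_paper(age_years=2, citations=150))
    print(classify_paper(age_years=10, citations=50))
```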

Venue Quality Tiers:

  • Tier 1 (Prefer): Nature, Science, Cell, NEJM, Lancet, JAMA, PNAS
  • Tier 2 (High Priority): Impact Factor >10, top conferences (NeurIPS, ICML, ICLR)
  • Tier 3 (Good): Specialized journals (IF 5-10)
  • Tier 4 (Sparingly): Lower-impact peer-reviewed venues

Author Reputation Indicators:

  • Senior researchers with h-index >40
  • Multiple publications in Tier-1 venues
  • Leadership at recognized institutions
  • Awards and editorial positions

Search Strategies for High-Impact Papers:

  • Sort by citation count (most cited first)
  • Look for review articles from Tier-1 journals for overview
  • Check “Cited by” for impact assessment and recent follow-up work
  • Use citation alerts for tracking new citations to key papers
  • Filter by top venues using source:Nature or source:Science
  • Search for papers by known field leaders using author:LastName

Advanced Operators (full list in references/google_scholar_search.md):

"exact phrase"           # Exact phrase matching
author:lastname          # Search by author
intitle:keyword          # Search in title only
source:journal           # Search specific journal
-exclude                 # Exclude terms
OR                       # Alternative terms
2020..2024              # Year range

Example Searches:

# Find recent reviews on a topic
"CRISPR" intitle:review 2023..2024

# Find papers by specific author on topic
author:Church "synthetic biology"

# Find highly cited foundational work
# (rank the results by citation count with the script's --sort-by citations option)
"deep learning" 2012..2015

# Exclude surveys and focus on methods
"protein folding" -survey -review intitle:method

PubMed Best Practices

Using MeSH Terms: MeSH (Medical Subject Headings) provides controlled vocabulary for precise searching.

  1. Find MeSH terms at https://meshb.nlm.nih.gov/search
  2. Use in queries: "Diabetes Mellitus, Type 2"[MeSH]
  3. Combine with keywords for comprehensive coverage

Field Tags:

[Title]              # Search in title only
[Title/Abstract]     # Search in title or abstract
[Author]             # Search by author name
[Journal]            # Search specific journal
[Publication Date]   # Date range
[Publication Type]   # Article type
[MeSH]              # MeSH term

Building Complex Queries:

# Clinical trials on diabetes treatment published recently
"Diabetes Mellitus, Type 2"[MeSH] AND "Drug Therapy"[MeSH] 
AND "Clinical Trial"[Publication Type] AND 2020:2024[Publication Date]

# Reviews on CRISPR in specific journal
"CRISPR-Cas Systems"[MeSH] AND "Nature"[Journal] AND "Review"[Publication Type]

# Specific author's recent work
"Smith AB"[Author] AND cancer[Title/Abstract] AND 2022:2024[Publication Date]

E-utilities for Automation: The scripts use NCBI E-utilities API for programmatic access:

  • ESearch: Search and retrieve PMIDs
  • EFetch: Retrieve full metadata
  • ESummary: Get summary information
  • ELink: Find related articles

See references/pubmed_search.md for complete API documentation.
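
As an illustration of how a structured query reaches ESearch, the sketch below composes a query string from MeSH terms, publication types, and a year range, then builds the request URL. It stops short of sending anything. The endpoint and the `db`, `term`, `retmax`, `retmode`, and `api_key` parameters are standard E-utilities parameters; the helper names and the way clauses are combined are assumptions for this example, not the behavior of `search_pubmed.py`.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(mesh_terms=(), publication_types=(), year_start=None, year_end=None) -> str:
    """Join MeSH terms, publication types, and a year range with AND."""
    clauses = [f'"{term}"[MeSH]' for term in mesh_terms]
    if publication_types:
        types = " OR ".join(f'"{ptype}"[Publication Type]' for ptype in publication_types)
        # Parenthesize alternatives so OR binds before the surrounding ANDs.
        clauses.append(f"({types})" if len(publication_types) > 1 else types)
    if year_start is not None and year_end is not None:
        clauses.append(f"{year_start}:{year_end}[Publication Date]")
    return " AND ".join(clauses)


def build_esearch_url(query: str, limit: int = 100, api_key=None) -> str:
    """Build an ESearch URL that would return up to `limit` PMIDs as JSON."""
    params = {"db": "pubmed", "term": query, "retmax": limit, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    query = build_pubmed_query(
        mesh_terms=["Diabetes Mellitus, Type 2", "Drug Therapy"],
        publication_types=["Clinical Trial"],
        year_start=2020,
        year_end=2024,
    )
    print(query)
    print(build_esearch_url(query, limit=5))
```

With `retmode=json`, the PMIDs come back under `esearchresult.idlist`, which can then be passed to EFetch for full metadata.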

Tools and Scripts

search_google_scholar.py

Search Google Scholar and export results.

Features:

  • Automated searching with rate limiting
  • Pagination support
  • Year range filtering
  • Export to JSON or BibTeX
  • Citation count information

Usage:

# Basic search
python scripts/search_google_scholar.py "quantum computing"

# Advanced search with filters
python scripts/search_google_scholar.py "quantum computing" \
  --year-start 2020 \
  --year-end 2024 \
  --limit 100 \
  --sort-by citations \
  --output quantum_papers.json

# Export directly to BibTeX
python scripts/search_google_scholar.py "machine learning" \
  --limit 50 \
  --format bibtex \
  --output ml_papers.bib

search_pubmed.py

Search PubMed using E-utilities API.

Features:

  • Complex query support (MeSH, field tags, Boolean)
  • Date range filtering
  • Publication type filtering
  • Batch retrieval with metadata
  • Export to JSON or BibTeX

Usage:

# Simple keyword search
python scripts/search_pubmed.py "CRISPR gene editing"

# Complex query with filters
python scripts/search_pubmed.py \
  --query '"CRISPR-Cas Systems"[MeSH] AND "therapeutic"[Title/Abstract]' \
  --date-start 2020-01-01 \
  --date-end 2024-12-31 \
  --publication-types "Clinical Trial,Review" \
  --limit 200 \
  --output crispr_therapeutic.json

# Export to BibTeX
python scripts/search_pubmed.py "Alzheimer's disease" \
  --limit 100 \
  --format bibtex \
  --output alzheimers.bib

extract_metadata.py

Extract complete metadata from paper identifiers.

Features:

  • Supports DOI, PMID, arXiv ID, URL
  • Queries CrossRef, PubMed, arXiv APIs
  • Handles multiple identifier types
  • Batch processing
  • Multiple output formats

Usage:

# Single DOI
python scripts/extract_metadata.py --doi 10.1038/s41586-021-03819-2

# Single PMID
python scripts/extract_metadata.py --pmid 34265844

# Single arXiv ID
python scripts/extract_metadata.py --arxiv 2103.14030

# From URL
python scripts/extract_metadata.py \
  --url "https://www.nature.com/articles/s41586-021-03819-2"

# Batch processing (file with one identifier per line)
python scripts/extract_metadata.py \
  --input paper_ids.txt \
  --output references.bib

# Different output formats
python scripts/extract_metadata.py \
  --doi 10.1038/nature12345 \
  --format json  # or bibtex, yaml

validate_citations.py

Validate BibTeX entries for accuracy and completeness.

Features:

  • DOI verification via doi.org and CrossRef
  • Required field checking
  • Duplicate detection
  • Format validation
  • Auto-fix common issues
  • Detailed reporting

Usage:

# Basic validation
python scripts/validate_citations.py references.bib

# With auto-fix
python scripts/validate_citations.py references.bib \
  --auto-fix \
  --output fixed_references.bib

# Detailed validation report
python scripts/validate_citations.py references.bib \
  --report validation_report.json \
  --verbose

# Only check DOIs
python scripts/validate_citations.py references.bib \
  --check-dois-only

format_bibtex.py

Format and clean BibTeX files.

Features:

  • Standardize formatting
  • Sort entries (by key, year, author)
  • Remove duplicates
  • Validate syntax
  • Fix common errors
  • Enforce citation key conventions

Usage:

# Basic formatting
python scripts/format_bibtex.py references.bib

# Sort by year (newest first)
python scripts/format_bibtex.py references.bib \
  --sort year \
  --descending \
  --output sorted_refs.bib

# Remove duplicates
python scripts/format_bibtex.py references.bib \
  --deduplicate \
  --output clean_refs.bib

# Complete cleanup
python scripts/format_bibtex.py references.bib \
  --deduplicate \
  --sort year \
  --validate \
  --auto-fix \
  --output final_refs.bib

doi_to_bibtex.py

Quick DOI to BibTeX conversion.

Features:

  • Fast single DOI conversion
  • Batch processing
  • Multiple output formats
  • Clipboard support

Usage:

# Single DOI
python scripts/doi_to_bibtex.py 10.1038/s41586-021-03819-2

# Multiple DOIs
python scripts/doi_to_bibtex.py \
  10.1038/nature12345 \
  10.1126/science.abc1234 \
  10.1016/j.cell.2023.01.001

# From file (one DOI per line)
python scripts/doi_to_bibtex.py --input dois.txt --output references.bib

# Copy to clipboard
python scripts/doi_to_bibtex.py 10.1038/nature12345 --clipboard

Best Practices

Search Strategy

  1. Start broad, then narrow:

    • Begin with general terms to understand the field
    • Refine with specific keywords and filters
    • Use synonyms and related terms
  2. Use multiple sources:

    • Google Scholar for comprehensive coverage
    • PubMed for biomedical focus
    • arXiv for preprints
    • Combine results for completeness
  3. Leverage citations:

    • Check “Cited by” for seminal papers
    • Review references from key papers
    • Use citation networks to discover related work
  4. Document your searches:

    • Save search queries and dates
    • Record number of results
    • Note any filters or restrictions applied

Metadata Extraction

  1. Always use DOIs when available:

    • Most reliable identifier
    • Permanent link to the publication
    • Best metadata source via CrossRef
  2. Verify extracted metadata:

    • Check author names are correct
    • Verify journal/conference names
    • Confirm publication year
    • Validate page numbers and volume
  3. Handle edge cases:

    • Preprints: Include repository and ID
    • Preprints later published: Use published version
    • Conference papers: Include conference name and location
    • Book chapters: Include book title and editors
  4. Maintain consistency:

    • Use consistent author name format
    • Standardize journal abbreviations
    • Use same DOI format (URL preferred)

BibTeX Quality

  1. Follow conventions:

    • Use meaningful citation keys (FirstAuthor2024keyword)
    • Protect capitalization in titles with {}
    • Use -- for page ranges (not a single dash)
    • Include DOI field for all modern publications
  2. Keep it clean:

    • Remove unnecessary fields
    • No redundant information
    • Consistent formatting
    • Validate syntax regularly
  3. Organize systematically:

    • Sort by year or topic
    • Group related papers
    • Use separate files for different projects
    • Merge carefully to avoid duplicates
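
The `FirstAuthor2024keyword` convention can be generated rather than typed. The sketch below takes the first author's surname, the year, and the first title word that is not a stopword. The stopword list, the ASCII folding of accented names, and the function name are all assumptions for this example; it also expects a non-empty author field in BibTeX `and`-separated form.

```python
import re
import unicodedata

STOPWORDS = {"a", "an", "the", "of", "on", "in", "for", "and", "with", "to", "is"}


def make_citation_key(author_field: str, year, title: str) -> str:
    """Build a key such as 'Vaswani2017attention' from entry metadata."""
    first_author = author_field.split(" and ")[0].strip()
    # Handle both "Last, First" and "First Last" name orders.
    if "," in first_author:
        last = first_author.split(",")[0]
    else:
        last = first_author.split()[-1]
    # Fold accents to plain ASCII and drop anything that is not a letter.
    folded = unicodedata.normalize("NFKD", last).encode("ascii", "ignore").decode("ascii")
    surname = re.sub(r"[^A-Za-z]", "", folded)
    words = re.findall(r"[A-Za-z]+", title.lower())
    keyword = next((word for word in words if word not in STOPWORDS), "")
    return f"{surname}{year}{keyword}"


if __name__ == "__main__":
    print(make_citation_key("Vaswani, Ashish and Shazeer, Noam", 2017, "Attention is All You Need"))
```

Generated keys can still collide (two papers by the same first author in one year on similar topics), so a uniqueness check remains necessary.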

Validation

  1. Validate early and often:

    • Check citations when adding them
    • Validate complete bibliography before submission
    • Re-validate after any manual edits
  2. Fix issues promptly:

    • Broken DOIs: Find correct identifier
    • Missing fields: Extract from original source
    • Duplicates: Choose best version, remove others
    • Format errors: Use auto-fix when safe
  3. Manual review for critical citations:

    • Verify key papers cited correctly
    • Check author names match publication
    • Confirm page numbers and volume
    • Ensure URLs are current
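
For the duplicate case, a simple fingerprint is often enough to surface candidates for manual review. The sketch below groups entries by DOI when one is present and by normalized title plus year otherwise. It is an illustration only: `find_duplicates` is a hypothetical helper, and it will miss a pair where only one of the two entries carries a DOI.

```python
import re
from collections import defaultdict


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation, braces, and spacing."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def find_duplicates(entries: dict[str, dict[str, str]]) -> list[list[str]]:
    """Return groups of citation keys that appear to describe the same work."""
    groups = defaultdict(list)
    for key, fields in entries.items():
        doi = fields.get("doi", "").strip().lower()
        if doi:
            fingerprint = ("doi", doi)
        else:
            fingerprint = ("title", normalize_title(fields.get("title", "")), fields.get("year", ""))
        groups[fingerprint].append(key)
    return [sorted(keys) for keys in groups.values() if len(keys) > 1]


if __name__ == "__main__":
    library = {
        "Brown2021": {"title": "A Study", "year": "2021", "doi": "10.1234/abc"},
        "Brown2021a": {"title": "A study.", "year": "2021", "doi": "10.1234/ABC"},
    }
    print(find_duplicates(library))
```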

Common Pitfalls to Avoid

  1. Single source bias: Only using Google Scholar or PubMed

    • Solution: Search multiple databases for comprehensive coverage
  2. Accepting metadata blindly: Not verifying extracted information

    • Solution: Spot-check extracted metadata against original sources
  3. Ignoring DOI errors: Broken or incorrect DOIs in bibliography

    • Solution: Run validation before final submission
  4. Inconsistent formatting: Mixed citation key styles, formatting

    • Solution: Use format_bibtex.py to standardize
  5. Duplicate entries: Same paper cited multiple times with different keys

    • Solution: Use duplicate detection in validation
  6. Missing required fields: Incomplete BibTeX entries

    • Solution: Validate and ensure all required fields present
  7. Outdated preprints: Citing preprint when published version exists

    • Solution: Check if preprints have been published, update to journal version
  8. Special character issues: Broken LaTeX compilation due to characters

    • Solution: Use proper escaping or Unicode in BibTeX
  9. No validation before submission: Submitting with citation errors

    • Solution: Always run validation as final check
  10. Manual BibTeX entry: Typing entries by hand

    • Solution: Always extract from metadata sources using scripts

Example Workflows

Example 1: Building a Bibliography for a Paper

# Step 1: Find key papers on your topic
python scripts/search_google_scholar.py "transformer neural networks" \
  --year-start 2017 \
  --limit 50 \
  --output transformers_gs.json

python scripts/search_pubmed.py "deep learning medical imaging" \
  --date-start 2020 \
  --limit 50 \
  --output medical_dl_pm.json

# Step 2: Extract metadata from search results
python scripts/extract_metadata.py \
  --input transformers_gs.json \
  --output transformers.bib

python scripts/extract_metadata.py \
  --input medical_dl_pm.json \
  --output medical.bib

# Step 3: Add specific papers you already know
python scripts/doi_to_bibtex.py 10.1038/s41586-021-03819-2 >> specific.bib
python scripts/doi_to_bibtex.py 10.1126/science.aam9317 >> specific.bib

# Step 4: Combine all BibTeX files
cat transformers.bib medical.bib specific.bib > combined.bib

# Step 5: Format and deduplicate
python scripts/format_bibtex.py combined.bib \
  --deduplicate \
  --sort year \
  --descending \
  --output formatted.bib

# Step 6: Validate
python scripts/validate_citations.py formatted.bib \
  --auto-fix \
  --report validation.json \
  --output final_references.bib

# Step 7: Review any issues
grep -A 3 '"errors"' validation.json

# Step 8: Use in LaTeX
# \bibliography{final_references}

Example 2: Converting a List of DOIs

# You have a text file with DOIs (one per line)
# dois.txt contains:
# 10.1038/s41586-021-03819-2
# 10.1126/science.aam9317
# 10.1016/j.cell.2023.01.001

# Convert all to BibTeX
python scripts/doi_to_bibtex.py --input dois.txt --output references.bib

# Validate the result
python scripts/validate_citations.py references.bib --verbose

Example 3: Cleaning an Existing BibTeX File

# You have a messy BibTeX file from various sources
# Clean it up systematically

# Step 1: Format and standardize
python scripts/format_bibtex.py messy_references.bib \
  --output step1_formatted.bib

# Step 2: Remove duplicates
python scripts/format_bibtex.py step1_formatted.bib \
  --deduplicate \
  --output step2_deduplicated.bib

# Step 3: Validate and auto-fix
python scripts/validate_citations.py step2_deduplicated.bib \
  --auto-fix \
  --output step3_validated.bib

# Step 4: Sort by year
python scripts/format_bibtex.py step3_validated.bib \
  --sort year \
  --descending \
  --output clean_references.bib

# Step 5: Final validation report
python scripts/validate_citations.py clean_references.bib \
  --report final_validation.json \
  --verbose

# Review report
cat final_validation.json

Example 4: Finding and Citing Seminal Papers

# Find highly cited papers on a topic
python scripts/search_google_scholar.py "AlphaFold protein structure" \
  --year-start 2020 \
  --year-end 2024 \
  --sort-by citations \
  --limit 20 \
  --output alphafold_seminal.json

# Extract the top 10 by citation count
# (script will have included citation counts in JSON)

# Convert to BibTeX
python scripts/extract_metadata.py \
  --input alphafold_seminal.json \
  --output alphafold_refs.bib

# The BibTeX file now contains the most influential papers
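
The "top 10 by citation count" step is not spelled out above. A minimal sketch of it follows, assuming the search output is a JSON list of objects with an integer `citations` field. That schema is a guess for illustration; check the actual output of `search_google_scholar.py` before relying on it.

```python
def top_cited(results: list[dict], n: int = 10) -> list[dict]:
    """Return the n results with the highest citation counts."""
    # Results without a citation count sort last because they default to 0.
    return sorted(results, key=lambda paper: paper.get("citations", 0), reverse=True)[:n]


if __name__ == "__main__":
    papers = [
        {"title": "Paper A", "citations": 12},
        {"title": "Paper B", "citations": 340},
        {"title": "Paper C"},
    ]
    for paper in top_cited(papers, n=2):
        print(paper["title"])
```

The filtered list could then be written back to a JSON file and passed to `extract_metadata.py` in place of the full result set.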

Integration with Other Skills

Literature Review Skill

Citation Management provides the technical infrastructure for Literature Review:

  • Literature Review: Multi-database systematic search and synthesis
  • Citation Management: Metadata extraction and validation

Combined workflow:

  1. Use literature-review for systematic search methodology
  2. Use citation-management to extract and validate citations
  3. Use literature-review to synthesize findings
  4. Use citation-management to ensure bibliography accuracy

Scientific Writing Skill

Citation Management ensures accurate references for Scientific Writing:

  • Export validated BibTeX for use in LaTeX manuscripts
  • Verify citations match publication standards
  • Format references according to journal requirements

Venue Templates Skill

Citation Management works with Venue Templates for submission-ready manuscripts:

  • Different venues require different citation styles
  • Generate properly formatted references
  • Validate citations meet venue requirements

Resources

Bundled Resources

References (in references/):

  • google_scholar_search.md: Complete Google Scholar search guide
  • pubmed_search.md: PubMed and E-utilities API documentation
  • metadata_extraction.md: Metadata sources and field requirements
  • citation_validation.md: Validation criteria and quality checks
  • bibtex_formatting.md: BibTeX entry types and formatting rules

Scripts (in scripts/):

  • search_google_scholar.py: Google Scholar search automation
  • search_pubmed.py: PubMed E-utilities API client
  • extract_metadata.py: Universal metadata extractor
  • validate_citations.py: Citation validation and verification
  • format_bibtex.py: BibTeX formatter and cleaner
  • doi_to_bibtex.py: Quick DOI to BibTeX converter

Assets (in assets/):

  • bibtex_template.bib: Example BibTeX entries for all types
  • citation_checklist.md: Quality assurance checklist

Dependencies

Required Python Packages

# Core dependencies
pip install requests  # HTTP requests for APIs
pip install bibtexparser  # BibTeX parsing and formatting
pip install biopython  # PubMed E-utilities access

# Optional (for Google Scholar)
pip install scholarly  # Unofficial Google Scholar scraping library
# or
pip install selenium  # For more robust Scholar scraping

Optional Tools

# For advanced validation
pip install crossref-commons  # Enhanced CrossRef API access
pip install pylatexenc  # LaTeX special character handling

Summary

The citation-management skill provides:

  1. Comprehensive search capabilities for Google Scholar and PubMed
  2. Automated metadata extraction from DOI, PMID, arXiv ID, URLs
  3. Citation validation with DOI verification and completeness checking
  4. BibTeX formatting with standardization and cleaning tools
  5. Quality assurance through validation and reporting
  6. Integration with scientific writing workflow
  7. Reproducibility through documented search and extraction methods

Use this skill to maintain accurate, complete citations throughout your research and ensure publication-ready bibliographies.


Reference: Bibtex_Formatting

BibTeX Formatting Guide

Comprehensive guide to BibTeX entry types, required fields, formatting conventions, and best practices.

Overview

BibTeX is the standard bibliography format for LaTeX documents. Proper formatting ensures:

  • Correct citation rendering
  • Consistent formatting
  • Compatibility with citation styles
  • No compilation errors

This guide covers all common entry types and formatting rules.

Entry Types

@article - Journal Articles

Most common entry type for peer-reviewed journal articles.

Required fields:

  • author: Author names
  • title: Article title
  • journal: Journal name
  • year: Publication year

Optional fields:

  • volume: Volume number
  • number: Issue number
  • pages: Page range
  • month: Publication month
  • doi: Digital Object Identifier
  • url: URL
  • note: Additional notes

Template:

@article{CitationKey2024,
  author  = {Last1, First1 and Last2, First2},
  title   = {Article Title Here},
  journal = {Journal Name},
  year    = {2024},
  volume  = {10},
  number  = {3},
  pages   = {123--145},
  doi     = {10.1234/journal.2024.123456},
  month   = jan
}

Example:

@article{Jumper2021,
  author  = {Jumper, John and Evans, Richard and Pritzel, Alexander and others},
  title   = {Highly Accurate Protein Structure Prediction with {AlphaFold}},
  journal = {Nature},
  year    = {2021},
  volume  = {596},
  number  = {7873},
  pages   = {583--589},
  doi     = {10.1038/s41586-021-03819-2}
}

@book - Books

For entire books.

Required fields:

  • author OR editor: Author(s) or editor(s)
  • title: Book title
  • publisher: Publisher name
  • year: Publication year

Optional fields:

  • volume: Volume number (if multi-volume)
  • series: Series name
  • address: Publisher location
  • edition: Edition number
  • isbn: ISBN
  • url: URL

Template:

@book{CitationKey2024,
  author    = {Last, First},
  title     = {Book Title},
  publisher = {Publisher Name},
  year      = {2024},
  edition   = {3},
  address   = {City, Country},
  isbn      = {978-0-123-45678-9}
}

Example:

@book{Kumar2021,
  author    = {Kumar, Vinay and Abbas, Abul K. and Aster, Jon C.},
  title     = {Robbins and Cotran Pathologic Basis of Disease},
  publisher = {Elsevier},
  year      = {2021},
  edition   = {10},
  address   = {Philadelphia, PA},
  isbn      = {978-0-323-53113-9}
}

@inproceedings - Conference Papers

For papers in conference proceedings.

Required fields:

  • author: Author names
  • title: Paper title
  • booktitle: Conference/proceedings name
  • year: Year

Optional fields:

  • editor: Proceedings editor(s)
  • volume: Volume number
  • series: Series name
  • pages: Page range
  • address: Conference location
  • month: Conference month
  • organization: Organizing body
  • publisher: Publisher
  • doi: DOI

Template:

@inproceedings{CitationKey2024,
  author    = {Last, First},
  title     = {Paper Title},
  booktitle = {Proceedings of Conference Name},
  year      = {2024},
  pages     = {123--145},
  address   = {City, Country},
  month     = jun
}

Example:

@inproceedings{Vaswani2017,
  author    = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others},
  title     = {Attention is All You Need},
  booktitle = {Advances in Neural Information Processing Systems 30 (NeurIPS 2017)},
  year      = {2017},
  pages     = {5998--6008},
  address   = {Long Beach, CA}
}

Note: @conference is an alias for @inproceedings.

@incollection - Book Chapters

For chapters in edited books.

Required fields:

  • author: Chapter author(s)
  • title: Chapter title
  • booktitle: Book title
  • publisher: Publisher name
  • year: Publication year

Optional fields:

  • editor: Book editor(s)
  • volume: Volume number
  • series: Series name
  • type: Type of section (e.g., “chapter”)
  • chapter: Chapter number
  • pages: Page range
  • address: Publisher location
  • edition: Edition
  • month: Month

Template:

@incollection{CitationKey2024,
  author    = {Last, First},
  title     = {Chapter Title},
  booktitle = {Book Title},
  editor    = {Editor, Last and Editor2, Last},
  publisher = {Publisher Name},
  year      = {2024},
  pages     = {123--145},
  chapter   = {5}
}

Example:

@incollection{Brown2020,
  author    = {Brown, Peter O. and Botstein, David},
  title     = {Exploring the New World of the Genome with {DNA} Microarrays},
  booktitle = {DNA Microarrays: A Molecular Cloning Manual},
  editor    = {Eisen, Michael B. and Brown, Patrick O.},
  publisher = {Cold Spring Harbor Laboratory Press},
  year      = {2020},
  pages     = {1--45},
  address   = {Cold Spring Harbor, NY}
}

@phdthesis - Doctoral Dissertations

For PhD dissertations and theses.

Required fields:

  • author: Author name
  • title: Thesis title
  • school: Institution
  • year: Year

Optional fields:

  • type: Type (e.g., “PhD dissertation”, “PhD thesis”)
  • address: Institution location
  • month: Month
  • url: URL
  • note: Additional notes

Template:

@phdthesis{CitationKey2024,
  author  = {Last, First},
  title   = {Dissertation Title},
  school  = {University Name},
  year    = {2024},
  type    = {{PhD} dissertation},
  address = {City, State}
}

Example:

@phdthesis{Johnson2023,
  author  = {Johnson, Mary L.},
  title   = {Novel Approaches to Cancer Immunotherapy Using {CRISPR} Technology},
  school  = {Stanford University},
  year    = {2023},
  type    = {{PhD} dissertation},
  address = {Stanford, CA}
}

Note: @mastersthesis is similar but for Master’s theses.

@mastersthesis - Master’s Theses

For Master’s theses.

Required fields:

  • author: Author name
  • title: Thesis title
  • school: Institution
  • year: Year

Template:

@mastersthesis{CitationKey2024,
  author = {Last, First},
  title  = {Thesis Title},
  school = {University Name},
  year   = {2024}
}

@misc - Miscellaneous

For items that don’t fit other categories (preprints, datasets, software, websites, etc.).

Required fields:

  • author (if known)
  • title
  • year

Optional fields:

  • howpublished: Repository, website, format
  • url: URL
  • doi: DOI
  • note: Additional information
  • month: Month

Template for preprints:

@misc{CitationKey2024,
  author       = {Last, First},
  title        = {Preprint Title},
  year         = {2024},
  howpublished = {bioRxiv},
  doi          = {10.1101/2024.01.01.123456},
  note         = {Preprint}
}

Template for datasets:

@misc{DatasetName2024,
  author       = {Last, First},
  title        = {Dataset Title},
  year         = {2024},
  howpublished = {Zenodo},
  doi          = {10.5281/zenodo.123456},
  note         = {Version 1.2}
}

Template for software:

@misc{SoftwareName2024,
  author       = {Last, First},
  title        = {Software Name},
  year         = {2024},
  howpublished = {GitHub},
  url          = {https://github.com/user/repo},
  note         = {Version 2.0}
}

@techreport - Technical Reports

For technical reports.

Required fields:

  • author: Author name(s)
  • title: Report title
  • institution: Institution
  • year: Year

Optional fields:

  • type: Type of report
  • number: Report number
  • address: Institution location
  • month: Month

Template:

@techreport{CitationKey2024,
  author      = {Last, First},
  title       = {Report Title},
  institution = {Institution Name},
  year        = {2024},
  type        = {Technical Report},
  number      = {TR-2024-01}
}

@unpublished - Unpublished Work

For unpublished works (not preprints - use @misc for those).

Required fields:

  • author: Author name(s)
  • title: Work title
  • note: Description

Optional fields:

  • month: Month
  • year: Year

Template:

@unpublished{CitationKey2024,
  author = {Last, First},
  title  = {Work Title},
  note   = {Unpublished manuscript},
  year   = {2024}
}

@online/@electronic - Online Resources

For web pages and online-only content.

Note: These entry types are not part of standard BibTeX; they are provided by biblatex and some other bibliography packages.

Required fields:

  • author OR organization
  • title
  • url
  • year

Template:

@online{CitationKey2024,
  author = {{Organization Name}},
  title  = {Page Title},
  url    = {https://example.com/page},
  year   = {2024},
  note   = {Accessed: 2024-01-15}
}

Formatting Rules

Citation Keys

Convention: FirstAuthorYEARkeyword

Examples:

Smith2024protein
Doe2023machine
JohnsonWilliams2024cancer  % Multiple authors, no space
NatureEditorial2024        % No author, use publication
WHO2024guidelines          % Organization author

Rules:

  • Alphanumeric plus: -, _, ., :
  • No spaces
  • Case-sensitive
  • Unique within file
  • Descriptive

Avoid:

  • Special characters: @, #, &, %, $
  • Spaces: use CamelCase or underscores
  • Starting with numbers: 2024Smith (some systems disallow)
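
The key rules above can be sketched in Python. This is a minimal illustration, not the skill's own implementation: the helper names `is_valid_key` and `make_citation_key` are hypothetical, the stop-word list is an arbitrary choice, and the check takes the strict reading that a key must start with a letter.

```python
import re
import unicodedata

# Allowed by the rules above: letters, digits, and - _ . :
# Assumption: keys must start with a letter (the strict reading).
_KEY_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_.:-]*$")


def is_valid_key(key: str) -> bool:
    """Return True if the key uses only allowed characters."""
    return bool(_KEY_PATTERN.match(key))


def make_citation_key(last_name: str, year: int, title: str) -> str:
    """Build a FirstAuthorYEARkeyword key (hypothetical helper)."""
    # Strip accents so that an accented surname becomes plain ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", last_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    name = re.sub(r"[^A-Za-z]", "", ascii_name)
    # Keyword: first title word that is not a common stop word.
    stop_words = {"a", "an", "the", "of", "on", "in", "for", "and", "to", "with"}
    words = re.findall(r"[A-Za-z]+", title.lower())
    keyword = next((w for w in words if w not in stop_words), "")
    return f"{name}{year}{keyword}"
```

For example, `make_citation_key("Smith", 2024, "Protein Folding at Scale")` gives `Smith2024protein`. Uniqueness within a file still has to be checked separately.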

Author Names

Recommended format: Last, First Middle

Single author:

author = {Smith, John}
author = {Smith, John A.}
author = {Smith, John Andrew}

Multiple authors - separate with and:

author = {Smith, John and Doe, Jane}
author = {Smith, John A. and Doe, Jane M. and Johnson, Mary L.}

Many authors (10+):

author = {Smith, John and Doe, Jane and Johnson, Mary and others}

Special cases:

% Suffix (Jr., III, etc.)
author = {King, Jr., Martin Luther}

% Organization as author
author = {{World Health Organization}}
% Note: Double braces keep the name as a single entity

% Multiple surnames
author = {Garc{\'i}a-Mart{\'i}nez, Jos{\'e}}

% Particles (van, von, de, etc.)
author = {van der Waals, Johannes}
author = {de Broglie, Louis}

Wrong formats (don’t use):

author = {Smith, J.; Doe, J.}  % Semicolons (wrong)
author = {Smith, J., Doe, J.}  % Commas (wrong)
author = {Smith, J. & Doe, J.} % Ampersand (wrong)
author = {Smith J}             % No comma

Title Capitalization

Protect capitalization with braces:

% Proper nouns, acronyms, formulas
title = {{AlphaFold}: Protein Structure Prediction}
title = {Machine Learning for {DNA} Sequencing}
title = {The {Ising} Model in Statistical Physics}
title = {{CRISPR-Cas9} Gene Editing Technology}

Reason: Citation styles may change capitalization; braces protect the enclosed text from those changes.

Examples:

% Good
title = {Advances in {COVID-19} Treatment}
title = {Using {Python} for Data Analysis}
title = {The {AlphaFold} Protein Structure Database}

% May be lowercased by sentence-case styles
title = {Advances in COVID-19 Treatment}  % covid-19
title = {Using Python for Data Analysis}  % python

Whole title protection (rarely needed):

title = {{This Entire Title Keeps Its Capitalization}}
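
Brace protection can be partly automated. The sketch below is a heuristic under stated assumptions, not the behavior of `format_bibtex.py`: the function name `protect_capitals` is hypothetical, and it only wraps words that contain an uppercase letter after their first character (acronyms and CamelCase names). Ordinary proper nouns such as "Python" or "Ising" cannot be detected this way and still need manual braces.

```python
import re

# Match either an existing {braced} group or a plain word (hyphens allowed).
_TOKEN = re.compile(r"\{[^{}]*\}|[A-Za-z0-9][A-Za-z0-9-]*")


def protect_capitals(title: str) -> str:
    """Wrap acronyms and CamelCase words in braces (heuristic)."""

    def wrap(match):
        token = match.group(0)
        if token.startswith("{"):
            return token  # already protected, leave untouched
        # An uppercase letter after the first character suggests an
        # acronym (DNA) or a mixed-case name (AlphaFold).
        if any(ch.isupper() for ch in token[1:]):
            return "{" + token + "}"
        return token

    return _TOKEN.sub(wrap, title)
```

Review the result by hand; the heuristic errs on the side of leaving words unprotected.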

Page Ranges

Use en-dash (double hyphen --):

pages = {123--145}     % Correct
pages = {1234--1256}   % Correct
pages = {e0123456}     % Article ID (PLOS, etc.)
pages = {123}          % Single page

Wrong:

pages = {123-145}      % Single hyphen (don't use)
pages = {pp. 123-145}  % "pp." not needed
pages = {123–145}      % Unicode en-dash (may cause issues)
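
The page-range rules lend themselves to a small normalizer. This is a sketch with a hypothetical function name (`normalize_pages`); it assumes any hyphen or dash in the field separates a start page from an end page.

```python
import re


def normalize_pages(pages: str) -> str:
    """Normalize a BibTeX pages value to the 123--145 form."""
    text = pages.strip()
    # Drop a leading "p." or "pp." label.
    text = re.sub(r"^pp?\.\s*", "", text, flags=re.IGNORECASE)
    # Replace any run of hyphens, or a Unicode en/em dash, with "--".
    text = re.sub(r"\s*(?:-+|[\u2013\u2014])\s*", "--", text)
    return text
```

Article IDs such as `e0123456` and single pages contain no dash and pass through unchanged.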

Month Names

Use three-letter abbreviations (unquoted):

month = jan
month = feb
month = mar
month = apr
month = may
month = jun
month = jul
month = aug
month = sep
month = oct
month = nov
month = dec

Or numeric:

month = {1}   % January
month = {12}  % December

Or full name in braces:

month = {January}

The three-letter abbreviations work without braces or quotes because the standard bibliography styles define them as macros.

Journal Names

Full name (not abbreviated):

journal = {Nature}
journal = {Science}
journal = {Cell}
journal = {Proceedings of the National Academy of Sciences}
journal = {Journal of the American Chemical Society}

Some bibliography styles and tools can abbreviate journal names if needed.

Avoid manual abbreviation:

% Don't do this in BibTeX file
journal = {Proc. Natl. Acad. Sci. U.S.A.}

% Do this instead
journal = {Proceedings of the National Academy of Sciences}

Exception: If style requires abbreviations, use full abbreviated form:

journal = {Proc. Natl. Acad. Sci. U.S.A.}  % If required by style

DOI Formatting

Bare DOI (preferred):

doi = {10.1038/s41586-021-03819-2}

Not:

doi = {https://doi.org/10.1038/s41586-021-03819-2}  % Don't include URL
doi = {doi:10.1038/s41586-021-03819-2}              % Don't include prefix

Bibliography styles that support the doi field will format it as a link automatically.

Note: No period after DOI field!
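
A sketch of how a DOI value might be reduced to the bare form. The function names are hypothetical, and the shape check uses a commonly cited pattern for modern DOIs (`10.` followed by a 4 to 9 digit registrant code), which is an assumption and will not match every historical DOI.

```python
import re

# Resolver URL prefixes and the "doi:" label that should not be stored.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
# Assumed shape of a modern DOI: 10.<registrant>/<suffix>
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw: str) -> str:
    """Strip URL and 'doi:' prefixes plus any trailing period."""
    doi = _DOI_PREFIX.sub("", raw.strip())
    return doi.rstrip(".")


def looks_like_doi(doi: str) -> bool:
    """Cheap format check; it does not prove that the DOI resolves."""
    return bool(_DOI_SHAPE.match(doi))
```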

URL Formatting

url = {https://www.example.com/article}

Use:

  • When DOI not available
  • For web pages
  • For supplementary materials

Don’t duplicate:

% Don't include both if DOI URL is same as url
doi = {10.1038/nature12345}
url = {https://doi.org/10.1038/nature12345}  % Redundant!

Special Characters

Accents and diacritics:

author = {M{\"u}ller, Hans}        % ü
author = {Garc{\'i}a, Jos{\'e}}    % í, é
author = {Erd{\H{o}}s, Paul}       % ő
author = {Schr{\"o}dinger, Erwin}  % ö

Or use UTF-8 (with proper LaTeX setup):

author = {Müller, Hans}
author = {García, José}

Mathematical symbols:

title = {The $\alpha$-helix Structure}
title = {$\beta$-sheet Prediction}

Chemical formulas:

title = {H$_2$O Molecular Dynamics}
% Or with the mhchem package:
title = {\ce{H2O} Molecular Dynamics}

Field Order

Recommended order (for readability):

@article{Key,
  author  = {},
  title   = {},
  journal = {},
  year    = {},
  volume  = {},
  number  = {},
  pages   = {},
  doi     = {},
  url     = {},
  note    = {}
}

Rules:

  • Most important fields first
  • Consistent across entries
  • Use formatter to standardize

Best Practices

1. Consistent Formatting

Use same format throughout:

  • Author name format
  • Title capitalization
  • Journal names
  • Citation key style

2. Required Fields

Always include:

  • All required fields for entry type
  • DOI for modern papers (2000+)
  • Volume and pages for articles
  • Publisher for books

3. Protect Capitalization

Use braces for:

  • Proper nouns: {AlphaFold}
  • Acronyms: {DNA}, {CRISPR}
  • Formulas: {H2O}
  • Names: {Python}, {R}

4. Complete Author Lists

Include all authors when possible:

  • All authors if <10
  • Use “and others” for 10+
  • Don’t abbreviate to “et al.” manually

5. Use Standard Entry Types

Choose correct entry type:

  • Journal article → @article
  • Book → @book
  • Conference paper → @inproceedings
  • Preprint → @misc

6. Validate Syntax

Check for:

  • Balanced braces
  • Commas after fields
  • Unique citation keys
  • Valid entry types

7. Use Formatters

Use automated tools:

python scripts/format_bibtex.py references.bib

Benefits:

  • Consistent formatting
  • Catch syntax errors
  • Standardize field order
  • Fix common issues

Common Mistakes

1. Wrong Author Separator

Wrong:

author = {Smith, J.; Doe, J.}    % Semicolon
author = {Smith, J., Doe, J.}    % Comma
author = {Smith, J. & Doe, J.}   % Ampersand

Correct:

author = {Smith, John and Doe, Jane}

2. Missing Commas

Wrong:

@article{Smith2024,
  author = {Smith, John}    % Missing comma!
  title = {Title}
}

Correct:

@article{Smith2024,
  author = {Smith, John},   % Comma after each field
  title = {Title}
}

3. Unprotected Capitalization

Wrong:

title = {Machine Learning with Python}
% "Python" may become "python" in sentence-case styles

Correct:

title = {Machine Learning with {Python}}

4. Single Hyphen in Pages

Wrong:

pages = {123-145}   % Single hyphen

Correct:

pages = {123--145}  % Double hyphen (en-dash)

5. Redundant “pp.” in Pages

Wrong:

pages = {pp. 123--145}

Correct:

pages = {123--145}

6. DOI with URL Prefix

Wrong:

doi = {https://doi.org/10.1038/nature12345}
doi = {doi:10.1038/nature12345}

Correct:

doi = {10.1038/nature12345}

Example Complete Bibliography

% Journal article
@article{Jumper2021,
  author  = {Jumper, John and Evans, Richard and Pritzel, Alexander and others},
  title   = {Highly Accurate Protein Structure Prediction with {AlphaFold}},
  journal = {Nature},
  year    = {2021},
  volume  = {596},
  number  = {7873},
  pages   = {583--589},
  doi     = {10.1038/s41586-021-03819-2}
}

% Book
@book{Kumar2021,
  author    = {Kumar, Vinay and Abbas, Abul K. and Aster, Jon C.},
  title     = {Robbins and Cotran Pathologic Basis of Disease},
  publisher = {Elsevier},
  year      = {2021},
  edition   = {10},
  address   = {Philadelphia, PA},
  isbn      = {978-0-323-53113-9}
}

% Conference paper
@inproceedings{Vaswani2017,
  author    = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others},
  title     = {Attention is All You Need},
  booktitle = {Advances in Neural Information Processing Systems 30 (NeurIPS 2017)},
  year      = {2017},
  pages     = {5998--6008}
}

% Book chapter
@incollection{Brown2020,
  author    = {Brown, Peter O. and Botstein, David},
  title     = {Exploring the New World of the Genome with {DNA} Microarrays},
  booktitle = {DNA Microarrays: A Molecular Cloning Manual},
  editor    = {Eisen, Michael B. and Brown, Patrick O.},
  publisher = {Cold Spring Harbor Laboratory Press},
  year      = {2020},
  pages     = {1--45}
}

% PhD thesis
@phdthesis{Johnson2023,
  author  = {Johnson, Mary L.},
  title   = {Novel Approaches to Cancer Immunotherapy},
  school  = {Stanford University},
  year    = {2023},
  type    = {{PhD} dissertation}
}

% Preprint
@misc{Zhang2024,
  author       = {Zhang, Yi and Chen, Li and Wang, Hui},
  title        = {Novel Therapeutic Targets in {Alzheimer}'s Disease},
  year         = {2024},
  howpublished = {bioRxiv},
  doi          = {10.1101/2024.01.001},
  note         = {Preprint}
}

% Dataset
@misc{AlphaFoldDB2021,
  author       = {{DeepMind} and {EMBL-EBI}},
  title        = {{AlphaFold} Protein Structure Database},
  year         = {2021},
  howpublished = {Database},
  url          = {https://alphafold.ebi.ac.uk/},
  doi          = {10.1093/nar/gkab1061}
}

Summary

BibTeX formatting essentials:

  • Choose correct entry type (@article, @book, etc.)
  • Include all required fields
  • Use and for multiple authors
  • Protect capitalization with braces
  • Use -- for page ranges
  • Include DOI for modern papers
  • Validate syntax before compilation

Use formatting tools to ensure consistency:

python scripts/format_bibtex.py references.bib

Properly formatted BibTeX ensures correct, consistent citations across all bibliography styles!


Reference: Citation_Validation

Citation Validation Guide

Comprehensive guide to validating citation accuracy, completeness, and formatting in BibTeX files.

Overview

Citation validation ensures:

  • All citations are accurate and complete
  • DOIs resolve correctly
  • Required fields are present
  • No duplicate entries
  • Proper formatting and syntax
  • Links are accessible

Validation should be performed:

  • After extracting metadata
  • Before manuscript submission
  • After manual edits to BibTeX files
  • Periodically for maintained bibliographies

Validation Categories

1. DOI Verification

Purpose: Ensure DOIs are valid and resolve correctly.

What to Check

DOI format:

Valid:   10.1038/s41586-021-03819-2
Valid:   10.1126/science.aam9317
Invalid: 10.1038/invalid
Invalid: doi:10.1038/... (should omit "doi:" prefix in BibTeX)

DOI resolution:

  • DOI should resolve via https://doi.org/
  • Should redirect to actual article
  • Should not return 404 or error

Metadata consistency:

  • CrossRef metadata should match BibTeX
  • Author names should align
  • Title should match
  • Year should match

How to Validate

Manual check:

  1. Copy DOI from BibTeX
  2. Visit https://doi.org/ followed by the DOI (e.g., https://doi.org/10.1038/nature12345)
  3. Verify it redirects to correct article
  4. Check metadata matches

Automated check (recommended):

python scripts/validate_citations.py references.bib --check-dois

Process:

  1. Extract all DOIs from BibTeX file
  2. Query doi.org resolver for each
  3. Query CrossRef API for metadata
  4. Compare metadata with BibTeX entry
  5. Report discrepancies
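
Steps 3 to 5 of this process could look roughly like the sketch below. It is not the code of `validate_citations.py`. It assumes the public CrossRef REST endpoint `https://api.crossref.org/works/<doi>` and its `title`, `issued.date-parts`, and `author[].family` fields, and it assumes a BibTeX entry is available as a flat dict of strings (a hypothetical shape). Only the title, year, and first author's surname are compared.

```python
import json
import urllib.parse
import urllib.request


def fetch_crossref(doi: str, timeout: float = 10.0) -> dict:
    """Fetch the CrossRef record for a DOI (needs network access)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="/")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)["message"]


def compare_with_crossref(entry: dict, message: dict) -> list:
    """List discrepancies between a BibTeX entry and a CrossRef record."""
    problems = []

    # CrossRef returns the title as a list of strings.
    cr_title = (message.get("title") or [""])[0]
    bib_title = entry.get("title", "").replace("{", "").replace("}", "")
    if cr_title.strip().lower() != bib_title.strip().lower():
        problems.append(f"title: BibTeX {bib_title!r} vs CrossRef {cr_title!r}")

    # issued.date-parts looks like [[2021, 7, 15]].
    date_parts = message.get("issued", {}).get("date-parts") or [[None]]
    cr_year = date_parts[0][0] if date_parts[0] else None
    bib_year = entry.get("year", "").strip()
    if cr_year is not None and str(cr_year) != bib_year:
        problems.append(f"year: BibTeX {bib_year!r} vs CrossRef {cr_year}")

    # Compare only the first author's family name.
    authors = message.get("author") or []
    cr_first = authors[0].get("family", "") if authors else ""
    bib_first = entry.get("author", "").split(" and ")[0].split(",")[0].strip()
    if cr_first and cr_first.lower() != bib_first.lower():
        problems.append(f"first author: BibTeX {bib_first!r} vs CrossRef {cr_first!r}")

    return problems
```

Typical use would be `compare_with_crossref(entry, fetch_crossref(entry["doi"]))`; an empty list means the three compared fields agree.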

Common Issues

Broken DOIs:

  • Typos in DOI
  • Publisher changed DOI (rare)
  • Article retracted
  • Solution: Find correct DOI from publisher site

Mismatched metadata:

  • BibTeX has old/incorrect information
  • Solution: Re-extract metadata from CrossRef

Missing DOIs:

  • Older articles may not have DOIs
  • Acceptable for pre-2000 publications
  • Add URL or PMID instead

2. Required Fields

Purpose: Ensure all necessary information is present.

Required by Entry Type

@article:

author   % REQUIRED
title    % REQUIRED
journal  % REQUIRED
year     % REQUIRED
volume   % Highly recommended
pages    % Highly recommended
doi      % Highly recommended for modern papers

@book:

author OR editor  % REQUIRED (at least one)
title            % REQUIRED
publisher        % REQUIRED
year             % REQUIRED
isbn             % Recommended

@inproceedings:

author     % REQUIRED
title      % REQUIRED
booktitle  % REQUIRED (conference/proceedings name)
year       % REQUIRED
pages      % Recommended

@incollection (book chapter):

author     % REQUIRED
title      % REQUIRED (chapter title)
booktitle  % REQUIRED (book title)
publisher  % REQUIRED
year       % REQUIRED
editor     % Recommended
pages      % Recommended

@phdthesis:

author  % REQUIRED
title   % REQUIRED
school  % REQUIRED
year    % REQUIRED

@misc (preprints, datasets, etc.):

author  % REQUIRED (if known)
title   % REQUIRED
year    % REQUIRED
howpublished  % Recommended (bioRxiv, Zenodo, etc.)
doi OR url    % At least one required

Validation Script

python scripts/validate_citations.py references.bib --check-required-fields

Output:

Error: Entry 'Smith2024' missing required field 'journal'
Error: Entry 'Doe2023' missing required field 'year'
Warning: Entry 'Jones2022' missing recommended field 'volume'
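
A required-field check of this kind can be sketched as a lookup table. The table below copies the requirements listed above; the entry shape (a dict with `type` and `key` plus one string per field) and the function name are assumptions made for illustration.

```python
# Each tuple lists alternatives: at least one of them must be present.
REQUIRED_FIELDS = {
    "article": [("author",), ("title",), ("journal",), ("year",)],
    "book": [("author", "editor"), ("title",), ("publisher",), ("year",)],
    "inproceedings": [("author",), ("title",), ("booktitle",), ("year",)],
    "incollection": [("author",), ("title",), ("booktitle",), ("publisher",), ("year",)],
    "phdthesis": [("author",), ("title",), ("school",), ("year",)],
}


def missing_required(entry: dict) -> list:
    """Return one error message per missing required field."""
    messages = []
    for alternatives in REQUIRED_FIELDS.get(entry["type"].lower(), []):
        # A field counts as present only if it is non-empty.
        if not any(entry.get(name, "").strip() for name in alternatives):
            label = " or ".join(alternatives)
            messages.append(
                f"Error: Entry '{entry['key']}' missing required field '{label}'"
            )
    return messages
```

Entry types missing from the table are skipped rather than reported, which is a design choice of this sketch.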

3. Author Name Formatting

Purpose: Ensure consistent, correct author name formatting.

Proper Format

Recommended BibTeX format:

author = {Last1, First1 and Last2, First2 and Last3, First3}

Examples:

% Correct
author = {Smith, John}
author = {Smith, John A.}
author = {Smith, John Andrew}
author = {Smith, John and Doe, Jane}
author = {Smith, John and Doe, Jane and Johnson, Mary}

% For many authors
author = {Smith, John and Doe, Jane and others}

% Incorrect
author = {John Smith}  % First Last format (not recommended)
author = {Smith, J.; Doe, J.}  % Semicolon separator (wrong)
author = {Smith J, Doe J}  % Missing commas

Special Cases

Suffixes (Jr., III, etc.):

author = {King, Jr., Martin Luther}

Multiple surnames (hyphenated):

author = {Smith-Jones, Mary}

Van, von, de, etc.:

author = {van der Waals, Johannes}
author = {de Broglie, Louis}

Organizations as authors:

author = {{World Health Organization}}
% Double braces treat as single author

Validation Checks

Automated validation:

python scripts/validate_citations.py references.bib --check-authors

Checks for:

  • Proper separator (and, not &, ; , etc.)
  • Comma placement
  • Empty author fields
  • Malformed names
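
The checks listed above might be approximated as follows. This is a rough sketch with a hypothetical function name; it splits on the word "and", so a braced name that itself contains " and " would be split incorrectly.

```python
import re


def check_author_field(author: str) -> list:
    """Return a list of problems found in a BibTeX author value."""
    if not author.strip():
        return ["empty author field"]

    problems = []
    if ";" in author:
        problems.append("semicolon used as separator; use 'and'")
    if "&" in author:
        problems.append("ampersand used as separator; use 'and'")

    for name in re.split(r"\s+and\s+", author.strip()):
        # "and others" and braced organization names are fine as-is.
        if name == "others" or name.startswith("{"):
            continue
        commas = name.count(",")
        if commas == 0:
            problems.append(f"no comma in {name!r}; expected 'Last, First'")
        elif commas > 2:
            # "Last, Jr., First" has two commas; more suggests a comma-separated list.
            problems.append(f"too many commas in {name!r}; separate authors with 'and'")
    return problems
```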

4. Data Consistency

Purpose: Ensure all fields contain valid, reasonable values.

Year Validation

Valid years:

year = {2024}    % Current/recent
year = {1953}    % Watson & Crick DNA structure (historical)
year = {1665}    % Hooke's Micrographia (very old)

Invalid years:

year = {24}      % Two digits (ambiguous)
year = {202}     % Typo
year = {2025}    % Future (unless accepted/in press)
year = {0}       % Obviously wrong

Check:

  • Four digits
  • Reasonable range (1600-current+1)
  • Not all zeros
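
The three year checks fit in a few lines. The function name and the optional `current_year` parameter (useful for testing) are assumptions of this sketch.

```python
from datetime import date
from typing import Optional


def year_problem(value: str, current_year: Optional[int] = None) -> Optional[str]:
    """Return a description of the problem, or None if the year looks valid."""
    if current_year is None:
        current_year = date.today().year
    text = value.strip()
    # Exactly four digits; this also rejects "24", "202" and "0".
    if not (len(text) == 4 and text.isdigit()):
        return f"year {value!r} is not a four-digit number"
    year = int(text)
    # Allow next year for accepted or in-press work.
    if not 1600 <= year <= current_year + 1:
        return f"year {year} is outside 1600-{current_year + 1}"
    return None
```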

Volume/Number Validation

volume = {123}      % Numeric
volume = {12}       % Valid
number = {3}        % Valid
number = {S1}       % Supplement issue (valid)

Invalid:

volume = {Vol. 123}  % Should be just number
number = {Issue 3}   % Should be just number

Page Range Validation

Correct format:

pages = {123--145}    % En-dash (two hyphens)
pages = {e0123456}    % PLOS-style article ID
pages = {123}         % Single page

Incorrect format:

pages = {123-145}     % Single hyphen (use --)
pages = {pp. 123-145} % Remove "pp."
pages = {123–145}     % Unicode en-dash (may cause issues)

URL Validation

Check:

  • URLs are accessible (return 200 status)
  • HTTPS when available
  • No obvious typos
  • Permanent links (not temporary)

Valid:

url = {https://www.nature.com/articles/nature12345}
url = {https://arxiv.org/abs/2103.14030}

Questionable:

url = {http://...}  % HTTP instead of HTTPS
url = {file:///...} % Local file path
url = {bit.ly/...}  % URL shortener (not permanent)
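
The "questionable" patterns can be flagged without any network access. In this sketch the function name is hypothetical and the shortener list is a small illustrative sample, not an exhaustive one. Checking that a URL actually returns a 200 status would need a separate HTTP request.

```python
from urllib.parse import urlparse

# Illustrative sample of shortener hosts (assumption, not exhaustive).
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}


def url_warnings(url: str) -> list:
    """Return warnings for a BibTeX url value based on its form only."""
    warnings = []
    has_scheme = "://" in url
    # urlparse needs a scheme to find the host, so add one for bare links.
    parsed = urlparse(url if has_scheme else "https://" + url)
    host = (parsed.hostname or "").lower()

    if not has_scheme:
        warnings.append("missing scheme")
    elif parsed.scheme == "http":
        warnings.append("HTTP instead of HTTPS")
    elif parsed.scheme == "file":
        warnings.append("local file path")

    if host in SHORTENER_HOSTS:
        warnings.append("URL shortener (not permanent)")
    return warnings
```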

5. Duplicate Detection

Purpose: Find and remove duplicate entries.

Types of Duplicates

Exact duplicates (same DOI):

@article{Smith2024a,
  doi = {10.1038/nature12345},
  ...
}

@article{Smith2024b,
  doi = {10.1038/nature12345},  % Same DOI!
  ...
}

Near duplicates (similar title/authors):

@article{Smith2024,
  title = {Machine Learning for Drug Discovery},
  ...
}

@article{Smith2024method,
  title = {Machine learning for drug discovery},  % Same, different case
  ...
}

Preprint + Published:

@misc{Smith2023arxiv,
  title = {AlphaFold Results},
  howpublished = {arXiv},
  ...
}

@article{Smith2024,
  title = {AlphaFold Results},  % Same paper, now published
  journal = {Nature},
  ...
}
% Keep published version only

Detection Methods

By DOI (most reliable):

  • Same DOI = exact duplicate
  • Keep one, remove other

By title similarity:

  • Normalize: lowercase, remove punctuation
  • Calculate similarity (e.g., Levenshtein distance)
  • Flag if >90% similar

By author-year-title:

  • Same first author + year + similar title
  • Likely duplicate
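
A sketch of the DOI and title-similarity methods. Note the substitution: it uses the standard library's `difflib.SequenceMatcher` ratio as the similarity measure rather than Levenshtein distance, and the entry shape (dicts with `key`, `doi`, `title`) and function names are assumptions. Comparing every pair is quadratic, which is fine for a few hundred entries.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation and braces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return " ".join(cleaned.split())


def find_duplicates(entries: list, threshold: float = 0.9) -> list:
    """Return (key_a, key_b, reason) tuples for likely duplicates."""
    pairs = []
    for a, b in combinations(entries, 2):
        doi_a = a.get("doi", "").lower()
        doi_b = b.get("doi", "").lower()
        if doi_a and doi_a == doi_b:
            pairs.append((a["key"], b["key"], "same_doi"))
            continue
        title_a = normalize_title(a.get("title", ""))
        title_b = normalize_title(b.get("title", ""))
        # Skip empty titles, which would otherwise compare as identical.
        if title_a and title_b:
            ratio = SequenceMatcher(None, title_a, title_b).ratio()
            if ratio > threshold:
                pairs.append((a["key"], b["key"], "similar_title"))
    return pairs
```

A preprint and its published version usually share a title, so they are flagged as `similar_title`; deciding which one to keep remains a manual step.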

Automated detection:

python scripts/validate_citations.py references.bib --check-duplicates

Output:

Warning: Possible duplicate entries:
  - Smith2024a (DOI: 10.1038/nature12345)
  - Smith2024b (DOI: 10.1038/nature12345)
  Recommendation: Keep one entry, remove the other.

6. Format and Syntax

Purpose: Ensure valid BibTeX syntax.

Common Syntax Errors

Missing commas:

@article{Smith2024,
  author = {Smith, John}   % Missing comma!
  title = {Title}
}
% Should be:
  author = {Smith, John},  % Comma after each field

Unbalanced braces:

title = {Title with {Protected} Text  % Missing closing brace
% Should be:
title = {Title with {Protected} Text}

Missing closing brace for entry:

@article{Smith2024,
  author = {Smith, John},
  title = {Title}
  % Missing closing brace!
% Should end with:
}

Invalid characters in keys:

@article{Smith&Doe2024,  % & not allowed in key
  ...
}
% Use:
@article{SmithDoe2024,
  ...
}

BibTeX Syntax Rules

Entry structure:

@TYPE{citationkey,
  field1 = {value1},
  field2 = {value2},
  ...
  fieldN = {valueN}
}

Citation keys:

  • Alphanumeric and some punctuation (-, _, ., :)
  • No spaces
  • Case-sensitive
  • Unique within file

Field values:

  • Enclosed in {braces} or "quotes"
  • Braces preferred for complex text
  • Numbers can be unquoted: year = 2024

Special characters:

  • { and } for grouping
  • \ for LaTeX commands
  • Protect capitalization: {AlphaFold}
  • Accents: {\"u}, {\'e}, {\aa}

Validation

python scripts/validate_citations.py references.bib --check-syntax

Checks:

  • Valid BibTeX structure
  • Balanced braces
  • Proper commas
  • Valid entry types
  • Unique citation keys
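
The balanced-braces check is the simplest of these to illustrate. The sketch below uses a hypothetical function name and treats a brace preceded by a backslash as escaped. It does not handle an escaped backslash followed by a brace, and it is not a full BibTeX parser.

```python
from typing import Optional


def check_braces(text: str) -> Optional[str]:
    """Return a description of the first brace problem, or None if balanced."""
    depth = 0
    for line_number, line in enumerate(text.splitlines(), start=1):
        previous = ""
        for char in line:
            if previous != "\\":  # skip escaped braces such as \{
                if char == "{":
                    depth += 1
                elif char == "}":
                    depth -= 1
                    if depth < 0:
                        return f"line {line_number}: unexpected closing brace"
            previous = char
    if depth > 0:
        return f"{depth} unclosed brace(s) at end of file"
    return None
```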

Validation Workflow

Step 1: Basic Validation

Run comprehensive validation:

python scripts/validate_citations.py references.bib

Checks all:

  • DOI resolution
  • Required fields
  • Author formatting
  • Data consistency
  • Duplicates
  • Syntax

Step 2: Review Report

Examine validation report:

{
  "total_entries": 150,
  "valid_entries": 140,
  "errors": [
    {
      "entry": "Smith2024",
      "error": "missing_required_field",
      "field": "journal",
      "severity": "high"
    },
    {
      "entry": "Doe2023",
      "error": "invalid_doi",
      "doi": "10.1038/broken",
      "severity": "high"
    }
  ],
  "warnings": [
    {
      "entry": "Jones2022",
      "warning": "missing_recommended_field",
      "field": "volume",
      "severity": "medium"
    }
  ],
  "duplicates": [
    {
      "entries": ["Smith2024a", "Smith2024b"],
      "reason": "same_doi",
      "doi": "10.1038/nature12345"
    }
  ]
}
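
A report in this shape is easy to summarize programmatically. The sketch assumes the JSON structure shown in the example above (keys `errors`, `warnings`, `duplicates`, each item carrying `entry` and `severity`); whether the script emits exactly this schema is an assumption, and `summarize_report` is a hypothetical helper.

```python
from collections import Counter


def summarize_report(report: dict) -> dict:
    """Condense a validation report into counts and a to-do list."""
    issues = report.get("errors", []) + report.get("warnings", [])
    severities = Counter(item.get("severity", "unknown") for item in issues)
    return {
        "total_entries": report.get("total_entries", 0),
        "valid_entries": report.get("valid_entries", 0),
        "by_severity": dict(severities),
        # Entries with errors, deduplicated and sorted for stable output.
        "entries_to_fix": sorted({item["entry"] for item in report.get("errors", [])}),
        "duplicate_groups": len(report.get("duplicates", [])),
    }
```

Load the file with `json.load` and pass the resulting dict to the function.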

Step 3: Fix Issues

High-priority (errors):

  1. Add missing required fields
  2. Fix broken DOIs
  3. Remove duplicates
  4. Correct syntax errors

Medium-priority (warnings):

  1. Add recommended fields
  2. Improve author formatting
  3. Fix page ranges

Low-priority:

  1. Standardize formatting
  2. Add URLs for accessibility

Step 4: Auto-Fix

Use auto-fix for safe corrections:

python scripts/validate_citations.py references.bib \
  --auto-fix \
  --output fixed_references.bib

Auto-fix can:

  • Fix page range format (- to --)
  • Remove “pp.” from pages
  • Standardize author separators
  • Fix common syntax errors
  • Normalize field order

Auto-fix cannot:

  • Add missing information
  • Find correct DOIs
  • Determine which duplicate to keep
  • Fix semantic errors

Step 5: Manual Review

Review auto-fixed file:

# Check what changed
diff references.bib fixed_references.bib

# Review specific entries that had errors
grep -A 10 "Smith2024" fixed_references.bib

Step 6: Re-Validate

Validate after fixes:

python scripts/validate_citations.py fixed_references.bib --verbose

Should show:

✓ All DOIs valid
✓ All required fields present
✓ No duplicates found
✓ Syntax valid
✓ 150/150 entries valid

Validation Checklist

Use this checklist before final submission:

DOI Validation

  • All DOIs resolve correctly
  • Metadata matches between BibTeX and CrossRef
  • No broken or invalid DOIs

Completeness

  • All entries have required fields
  • Modern papers (2000+) have DOIs
  • Authors properly formatted
  • Journals/conferences properly named

Consistency

  • Years are 4-digit numbers
  • Page ranges use -- not -
  • Volume/number are numeric
  • URLs are accessible

Duplicates

  • No entries with same DOI
  • No near-duplicate titles
  • Preprints updated to published versions

Formatting

  • Valid BibTeX syntax
  • Balanced braces
  • Proper commas
  • Unique citation keys

Final Checks

  • Bibliography compiles without errors
  • All citations in text appear in bibliography
  • All bibliography entries cited in text
  • Citation style matches journal requirements

Best Practices

1. Validate Early and Often

# After extraction
python scripts/extract_metadata.py --doi ... --output refs.bib
python scripts/validate_citations.py refs.bib

# After manual edits
python scripts/validate_citations.py refs.bib

# Before submission
python scripts/validate_citations.py refs.bib --strict

2. Use Automated Tools

Don’t validate manually - use scripts:

  • Faster
  • More comprehensive
  • Catches errors humans miss
  • Generates reports

3. Keep Backup

# Before auto-fix
cp references.bib references_backup.bib

# Run auto-fix
python scripts/validate_citations.py references.bib \
  --auto-fix \
  --output references_fixed.bib

# Review changes
diff references.bib references_fixed.bib

# If satisfied, replace
mv references_fixed.bib references.bib

4. Fix High-Priority First

Priority order:

  1. Syntax errors (prevent compilation)
  2. Missing required fields (incomplete citations)
  3. Broken DOIs (broken links)
  4. Duplicates (confusion, wasted space)
  5. Missing recommended fields
  6. Formatting inconsistencies

5. Document Exceptions

For entries that can’t be fixed:

@article{Old1950,
  author = {Smith, John},
  title = {Title},
  journal = {Obscure Journal},
  year = {1950},
  volume = {12},
  pages = {34--56},
  note = {DOI not available for publications before 2000}
}

6. Validate Against Journal Requirements

Different journals have different requirements:

  • Citation style (numbered, author-year)
  • Abbreviations (journal names)
  • Maximum reference count
  • Format (BibTeX, EndNote, manual)

Check journal author guidelines!

Common Validation Issues

Issue 1: Metadata Mismatch

Problem: BibTeX says 2023, CrossRef says 2024.

Cause:

  • Online-first vs print publication
  • Correction/update
  • Extraction error

Solution:

  1. Check actual article
  2. Use more recent/accurate date
  3. Update BibTeX entry
  4. Re-validate

Issue 2: Special Characters

Problem: LaTeX compilation fails on special characters.

Cause:

  • Accented characters (é, ü, ñ)
  • Chemical formulas (H₂O)
  • Math symbols (α, β, ±)

Solution:

% Use LaTeX commands
author = {M{\"u}ller, Hans}  % Müller
title = {Study of H\textsubscript{2}O}  % H₂O
% Or use UTF-8 with proper LaTeX packages

Issue 3: Incomplete Extraction

Problem: Extracted metadata missing fields.

Cause:

  • Source doesn’t provide all metadata
  • Extraction error
  • Incomplete record

Solution:

  1. Check original article
  2. Manually add missing fields
  3. Use alternative source (PubMed vs CrossRef)

Issue 4: Cannot Find Duplicate

Problem: Same paper appears twice, not detected.

Cause:

  • Different DOIs (should be rare)
  • Different titles (abbreviated, typo)
  • Different citation keys

Solution:

  • Manual search for author + year
  • Check for similar titles
  • Remove manually

Summary

Validation ensures citation quality:

  • Accuracy: DOIs resolve, metadata correct
  • Completeness: All required fields present
  • Consistency: Proper formatting throughout
  • No duplicates: Each paper cited once
  • Valid syntax: BibTeX compiles without errors

Always validate before final submission!

Use automated tools:

python scripts/validate_citations.py references.bib

Follow workflow:

  1. Extract metadata
  2. Validate
  3. Fix errors
  4. Re-validate
  5. Submit

Google Scholar Search Guide

Comprehensive guide to searching Google Scholar for academic papers, including advanced search operators, filtering strategies, and metadata extraction.

Overview

Google Scholar offers very broad coverage of academic literature across disciplines:

  • Coverage: Estimated at 100+ million scholarly documents (Google does not publish an official count)
  • Scope: All academic disciplines
  • Content types: Journal articles, books, theses, conference papers, preprints, patents, court opinions
  • Citation tracking: “Cited by” links for forward citation tracking
  • Accessibility: Free to use, no account required

Search for papers containing specific terms anywhere in the document (title, abstract, full text):

CRISPR gene editing
machine learning protein folding
climate change impact agriculture
quantum computing algorithms

Tips:

  • Use specific technical terms
  • Include key acronyms and abbreviations
  • Start broad, then refine
  • Check spelling of technical terms

Use quotation marks to search for exact phrases:

"deep learning"
"CRISPR-Cas9"
"systematic review"
"randomized controlled trial"

When to use:

  • Technical terms that must appear together
  • Proper names
  • Specific methodologies
  • Exact titles

Advanced Search Operators

Find papers by specific authors:

author:LeCun
author:"Geoffrey Hinton"
author:Church synthetic biology

Variations:

  • Single last name: author:Smith
  • Full name in quotes: author:"Jane Smith"
  • Author + topic: author:Doudna CRISPR

Tips:

  • Authors may publish under different name variations
  • Try with and without middle initials
  • Consider name changes (marriage, etc.)
  • Use quotation marks for full names

Search only in article titles:

intitle:transformer
intitle:"attention mechanism"
intitle:review climate change

Use cases:

  • Finding papers specifically about a topic
  • More precise than full-text search
  • Reduces irrelevant results
  • Good for finding reviews or methods

Search within specific journals or conferences:

source:Nature
source:"Nature Communications"
source:NeurIPS
source:"Journal of Machine Learning Research"

Applications:

  • Track publications in top-tier venues
  • Find papers in specialized journals
  • Identify conference-specific work
  • Verify publication venue

Exclusion Operator

Exclude terms from results:

machine learning -survey
CRISPR -patent
climate change -news
deep learning -tutorial -review

Common exclusions:

  • -survey: Exclude survey papers
  • -review: Exclude review articles
  • -patent: Exclude patents
  • -book: Exclude books
  • -news: Exclude news articles
  • -tutorial: Exclude tutorials

OR Operator

Search for papers containing any of multiple terms:

"machine learning" OR "deep learning"
CRISPR OR "gene editing"
"climate change" OR "global warming"

Best practices:

  • OR must be uppercase
  • Combine synonyms
  • Include acronyms and spelled-out versions
  • Use with exact phrases

Use asterisk (*) as wildcard for unknown words:

"machine * learning"
"CRISPR * editing"
"* neural network"

Note: Limited wildcard support in Google Scholar compared to other databases.

Advanced Filtering

Year Range

Filter by publication year:

Using interface:

  • Click “Since [year]” on left sidebar
  • Select custom range

Using search operators:

# No year operator in the search query itself
# Use the sidebar, or the as_ylo / as_yhi URL parameters (e.g. &as_ylo=2020&as_yhi=2024)

In script:

python scripts/search_google_scholar.py "quantum computing" \
  --year-start 2020 \
  --year-end 2024
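When a custom range is set in the web interface, the year bounds appear in the address bar as `as_ylo` and `as_yhi`. The sketch below only builds such a URL. The helper name `scholar_url` is hypothetical, and the parameters reflect observed interface behavior rather than a documented API, so they may change.

```python
from urllib.parse import urlencode

def scholar_url(query: str, year_start=None, year_end=None) -> str:
    """Build a Google Scholar search URL with an optional year range."""
    params = {"q": query}
    if year_start is not None:
        params["as_ylo"] = year_start  # lower year bound
    if year_end is not None:
        params["as_yhi"] = year_end    # upper year bound
    return "https://scholar.google.com/scholar?" + urlencode(params)

print(scholar_url("quantum computing", 2020, 2024))
```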

Sorting Options

By relevance (default):

  • Google’s algorithm determines relevance
  • Considers citations, author reputation, publication venue
  • Generally good for most searches

By date:

  • Most recent papers first
  • Good for fast-moving fields
  • May miss highly cited older papers
  • Click “Sort by date” in interface

By citation count (via script):

python scripts/search_google_scholar.py "transformers" \
  --sort-by citations \
  --limit 50

Language Filtering

In interface:

  • Settings → Languages
  • Select preferred languages

Default: English and papers with English abstracts

Search Strategies

Finding Seminal Papers

Identify highly influential papers in a field:

  1. Search by topic with broad terms
  2. Sort by citations (most cited first)
  3. Look for review articles for comprehensive overviews
  4. Check publication dates for foundational vs recent work

Example:

"generative adversarial networks"
# Sort by citations
# Top results: original GAN paper (Goodfellow et al., 2014), key variants

Finding Recent Work

Stay current with latest research:

  1. Search by topic
  2. Filter to recent years (last 1-2 years)
  3. Sort by date for newest first
  4. Set up alerts for ongoing tracking

Example:

python scripts/search_google_scholar.py "AlphaFold protein structure" \
  --year-start 2023 \
  --year-end 2024 \
  --limit 50

Finding Review Articles

Get comprehensive overviews of a field:

intitle:review "machine learning"
"systematic review" CRISPR
intitle:survey "natural language processing"

Indicators:

  • “review”, “survey”, “perspective” in title
  • Often highly cited
  • Published in review journals (Nature Reviews, Trends, etc.)
  • Comprehensive reference lists

Forward citations (papers citing a key paper):

  1. Find seminal paper
  2. Click “Cited by X”
  3. See all papers that cite it
  4. Identify how field has developed

Backward citations (references in a key paper):

  1. Find recent review or important paper
  2. Check its reference list
  3. Identify foundational work
  4. Trace development of ideas

Example workflow:

# Find original transformer paper
"Attention is all you need" author:Vaswani

# Check "Cited by 120,000+"
# See evolution: BERT, GPT, T5, etc.

# Check references in original paper
# Find RNN, LSTM, attention mechanism origins

For thorough coverage (e.g., systematic reviews):

  1. Generate synonym list:

    • Main terms + alternatives
    • Acronyms + spelled out
    • US vs UK spelling
  2. Use OR operators:

    ("machine learning" OR "deep learning" OR "neural networks")
  3. Combine multiple concepts:

    ("machine learning" OR "deep learning") ("drug discovery" OR "drug development")
  4. Search without date filters initially:

    • Get total landscape
    • Filter later if too many results
  5. Export results for systematic analysis:

    python scripts/search_google_scholar.py \
      '"machine learning" OR "deep learning" drug discovery' \
      --limit 500 \
      --output comprehensive_search.json
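The synonym-list and OR-group steps above can be sketched as a small query builder. The function names are illustrative assumptions; the output is a plain string to paste into the search box or pass to a search script.

```python
def or_group(terms):
    """Quote multi-word terms and join alternatives with uppercase OR."""
    if not terms:
        raise ValueError("each concept needs at least one term")
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return quoted[0] if len(quoted) == 1 else "(" + " OR ".join(quoted) + ")"

def build_query(*concepts):
    """Combine concepts: alternatives within a concept are ORed, concepts are ANDed."""
    return " ".join(or_group(c) for c in concepts)

print(build_query(
    ["machine learning", "deep learning"],
    ["drug discovery", "drug development"],
))
```

Keeping each concept as its own list makes it easy to add spelling variants (for example, "tumor" and "tumour") without retyping the whole query.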

Extracting Citation Information

From Google Scholar Results Page

Each result shows:

  • Title: Paper title (linked to full text if available)
  • Authors: Author list (often truncated)
  • Source: Journal/conference, year, publisher
  • Cited by: Number of citations + link to citing papers
  • Related articles: Link to similar papers
  • All versions: Different versions of the same paper

Export Options

Manual export:

  1. Click “Cite” under paper
  2. Select BibTeX format
  3. Copy citation

Limitations:

  • One paper at a time
  • Manual process
  • Time-consuming for many papers

Automated export (using script):

# Search and export to BibTeX
python scripts/search_google_scholar.py "quantum computing" \
  --limit 50 \
  --format bibtex \
  --output quantum_papers.bib

Metadata Available

From Google Scholar you can typically extract:

  • Title
  • Authors (may be incomplete)
  • Year
  • Source (journal/conference)
  • Citation count
  • Link to full text (when available)
  • Link to PDF (when available)

Note: Metadata quality varies:

  • Some fields may be missing
  • Author names may be incomplete
  • Need to verify with DOI lookup for accuracy

Rate Limiting and Access

Rate Limits

Google Scholar has rate limiting to prevent automated scraping:

Symptoms of rate limiting:

  • CAPTCHA challenges
  • Temporary IP blocks
  • 429 “Too Many Requests” errors

Best practices:

  1. Add delays between requests: 2-5 seconds minimum
  2. Limit query volume: Don’t search hundreds of queries rapidly
  3. Use the scholarly library: Wraps Scholar queries and supports proxy configuration; still add your own delays
  4. Rotate User-Agents: Appear as different browsers
  5. Consider proxies: For large-scale searches (use ethically)

In our scripts:

# Automatic rate limiting built in
import random
import time

time.sleep(random.uniform(3, 7))  # Random delay of 3-7 seconds
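A fixed random delay can be extended with backoff so that repeated failures (CAPTCHA, 429) slow the script down further. This is a sketch under assumptions: the helper name `polite_delay`, the doubling factor, and the 120-second cap are illustrative and not taken from the skill's scripts.

```python
import random

def polite_delay(attempt: int = 0, base=(3.0, 7.0), cap: float = 120.0) -> float:
    """Seconds to wait before the next request.

    attempt=0 gives the normal 3-7 second jitter; each failed attempt
    doubles the wait, up to `cap`.
    """
    low, high = base
    delay = random.uniform(low, high) * (2 ** attempt)
    return min(delay, cap)

# Typical use inside a request loop:
#     time.sleep(polite_delay(attempt))
print(round(polite_delay(), 1), round(polite_delay(attempt=2), 1))
```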

Ethical Considerations

DO:

  • Respect rate limits
  • Use reasonable delays
  • Cache results (don’t re-query)
  • Use official APIs when available
  • Attribute data properly

DON’T:

  • Scrape aggressively
  • Use multiple IPs to bypass limits
  • Violate terms of service
  • Burden servers unnecessarily
  • Use data commercially without permission

Institutional Access

Benefits of institutional access:

  • Access to full-text PDFs through library subscriptions
  • Better download capabilities
  • Integration with library systems
  • Link resolver to full text

Setup:

  • Google Scholar → Settings → Library links
  • Add your institution
  • Links appear in search results

Tips and Best Practices

Search Optimization

  1. Start simple, then refine:

    # Too specific initially
    intitle:"deep learning" intitle:review source:Nature 2023..2024
    
    # Better approach
    deep learning review
    # Review results
    # Add intitle:, source:, year filters as needed
  2. Use multiple search strategies:

    • Keyword search
    • Author search for known experts
    • Citation chaining from key papers
    • Source search in top journals
  3. Check spelling and variations:

    • Color vs colour
    • Optimization vs optimisation
    • Tumor vs tumour
    • Try common misspellings if few results
  4. Combine operators strategically:

    # Good combination
    author:Church intitle:"synthetic biology"
    
    # Finds papers by a specific author on a topic; set the year range (e.g., 2015-2024) in the sidebar

Result Evaluation

  1. Check citation counts:

    • High citations indicate influence
    • Recent papers may have low citations but be important
    • Citation counts vary by field
  2. Verify publication venue:

    • Peer-reviewed journals vs preprints
    • Conference proceedings
    • Book chapters
    • Technical reports
  3. Check for full text access:

    • [PDF] link on right side
    • “All X versions” may have open access version
    • Check institutional access
    • Try author’s website or ResearchGate
  4. Look for review articles:

    • Comprehensive overviews
    • Good starting point for new topics
    • Extensive reference lists

Managing Results

  1. Use citation manager integration:

    • Export to BibTeX
    • Import to Zotero, Mendeley, EndNote
    • Maintain organized library
  2. Set up alerts for ongoing research:

    • Google Scholar → Alerts
    • Get emails for new papers matching query
    • Track specific authors or topics
  3. Create collections:

    • Save papers to Google Scholar Library
    • Organize by project or topic
    • Add labels and notes
  4. Export systematically:

    # Save search results for later analysis
    python scripts/search_google_scholar.py "your topic" \
      --output topic_papers.json
    
    # Can re-process later without re-searching
    python scripts/extract_metadata.py \
      --input topic_papers.json \
      --output topic_refs.bib

Advanced Techniques

Boolean Logic Combinations

Combine multiple operators for precise searches:

# Highly cited reviews on specific topic by known authors
intitle:review "machine learning" ("drug discovery" OR "drug development")
author:Horvath OR author:Bengio 2020..2024

# Method papers excluding reviews
intitle:method "protein folding" -review -survey

# Papers in top journals only
("Nature" OR "Science" OR "Cell") CRISPR 2022..2024

Finding Open Access Papers

# Search with generic terms
machine learning

# Filter by "All versions" which often includes preprints
# Look for green [PDF] links (often open access)
# Check arXiv, bioRxiv versions

In script:

python scripts/search_google_scholar.py "topic" \
  --open-access-only \
  --output open_access_papers.json

Tracking Research Impact

For a specific paper:

  1. Find the paper
  2. Click “Cited by X”
  3. Analyze citing papers:
    • How is it being used?
    • What fields cite it?
    • Recent vs older citations?

For an author:

  1. Search author:LastName
  2. Check h-index and i10-index
  3. View citation history graph
  4. Identify most influential papers

For a topic:

  1. Search topic
  2. Sort by citations
  3. Identify seminal papers (highly cited, older)
  4. Check recent highly-cited papers (emerging important work)

Finding Preprints and Early Work

# arXiv papers
source:arxiv "deep learning"

# bioRxiv papers
source:biorxiv CRISPR

# All preprint servers
("arxiv" OR "biorxiv" OR "medrxiv") your topic

Note: Preprints are not peer-reviewed. Always check if published version exists.

Common Issues and Solutions

Too Many Results

Problem: Search returns 100,000+ results, overwhelming.

Solutions:

  1. Add more specific terms
  2. Use intitle: to search only titles
  3. Filter by recent years
  4. Add exclusions (e.g., -review)
  5. Search within specific journals

Too Few Results

Problem: Search returns 0-10 results, suspiciously few.

Solutions:

  1. Remove restrictive operators
  2. Try synonyms and related terms
  3. Check spelling
  4. Broaden year range
  5. Use OR for alternative terms

Irrelevant Results

Problem: Results don’t match intent.

Solutions:

  1. Use exact phrases with quotes
  2. Add more specific context terms
  3. Use intitle: for title-only search
  4. Exclude common irrelevant terms
  5. Combine multiple specific terms

CAPTCHA or Rate Limiting

Problem: Google Scholar shows CAPTCHA or blocks access.

Solutions:

  1. Wait several minutes before continuing
  2. Reduce query frequency
  3. Use longer delays in scripts (5-10 seconds)
  4. Switch to different IP/network
  5. Consider using institutional access

Missing Metadata

Problem: Author names, year, or venue missing from results.

Solutions:

  1. Click through to see full details
  2. Check “All versions” for better metadata
  3. Look up by DOI if available
  4. Extract metadata from CrossRef/PubMed instead
  5. Manually verify from paper PDF

Duplicate Results

Problem: Same paper appears multiple times.

Solutions:

  1. Click “All X versions” to see consolidated view
  2. Choose version with best metadata
  3. Use deduplication in post-processing:
    python scripts/format_bibtex.py results.bib \
      --deduplicate \
      --output clean_results.bib

Integration with Scripts

search_google_scholar.py Usage

Basic search:

python scripts/search_google_scholar.py "machine learning drug discovery"

With year filter:

python scripts/search_google_scholar.py "CRISPR" \
  --year-start 2020 \
  --year-end 2024 \
  --limit 100

Sort by citations:

python scripts/search_google_scholar.py "transformers" \
  --sort-by citations \
  --limit 50

Export to BibTeX:

python scripts/search_google_scholar.py "quantum computing" \
  --format bibtex \
  --output quantum.bib

Export to JSON for later processing:

python scripts/search_google_scholar.py "topic" \
  --format json \
  --output results.json

# Later: extract full metadata
python scripts/extract_metadata.py \
  --input results.json \
  --output references.bib

Batch Searching

For multiple topics:

# Create file with search queries (queries.txt)
# One query per line

# Search each query
while IFS= read -r query; do
  python scripts/search_google_scholar.py "$query" \
    --limit 50 \
    --output "${query// /_}.json"
  sleep 10  # Delay between queries
done < queries.txt

Summary

Google Scholar is one of the most comprehensive academic search engines, providing:

Broad coverage: All disciplines, 100M+ documents
Free access: No account or subscription required
Citation tracking: “Cited by” for impact analysis
Multiple formats: Articles, books, theses, patents
Full-text search: Not just abstracts

Key strategies:

  • Use advanced operators for precision
  • Combine author, title, source searches
  • Track citations for impact
  • Export systematically to citation manager
  • Respect rate limits and access policies
  • Verify metadata with CrossRef/PubMed

For biomedical research, complement with PubMed for MeSH terms and curated metadata.


Reference: Metadata_Extraction

Metadata Extraction Guide

Comprehensive guide to extracting accurate citation metadata from DOIs, PMIDs, arXiv IDs, and URLs using various APIs and services.

Overview

Accurate metadata is essential for proper citations. This guide covers:

  • Identifying paper identifiers (DOI, PMID, arXiv ID)
  • Querying metadata APIs (CrossRef, PubMed, arXiv, DataCite)
  • Required BibTeX fields by entry type
  • Handling edge cases and special situations
  • Validating extracted metadata

Paper Identifiers

DOI (Digital Object Identifier)

Format: 10.XXXX/suffix

Examples:

10.1038/s41586-021-03819-2    # Nature article
10.1126/science.aam9317       # Science article
10.1016/j.cell.2023.01.001    # Cell article
10.1371/journal.pone.0123456  # PLOS ONE article

Properties:

  • Permanent identifier
  • Most reliable for metadata
  • Resolves to current location
  • Publisher-assigned

Where to find:

  • First page of article
  • Article webpage
  • CrossRef, Google Scholar, PubMed
  • Usually prominent on publisher site

PMID (PubMed ID)

Format: Integer of 1 to 8 digits (recent records have 8)

Examples:

34265844
28445112
35476778

Properties:

  • Specific to PubMed database
  • Biomedical literature only
  • Assigned by NCBI
  • Permanent identifier

Where to find:

  • PubMed search results
  • Article page on PubMed
  • Often in article PDF footer
  • PMC (PubMed Central) pages

PMCID (PubMed Central ID)

Format: PMC followed by numbers

Examples:

PMC8287551
PMC7456789

Properties:

  • Free full-text articles in PMC
  • Subset of PubMed articles
  • Open access or author manuscripts

arXiv ID

Format: YYMM.NNNNN (YYMM.NNNN before 2015) or archive/YYMMNNN (before April 2007)

Examples:

2103.14030        # New-style ID (5-digit sequence numbers since 2015)
2401.12345        # 2024 submission
arXiv:hep-th/9901001  # Old format

Properties:

  • Preprints (not peer-reviewed)
  • Physics, math, CS, q-bio, etc.
  • Version tracking (v1, v2, etc.)
  • Free, open access

Where to find:

  • arXiv.org
  • Often cited before publication
  • Paper PDF header

Other Identifiers

ISBN (Books):

978-0-12-345678-9
0-123-45678-9

arXiv category:

cs.LG    # Computer Science - Machine Learning
q-bio.QM # Quantitative Biology - Quantitative Methods
math.ST  # Mathematics - Statistics

Metadata APIs

CrossRef API

Primary source for DOIs - Most comprehensive metadata for journal articles.

Base URL: https://api.crossref.org/works/

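No API key required, but the polite pool is recommended:

  • Add a contact email to the User-Agent header (or a mailto query parameter)
  • Requests are routed to a more reliable pool
  • Rate limits still apply; back off on 429 responses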
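A polite-pool request only needs a contact address in the headers. The sketch below builds the request object without sending it. The helper name `crossref_request` and the tool string `citation-tools/0.1` are placeholders, and the email is an example value to replace with your own.

```python
from urllib.parse import quote
from urllib.request import Request

def crossref_request(doi: str, email: str, tool: str = "citation-tools/0.1") -> Request:
    """Prepare a CrossRef works lookup that identifies the caller by email."""
    url = "https://api.crossref.org/works/" + quote(doi, safe="/")
    return Request(url, headers={"User-Agent": f"{tool} (mailto:{email})"})

req = crossref_request("10.1038/s41586-021-03819-2", "you@example.org")
print(req.full_url)
# To send it: urllib.request.urlopen(req, timeout=30), then json.load() the response.
```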

Basic DOI Lookup

Request:

GET https://api.crossref.org/works/10.1038/s41586-021-03819-2

Response (simplified; values are illustrative placeholders):

{
  "message": {
    "DOI": "10.1038/s41586-021-03819-2",
    "title": ["Article title here"],
    "author": [
      {"given": "John", "family": "Smith"},
      {"given": "Jane", "family": "Doe"}
    ],
    "container-title": ["Nature"],
    "volume": "595",
    "issue": "7865",
    "page": "123-128",
    "published-print": {"date-parts": [[2021, 7, 1]]},
    "publisher": "Springer Nature",
    "type": "journal-article",
    "ISSN": ["0028-0836"]
  }
}

Fields Available

Always present:

  • DOI: Digital Object Identifier
  • title: Article title (array)
  • type: Content type (journal-article, book-chapter, etc.)

Usually present:

  • author: Array of author objects
  • container-title: Journal/book title
  • published-print or published-online: Publication date
  • volume, issue, page: Publication details
  • publisher: Publisher name

Sometimes present:

  • abstract: Article abstract
  • subject: Subject categories
  • ISSN: Journal ISSN
  • ISBN: Book ISBN
  • reference: Reference list
  • is-referenced-by-count: Citation count
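Because date and author fields are only "usually present", extraction code benefits from explicit fallbacks. The sketch below works on the `message` object of a CrossRef response. The function names and the preference order (print, then online, then `issued`) are assumptions; pick the order that matches your citation style.

```python
def crossref_year(message: dict):
    """Return the first available publication year, or None."""
    for field in ("published-print", "published-online", "issued"):
        parts = message.get(field, {}).get("date-parts", [[None]])
        if parts and parts[0] and parts[0][0]:
            return int(parts[0][0])
    return None

def crossref_authors(message: dict) -> str:
    """Join authors as 'Family, Given and Family, Given' for BibTeX."""
    names = []
    for a in message.get("author", []):
        if "family" in a:
            names.append(f"{a['family']}, {a.get('given', '')}".rstrip(", "))
        elif "name" in a:  # organisational author
            names.append("{" + a["name"] + "}")
    return " and ".join(names)

message = {
    "author": [{"given": "John", "family": "Smith"}, {"given": "Jane", "family": "Doe"}],
    "published-print": {"date-parts": [[2021, 7, 1]]},
}
print(crossref_year(message), "|", crossref_authors(message))
```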

Content Types

CrossRef type field values:

  • journal-article: Journal articles
  • book-chapter: Book chapters
  • book: Books
  • proceedings-article: Conference papers
  • posted-content: Preprints
  • dataset: Research datasets
  • report: Technical reports
  • dissertation: Theses/dissertations

PubMed E-utilities API

Specialized for biomedical literature - Curated metadata with MeSH terms.

Base URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/

API key recommended (free):

  • Higher rate limits
  • Better performance

PMID to Metadata

Step 1: EFetch for full record

GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
  db=pubmed&
  id=34265844&
  retmode=xml&
  api_key=YOUR_KEY

Response: XML with comprehensive metadata

Step 2: Parse XML

Key fields (abridged; values are illustrative):

<PubmedArticle>
  <MedlineCitation>
    <PMID>34265844</PMID>
    <Article>
      <ArticleTitle>Title here</ArticleTitle>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>John</ForeName></Author>
      </AuthorList>
      <Journal>
        <Title>Nature</Title>
        <JournalIssue>
          <Volume>595</Volume>
          <Issue>7865</Issue>
          <PubDate><Year>2021</Year></PubDate>
        </JournalIssue>
      </Journal>
      <Pagination><MedlinePgn>123-128</MedlinePgn></Pagination>
      <Abstract><AbstractText>Abstract text here</AbstractText></Abstract>
    </Article>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="doi">10.1038/s41586-021-03819-2</ArticleId>
      <ArticleId IdType="pmc">PMC8287551</ArticleId>
    </ArticleIdList>
  </PubmedData>
</PubmedArticle>
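The record above can be parsed with the standard library. This is a minimal sketch: `parse_pubmed_article` is a hypothetical helper, the sample XML is an abridged copy of the structure shown above with placeholder values, and titles containing inline markup would need `itertext()` instead of `findtext()`.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID>34265844</PMID>
    <Article>
      <Journal>
        <JournalIssue><Volume>595</Volume><PubDate><Year>2021</Year></PubDate></JournalIssue>
        <Title>Nature</Title>
      </Journal>
      <ArticleTitle>Title here</ArticleTitle>
      <Pagination><MedlinePgn>123-128</MedlinePgn></Pagination>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>John</ForeName></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="doi">10.1038/s41586-021-03819-2</ArticleId>
    </ArticleIdList>
  </PubmedData>
</PubmedArticle>"""

def parse_pubmed_article(xml_text: str) -> dict:
    """Extract the main citation fields from one PubmedArticle record."""
    root = ET.fromstring(xml_text)
    node = root if root.tag == "PubmedArticle" else root.find(".//PubmedArticle")
    authors = [
        ", ".join(filter(None, [a.findtext("LastName"), a.findtext("ForeName")]))
        for a in node.iterfind(".//AuthorList/Author")
        if a.findtext("LastName")  # skips collective-name authors
    ]
    ids = {i.get("IdType"): i.text for i in node.iterfind(".//ArticleIdList/ArticleId")}
    return {
        "pmid": node.findtext("./MedlineCitation/PMID"),
        "title": node.findtext(".//ArticleTitle"),
        "journal": node.findtext(".//Journal/Title"),
        "year": node.findtext(".//JournalIssue/PubDate/Year"),
        "volume": node.findtext(".//JournalIssue/Volume"),
        "pages": node.findtext(".//Pagination/MedlinePgn"),
        "authors": authors,
        "doi": ids.get("doi"),
    }

record = parse_pubmed_article(SAMPLE)
print(record["pmid"], record["doi"])
```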

Unique PubMed Fields

MeSH Terms: Controlled vocabulary

<MeshHeadingList>
  <MeshHeading>
    <DescriptorName UI="D003920">Diabetes Mellitus</DescriptorName>
  </MeshHeading>
</MeshHeadingList>

Publication Types:

<PublicationTypeList>
  <PublicationType UI="D016428">Journal Article</PublicationType>
  <PublicationType UI="D016449">Randomized Controlled Trial</PublicationType>
</PublicationTypeList>

Grant Information:

<GrantList>
  <Grant>
    <GrantID>R01-123456</GrantID>
    <Agency>NIAID NIH HHS</Agency>
    <Country>United States</Country>
  </Grant>
</GrantList>

arXiv API

Preprints in physics, math, CS, q-bio - Free, open access.

Base URL: http://export.arxiv.org/api/query

No API key required

arXiv ID to Metadata

Request:

GET http://export.arxiv.org/api/query?id_list=2103.14030

Response: Atom XML (abridged; values are illustrative placeholders)

<entry>
  <id>http://arxiv.org/abs/2103.14030v2</id>
  <title>Preprint title here</title>
  <author><name>First Author</name></author>
  <author><name>Second Author</name></author>
  <published>2021-03-26T17:47:17Z</published>
  <updated>2021-07-01T16:51:46Z</updated>
  <summary>Abstract text here...</summary>
  <arxiv:doi>10.XXXX/suffix</arxiv:doi>
  <category term="cs.CV" scheme="http://arxiv.org/schemas/atom"/>
  <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
</entry>

Key Fields

  • id: arXiv URL
  • title: Preprint title
  • author: Author list
  • published: First version date
  • updated: Latest version date
  • summary: Abstract
  • arxiv:doi: DOI if published
  • arxiv:journal_ref: Journal reference if published
  • category: arXiv categories
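The fields listed above live in two XML namespaces (Atom and arXiv), which is the usual stumbling block when parsing. The following sketch uses a hand-written sample feed with placeholder values; `parse_arxiv_entry` is a hypothetical helper, not the skill's implementation.

```python
import re
import xml.etree.ElementTree as ET

NS = {"atom": "http://www.w3.org/2005/Atom", "arxiv": "http://arxiv.org/schemas/atom"}

SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom" xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/2103.14030v2</id>
    <title>Example preprint
      title</title>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
    <published>2021-03-26T17:47:17Z</published>
    <category term="cs.CV" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>"""

def parse_arxiv_entry(xml_text: str) -> dict:
    """Extract citation fields from the first entry of an arXiv Atom feed."""
    entry = ET.fromstring(xml_text).find("atom:entry", NS)
    raw_id = entry.findtext("atom:id", namespaces=NS).rsplit("/abs/", 1)[-1]
    base_id, version = re.match(r"(.+?)(?:v(\d+))?$", raw_id).groups()
    return {
        "arxiv_id": base_id,
        "version": version,
        # Titles may be wrapped across lines in the feed; collapse whitespace.
        "title": " ".join(entry.findtext("atom:title", namespaces=NS).split()),
        "authors": [a.findtext("atom:name", namespaces=NS)
                    for a in entry.findall("atom:author", NS)],
        "year": entry.findtext("atom:published", namespaces=NS)[:4],
        "doi": entry.findtext("arxiv:doi", default=None, namespaces=NS),
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
    }

info = parse_arxiv_entry(SAMPLE)
print(info["arxiv_id"], info["version"], info["doi"])
```

A `doi` of `None` means the feed reports no published version, so the entry would be cited as a preprint.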

Version Tracking

arXiv tracks versions:

  • v1: Initial submission
  • v2, v3, etc.: Revisions

Always check if preprint has been published in journal (use DOI if available).

DataCite API

Research datasets, software, other outputs - Assigns DOIs to non-traditional scholarly works.

Base URL: https://api.datacite.org/dois/

Similar to CrossRef but for datasets, software, code, etc.

Request:

GET https://api.datacite.org/dois/10.5281/zenodo.1234567

Response: JSON with metadata for dataset/software

Required BibTeX Fields

@article (Journal Articles)

Required:

  • author: Author names
  • title: Article title
  • journal: Journal name
  • year: Publication year

Optional but recommended:

  • volume: Volume number
  • number: Issue number
  • pages: Page range (e.g., 123--145)
  • doi: Digital Object Identifier
  • url: URL if no DOI
  • month: Publication month

Example:

@article{Smith2024,
  author  = {Smith, John and Doe, Jane},
  title   = {Novel Approach to Protein Folding},
  journal = {Nature},
  year    = {2024},
  volume  = {625},
  number  = {8001},
  pages   = {123--145},
  doi     = {10.1038/nature12345}
}

@book (Books)

Required:

  • author or editor: Author(s) or editor(s)
  • title: Book title
  • publisher: Publisher name
  • year: Publication year

Optional but recommended:

  • edition: Edition number (if not first)
  • address: Publisher location
  • isbn: ISBN
  • url: URL
  • series: Series name

Example:

@book{Kumar2021,
  author    = {Kumar, Vinay and Abbas, Abul K. and Aster, Jon C.},
  title     = {Robbins and Cotran Pathologic Basis of Disease},
  publisher = {Elsevier},
  year      = {2021},
  edition   = {10},
  isbn      = {978-0-323-53113-9}
}

@inproceedings (Conference Papers)

Required:

  • author: Author names
  • title: Paper title
  • booktitle: Conference/proceedings name
  • year: Year

Optional but recommended:

  • pages: Page range
  • organization: Organizing body
  • publisher: Publisher
  • address: Conference location
  • month: Conference month
  • doi: DOI if available

Example:

@inproceedings{Vaswani2017,
  author    = {Vaswani, Ashish and Shazeer, Noam and others},
  title     = {Attention is All You Need},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2017},
  pages     = {5998--6008},
  volume    = {30}
}

@incollection (Book Chapters)

Required:

  • author: Chapter author(s)
  • title: Chapter title
  • booktitle: Book title
  • publisher: Publisher name
  • year: Publication year

Optional but recommended:

  • editor: Book editor(s)
  • pages: Chapter page range
  • chapter: Chapter number
  • edition: Edition
  • address: Publisher location

Example:

@incollection{Brown2020,
  author    = {Brown, Patrick O. and Botstein, David},
  title     = {Exploring the New World of the Genome with {DNA} Microarrays},
  booktitle = {DNA Microarrays: A Molecular Cloning Manual},
  editor    = {Eisen, Michael B. and Brown, Patrick O.},
  publisher = {Cold Spring Harbor Laboratory Press},
  year      = {2020},
  pages     = {1--45}
}

@phdthesis (Dissertations)

Required:

  • author: Author name
  • title: Thesis title
  • school: Institution
  • year: Year

Optional:

  • type: Type (e.g., “PhD dissertation”)
  • address: Institution location
  • month: Month
  • url: URL

Example:

@phdthesis{Johnson2023,
  author = {Johnson, Mary L.},
  title  = {Novel Approaches to Cancer Immunotherapy},
  school = {Stanford University},
  year   = {2023},
  type   = {{PhD} dissertation}
}

@misc (Preprints, Software, Datasets)

Required:

  • author: Author(s)
  • title: Title
  • year: Year

For preprints, add:

  • howpublished: Repository (e.g., “bioRxiv”)
  • doi: Preprint DOI
  • note: Preprint ID

Example (preprint):

@misc{Zhang2024,
  author       = {Zhang, Yi and Chen, Li and Wang, Hui},
  title        = {Novel Therapeutic Targets in Alzheimer's Disease},
  year         = {2024},
  howpublished = {bioRxiv},
  doi          = {10.1101/2024.01.001},
  note         = {Preprint}
}

Example (software):

@misc{AlphaFold2021,
  author       = {DeepMind},
  title        = {{AlphaFold} Protein Structure Database},
  year         = {2021},
  howpublished = {Software},
  url          = {https://alphafold.ebi.ac.uk/},
  doi          = {10.5281/zenodo.5123456}
}
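The required-field lists in this section translate directly into a lookup table. This is a sketch of how a completeness check might look, assuming entries are already parsed into a type string and a dict of fields; `missing_fields` is a hypothetical name, and the `misc` requirements follow this guide rather than the BibTeX styles, which treat all `@misc` fields as optional.

```python
# Each tuple lists alternatives; at least one of them must be present.
REQUIRED = {
    "article": [("author",), ("title",), ("journal",), ("year",)],
    "book": [("author", "editor"), ("title",), ("publisher",), ("year",)],
    "inproceedings": [("author",), ("title",), ("booktitle",), ("year",)],
    "incollection": [("author",), ("title",), ("booktitle",), ("publisher",), ("year",)],
    "phdthesis": [("author",), ("title",), ("school",), ("year",)],
    "misc": [("author",), ("title",), ("year",)],
}

def missing_fields(entry_type: str, fields: dict) -> list:
    """Return the required fields that are absent or empty for this entry type."""
    missing = []
    for alternatives in REQUIRED.get(entry_type.lower(), []):
        if not any(fields.get(name, "").strip() for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing

print(missing_fields("article", {"author": "Smith, John", "title": "T", "year": "2024"}))
```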

Extraction Workflows

From DOI

Best practice - Most reliable source:

# Single DOI
python scripts/extract_metadata.py --doi 10.1038/s41586-021-03819-2

# Multiple DOIs
python scripts/extract_metadata.py \
  --doi 10.1038/nature12345 \
  --doi 10.1126/science.abc1234 \
  --output refs.bib

Process:

  1. Query CrossRef API with DOI
  2. Parse JSON response
  3. Extract required fields
  4. Determine entry type (@article, @book, etc.)
  5. Format as BibTeX
  6. Validate completeness
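Steps 4 and 5 (choosing the entry type and formatting) can be sketched as follows. The mapping from CrossRef `type` values to BibTeX entry types is an assumption for illustration (in particular, `report` to `@techreport` and the `@misc` fallback), and `format_bibtex` is a hypothetical helper rather than the script's actual function.

```python
CROSSREF_TO_BIBTEX = {
    "journal-article": "article",
    "book": "book",
    "book-chapter": "incollection",
    "proceedings-article": "inproceedings",
    "dissertation": "phdthesis",
    "posted-content": "misc",
    "dataset": "misc",
    "report": "techreport",
}

def format_bibtex(key: str, crossref_type: str, fields: dict) -> str:
    """Render one BibTeX entry with aligned '=' signs; empty fields are skipped."""
    entry_type = CROSSREF_TO_BIBTEX.get(crossref_type, "misc")
    width = max((len(name) for name in fields), default=0)
    lines = [
        f"  {name.ljust(width)} = {{{value}}}"
        for name, value in fields.items()
        if value
    ]
    return "@" + entry_type + "{" + key + ",\n" + ",\n".join(lines) + "\n}"

print(format_bibtex("Smith2024", "journal-article", {
    "author": "Smith, John and Doe, Jane",
    "title": "Novel Approach to Protein Folding",
    "journal": "Nature",
    "year": "2024",
}))
```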

From PMID

For biomedical literature:

# Single PMID
python scripts/extract_metadata.py --pmid 34265844

# Multiple PMIDs
python scripts/extract_metadata.py \
  --pmid 34265844 \
  --pmid 28445112 \
  --output refs.bib

Process:

  1. Query PubMed EFetch with PMID
  2. Parse XML response
  3. Extract metadata including MeSH terms
  4. Check for DOI in response
  5. If DOI exists, optionally query CrossRef for additional metadata
  6. Format as BibTeX

From arXiv ID

For preprints:

python scripts/extract_metadata.py --arxiv 2103.14030

Process:

  1. Query arXiv API with ID
  2. Parse Atom XML response
  3. Check for published version (DOI in response)
  4. If published: Use the DOI and CrossRef, and cite the published version
  5. If not published: Use preprint metadata
  6. Format as BibTeX (@misc with a preprint note when unpublished)

Important: Always check if preprint has been published!

From URL

When you only have URL:

python scripts/extract_metadata.py \
  --url "https://www.nature.com/articles/s41586-021-03819-2"

Process:

  1. Parse URL (host and path)
  2. Identify type (DOI, PMID, arXiv)
  3. Extract identifier from URL
  4. Query appropriate API
  5. Format as BibTeX

URL patterns:

# DOI URLs
https://doi.org/10.1038/nature12345
https://dx.doi.org/10.1126/science.abc123
https://www.nature.com/articles/s41586-021-03819-2

# PubMed URLs
https://pubmed.ncbi.nlm.nih.gov/34265844/
https://www.ncbi.nlm.nih.gov/pubmed/34265844

# arXiv URLs
https://arxiv.org/abs/2103.14030
https://arxiv.org/pdf/2103.14030.pdf
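Most of the URL patterns above can be recognized with a few regular expressions. The patterns and the name `identifier_from_url` below are illustrative assumptions. Publisher pages such as the nature.com URL do not carry the DOI prefix in the path, so this sketch returns `None` for them and leaves publisher-specific handling to the caller.

```python
import re

URL_PATTERNS = [
    ("doi", re.compile(r"^https?://(?:dx\.)?doi\.org/(10\.\d{4,9}/\S+)$", re.I)),
    ("pmid", re.compile(r"^https?://pubmed\.ncbi\.nlm\.nih\.gov/(\d+)/?$", re.I)),
    ("pmid", re.compile(r"^https?://www\.ncbi\.nlm\.nih\.gov/pubmed/(\d+)/?$", re.I)),
    ("arxiv", re.compile(
        r"^https?://arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$", re.I)),
]

def identifier_from_url(url: str):
    """Return (kind, identifier) for a recognized URL, else None."""
    for kind, pattern in URL_PATTERNS:
        match = pattern.match(url.strip())
        if match:
            return kind, match.group(1)
    return None

for u in ("https://doi.org/10.1038/nature12345",
          "https://pubmed.ncbi.nlm.nih.gov/34265844/",
          "https://arxiv.org/pdf/2103.14030.pdf"):
    print(identifier_from_url(u))
```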

Batch Processing

From file with mixed identifiers:

# Create file with one identifier per line
# identifiers.txt:
#   10.1038/nature12345
#   34265844
#   2103.14030
#   https://doi.org/10.1126/science.abc123

python scripts/extract_metadata.py \
  --input identifiers.txt \
  --output references.bib

Process:

  • Script auto-detects identifier type
  • Queries appropriate API
  • Combines all into single BibTeX file
  • Handles errors gracefully
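Auto-detection of the identifier type can be approximated with ordered pattern checks. This is a sketch under assumptions, not the logic of `extract_metadata.py`: the patterns, the labels, and the names `classify_identifier` and `classify_lines` are all illustrative.

```python
import re

# Order matters: URLs first, bare PMIDs last (any short digit string matches).
PATTERNS = [
    ("url", re.compile(r"^https?://", re.I)),
    ("doi", re.compile(r"^10\.\d{4,9}/\S+$")),
    ("arxiv", re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")),                # new-style ID
    ("arxiv", re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")),    # old-style ID
    ("pmcid", re.compile(r"^PMC\d+$")),
    ("pmid", re.compile(r"^\d{1,8}$")),
]

def classify_identifier(raw: str):
    """Return (kind, cleaned_identifier); kind is 'unknown' if nothing matches."""
    cleaned = re.sub(r"^(doi|arxiv|pmid):\s*", "", raw.strip(), flags=re.I)
    for kind, pattern in PATTERNS:
        if pattern.search(cleaned):
            return kind, cleaned
    return "unknown", cleaned

def classify_lines(text: str):
    """Classify one identifier per line, skipping blanks and '#' comments."""
    return [classify_identifier(line) for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]

sample = "10.1038/nature12345\n34265844\n2103.14030\nhttps://doi.org/10.1126/science.abc123\n"
print(classify_lines(sample))
```

Returning `unknown` instead of raising lets a batch run continue and report unrecognized lines at the end.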

Special Cases and Edge Cases

Preprints Later Published

Issue: Preprint cited, but journal version now available.

Solution:

  1. Check arXiv metadata for DOI field
  2. If DOI present, use published version
  3. Update citation to journal article
  4. Note preprint version in comments if needed

Example:

% Originally: arXiv:1602.03837
% Published as:
@article{Abbott2016,
  author  = {Abbott, B. P. and others},
  title   = {Observation of Gravitational Waves from a Binary Black Hole Merger},
  journal = {Physical Review Letters},
  year    = {2016},
  volume  = {116},
  number  = {6},
  pages   = {061102},
  doi     = {10.1103/PhysRevLett.116.061102}
}

Multiple Authors (et al.)

Issue: Many authors (10+).

BibTeX practice:

  • Include all authors if <10
  • Use “and others” for 10+
  • Or list all (journals vary)

Example:

@article{LargeCollaboration2024,
  author = {First, Author and Second, Author and Third, Author and others},
  ...
}

Author Name Variations

Issue: Authors publish under different name formats.

Standardization:

# Common variations
John Smith
John A. Smith
John Andrew Smith
J. A. Smith
Smith, J.
Smith, J. A.

# BibTeX format (recommended)
author = {Smith, John A.}

Extraction preference:

  1. Use full name if available
  2. Include middle initial if available
  3. Format: Last, First Middle
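The "Last, First Middle" preference can be applied mechanically to the simple variations listed above. The sketch below is deliberately naive: `to_bibtex_name` is a hypothetical helper, and it treats the final word as the surname, so multi-word surnames (for example "van der Berg") need to be supplied already in "Last, First" form.

```python
def to_bibtex_name(name: str) -> str:
    """Normalize 'First Middle Last' or 'Last, First' to 'Last, First Middle'."""
    name = " ".join(name.split())  # collapse repeated whitespace
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split(" ")
        last, first = parts[-1], " ".join(parts[:-1])
    return f"{last}, {first}" if first else last

for variant in ("John A. Smith", "J. A. Smith", "Smith, J. A."):
    print(to_bibtex_name(variant))
```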

No DOI Available

Issue: Older papers or books without DOIs.

Solutions:

  1. Use PMID if available (biomedical)
  2. Use ISBN for books
  3. Use URL to stable source
  4. Include full publication details

Example:

@article{OldPaper1995,
  author  = {Author, Name},
  title   = {Title Here},
  journal = {Journal Name},
  year    = {1995},
  volume  = {123},
  pages   = {45--67},
  url     = {https://stable-url-here},
  note    = {PMID: 12345678}
}

Conference Papers vs Journal Articles

Issue: Same work published in both.

Best practice:

  • Cite journal version if both available
  • Journal version is archival
  • Conference version for timeliness

If citing conference:

@inproceedings{Smith2024conf,
  author    = {Smith, John},
  title     = {Title},
  booktitle = {Proceedings of NeurIPS 2024},
  year      = {2024}
}

If citing journal:

@article{Smith2024journal,
  author  = {Smith, John},
  title   = {Title},
  journal = {Journal of Machine Learning Research},
  year    = {2024}
}

Book Chapters vs Edited Collections

Extract correctly:

  • Chapter: Use @incollection
  • Whole book: Use @book
  • Book editor: List in editor field
  • Chapter author: List in author field

Datasets and Software

Use @misc with appropriate fields:

@misc{DatasetName2024,
  author       = {Author, Name},
  title        = {Dataset Title},
  year         = {2024},
  howpublished = {Zenodo},
  doi          = {10.5281/zenodo.123456},
  note         = {Version 1.2}
}

Validation After Extraction

Always validate extracted metadata:

python scripts/validate_citations.py extracted_refs.bib

Check:

  • All required fields present
  • DOI resolves correctly
  • Author names formatted consistently
  • Year is reasonable (4 digits)
  • Journal/publisher names correct
  • Page ranges use -- (double hyphen), not a single -
  • Special characters handled properly
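A few of the checks above are easy to express in code. The sketch below is not the logic of validate_citations.py, which is not shown in this guide. The function `check_entry` and the simplified `REQUIRED_FIELDS` table are assumptions made for illustration, and entries are assumed to be already parsed into a plain dict.

```python
import re

# Simplified, assumed set of required fields per entry type.
REQUIRED_FIELDS = {
    "article": {"author", "title", "journal", "year"},
    "inproceedings": {"author", "title", "booktitle", "year"},
}


def check_entry(entry_type, fields):
    """Return a list of human-readable problems for one BibTeX entry."""
    problems = []
    missing = REQUIRED_FIELDS.get(entry_type, set()) - set(fields)
    for name in sorted(missing):
        problems.append(f"missing field: {name}")
    year = fields.get("year", "")
    if year and not re.fullmatch(r"\d{4}", year):
        problems.append(f"year is not 4 digits: {year}")
    pages = fields.get("pages", "")
    # A single hyphen between two page numbers should be a double hyphen.
    if re.fullmatch(r"\d+\s*-\s*\d+", pages):
        problems.append(f"page range should use --: {pages}")
    return problems


good = {"author": "Author, Name", "title": "Title Here",
        "journal": "Journal Name", "year": "1995", "pages": "45--67"}
bad = {"author": "Author, Name", "title": "Title Here",
       "year": "95", "pages": "45-67"}
print(check_entry("article", good))
print(check_entry("article", bad))
```

Checks that need the network, such as confirming that a DOI resolves, are deliberately left out of this sketch.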

Best Practices

1. Prefer DOI When Available

DOIs provide:

  • Permanent identifier
  • Best metadata source
  • Publisher-verified information
  • Resolvable link

2. Verify Automatically Extracted Metadata

Spot-check:

  • Author names match publication
  • Title matches (including capitalization)
  • Year is correct
  • Journal name is complete

3. Handle Special Characters

LaTeX special characters:

  • Protect capitalization: {AlphaFold}
  • Handle accents: M{\"u}ller or use Unicode
  • Chemical formulas: H$_2$O or \ce{H2O}
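Capitalization protection can be partly automated. The following sketch wraps any word that has a capital letter after its first character (such as AlphaFold or DNA) in braces. The function name `protect_capitals` is hypothetical, and the heuristic is an assumption: it will not catch proper nouns that are capitalized only on the first letter.

```python
import re


def protect_capitals(title):
    """Wrap words containing a capital after the first character in braces."""
    def wrap(match):
        word = match.group(0)
        if word.startswith("{"):
            return word  # already protected, leave untouched
        if any(c.isupper() for c in word[1:]):
            return "{" + word + "}"
        return word

    # Match existing {groups} first so they are skipped, then plain words.
    return re.sub(r"\{[^{}]*\}|[A-Za-z0-9]+", wrap, title)


print(protect_capitals("Highly Accurate Protein Structure Prediction with AlphaFold"))
```

Because groups that are already braced are skipped, running the function twice on the same title leaves it unchanged.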

4. Use Consistent Citation Keys

Convention: FirstAuthorYEARkeyword

Smith2024protein
Doe2023machine
Johnson2024cancer
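A key in this convention can be generated from the metadata itself. The sketch below is one possible approach under stated assumptions: the author is given in "Last, First" form, the keyword is the first title word that is not in a small hypothetical stopword list, and accents are stripped so the key stays ASCII. The name `make_key` is illustrative only.

```python
import re
import unicodedata

# Assumed stopword list; extend as needed.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "for", "and", "with", "to"}


def make_key(first_author, year, title):
    """Build a FirstAuthorYEARkeyword key, e.g. Smith2024protein."""
    last = first_author.split(",")[0].strip()
    # Strip accents and non-letters so the key stays plain ASCII.
    ascii_last = unicodedata.normalize("NFKD", last).encode("ascii", "ignore").decode()
    ascii_last = re.sub(r"[^A-Za-z]", "", ascii_last)
    words = re.findall(r"[A-Za-z]+", title.lower())
    keyword = next((w for w in words if w not in STOPWORDS), "")
    return f"{ascii_last}{year}{keyword}"


print(make_key("Smith, John A.", 2024, "Protein folding at scale"))
print(make_key("Müller, Anna", 2023, "The machine learning of X"))
```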

5. Include DOI for Modern Papers

Most papers published after ~2000 have a DOI; include it whenever one exists:

doi = {10.1038/nature12345}

6. Document Source

For non-standard sources, add note:

note = {Preprint, not peer-reviewed}
note = {Technical report}
note = {Dataset accompanying [citation]}

Summary

Metadata extraction workflow:

  1. Identify: Determine identifier type (DOI, PMID, arXiv, URL)
  2. Query: Use appropriate API (CrossRef, PubMed, arXiv)
  3. Extract: Parse response for required fields
  4. Format: Create properly formatted BibTeX entry
  5. Validate: Check completeness and accuracy
  6. Verify: Spot-check critical citations
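The first step of this workflow, identifying the identifier type, can be sketched with a few regular expressions. This is not the logic of extract_metadata.py. The function `identify` and its patterns are assumptions, and they cover only common forms (for example, old-style arXiv IDs such as hep-th/9901001 are not handled).

```python
import re


def identify(identifier):
    """Classify a citation identifier as doi, arxiv, pmid, url, or unknown."""
    s = identifier.strip()
    # DOI, optionally prefixed with "doi:" or a doi.org resolver URL.
    if re.match(r"^(https?://(dx\.)?doi\.org/|doi:)?10\.\d{4,9}/\S+$", s, re.I):
        return "doi"
    # New-style arXiv ID, e.g. 2103.14030 or arXiv:2103.14030v2.
    if re.match(r"^(arxiv:)?\d{4}\.\d{4,5}(v\d+)?$", s, re.I):
        return "arxiv"
    # PMID: a bare integer of up to 8 digits, optionally prefixed.
    if re.match(r"^(pmid:\s*)?\d{1,8}$", s, re.I):
        return "pmid"
    if re.match(r"^https?://", s, re.I):
        return "url"
    return "unknown"


for raw in ["10.1038/s41586-021-03819-2", "arXiv:2103.14030", "12345678"]:
    print(raw, "->", identify(raw))
```

The DOI check runs before the URL check so that doi.org links are routed to the DOI path rather than treated as generic URLs.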

Use scripts to automate:

  • extract_metadata.py: Universal extractor
  • doi_to_bibtex.py: Quick DOI conversion
  • validate_citations.py: Verify accuracy

Always validate extracted metadata before final submission!


PubMed Search Guide

Comprehensive guide to searching PubMed for biomedical and life sciences literature, including MeSH terms, field tags, advanced search strategies, and E-utilities API usage.

Overview

PubMed is the premier database for biomedical literature:

  • Coverage: 35+ million citations
  • Scope: Biomedical and life sciences
  • Sources: MEDLINE, life science journals, online books
  • Authority: Maintained by National Library of Medicine (NLM) / NCBI
  • Access: Free, no account required
  • Updates: Daily with new citations
  • Curation: High-quality metadata, MeSH indexing

Basic Search

Simple Keyword Search

PubMed automatically maps terms to MeSH and searches multiple fields:

diabetes
CRISPR gene editing
Alzheimer's disease treatment
cancer immunotherapy

Automatic Features:

  • Automatic MeSH mapping
  • Plural/singular variants
  • Abbreviation expansion
  • Spell checking

Exact Phrase Search

Use quotation marks for exact phrases:

"CRISPR-Cas9"
"systematic review"
"randomized controlled trial"
"machine learning"

MeSH (Medical Subject Headings)

What is MeSH?

MeSH is a controlled vocabulary thesaurus for indexing biomedical literature:

  • Hierarchical structure: Organized in tree structures
  • Consistent indexing: Same concept always tagged the same way
  • Comprehensive: Covers diseases, drugs, anatomy, techniques, etc.
  • Professional curation: NLM indexers assign MeSH terms

Finding MeSH Terms

MeSH Browser: https://meshb.nlm.nih.gov/search

Example:

Search: "heart attack"
MeSH term: "Myocardial Infarction"

In PubMed:

  1. Search with keyword
  2. Check “MeSH Terms” in left sidebar
  3. Select relevant MeSH terms
  4. Add to search

Using MeSH in Searches

Basic MeSH search:

"Diabetes Mellitus"[MeSH]
"CRISPR-Cas Systems"[MeSH]
"Alzheimer Disease"[MeSH]
"Neoplasms"[MeSH]

MeSH with subheadings:

"Diabetes Mellitus/drug therapy"[MeSH]
"Neoplasms/genetics"[MeSH]
"Heart Failure/prevention and control"[MeSH]

Common subheadings:

  • /drug therapy: Drug treatment
  • /diagnosis: Diagnostic aspects
  • /genetics: Genetic aspects
  • /epidemiology: Occurrence and distribution
  • /prevention and control: Prevention methods
  • /etiology: Causes
  • /surgery: Surgical treatment
  • /metabolism: Metabolic aspects

MeSH Explosion

By default, MeSH searches include narrower terms (explosion):

"Neoplasms"[MeSH]
# Includes: Breast Neoplasms, Lung Neoplasms, etc.

Disable explosion (exact term only):

"Neoplasms"[MeSH:NoExp]

MeSH Major Topic

Search only where MeSH term is a major focus:

"Diabetes Mellitus"[MeSH Major Topic]
# Only papers where diabetes is main topic

Field Tags

Field tags specify which part of the record to search.

Common Field Tags

Title and Abstract:

cancer[Title]                    # In title only
treatment[Title/Abstract]        # In title or abstract
"machine learning"[Title/Abstract]

Author:

"Smith J"[Author]
"Doudna JA"[Author]
"Collins FS"[Author]

Author - Full Name:

"Smith, John"[Full Author Name]

Journal:

"Nature"[Journal]
"Science"[Journal]
"New England Journal of Medicine"[Journal]
"Nat Commun"[Journal]           # Abbreviated form

Publication Date:

2023[Publication Date]
2020:2024[Publication Date]      # Date range
2023/01/01:2023/12/31[Publication Date]

Date Created:

2023[Date - Create]              # When added to PubMed

Publication Type:

"Review"[Publication Type]
"Clinical Trial"[Publication Type]
"Meta-Analysis"[Publication Type]
"Randomized Controlled Trial"[Publication Type]

Language:

English[Language]
French[Language]

DOI:

10.1038/nature12345[DOI]

PMID (PubMed ID):

12345678[PMID]

Article ID:

PMC1234567[PMC]                  # PubMed Central ID

Less Common But Useful Tags

humans[MeSH Terms]               # Only human studies
animals[MeSH Terms]              # Only animal studies
"United States"[Place of Publication]
nih[Grant Number]                # NIH-funded research
"Female"[MeSH Terms]             # Female subjects
"Aged, 80 and over"[MeSH Terms]  # Elderly subjects

Boolean Operators

Combine search terms with Boolean logic.

AND

Both terms must be present (default behavior):

diabetes AND treatment
"CRISPR-Cas9" AND "gene editing"
cancer AND immunotherapy AND "clinical trial"[Publication Type]

OR

Either term must be present:

"heart attack" OR "myocardial infarction"
diabetes OR "diabetes mellitus"
CRISPR OR Cas9 OR "gene editing"

Use case: Synonyms and related terms

NOT

Exclude terms:

cancer NOT review
diabetes NOT animal
"machine learning" NOT "deep learning"

Caution: May exclude relevant papers that mention both terms.

Combining Operators

Use parentheses for complex logic:

(diabetes OR "diabetes mellitus") AND (treatment OR therapy)

("CRISPR" OR "gene editing") AND ("therapeutic" OR "therapy") 
  AND 2020:2024[Publication Date]

(cancer OR neoplasm) AND (immunotherapy OR "immune checkpoint inhibitor") 
  AND ("clinical trial"[Publication Type] OR "randomized controlled trial"[Publication Type])

Advanced Search Builder

Access: https://pubmed.ncbi.nlm.nih.gov/advanced/

Features:

  • Visual query builder
  • Add multiple query boxes
  • Select field tags from dropdowns
  • Combine with AND/OR/NOT
  • Preview results
  • Shows final query string
  • Save queries

Workflow:

  1. Add search terms in separate boxes
  2. Select field tags
  3. Choose Boolean operators
  4. Preview results
  5. Refine as needed
  6. Copy final query string
  7. Use in scripts or save

Example built query:

#1: "Diabetes Mellitus, Type 2"[MeSH]
#2: "Metformin"[MeSH]
#3: "Clinical Trial"[Publication Type]
#4: 2020:2024[Publication Date]
#5: #1 AND #2 AND #3 AND #4

Filters and Limits

Article Types

"Review"[Publication Type]
"Systematic Review"[Publication Type]
"Meta-Analysis"[Publication Type]
"Clinical Trial"[Publication Type]
"Randomized Controlled Trial"[Publication Type]
"Case Reports"[Publication Type]
"Comparative Study"[Publication Type]

Species

humans[MeSH Terms]
mice[MeSH Terms]
rats[MeSH Terms]

Sex

"Female"[MeSH Terms]
"Male"[MeSH Terms]

Age Groups

"Infant"[MeSH Terms]
"Child"[MeSH Terms]
"Adolescent"[MeSH Terms]
"Adult"[MeSH Terms]
"Aged"[MeSH Terms]
"Aged, 80 and over"[MeSH Terms]

Text Availability

free full text[Filter]           # Free full-text available

Journal Categories

"Journal Article"[Publication Type]

E-utilities API

NCBI provides programmatic access via E-utilities (Entrez Programming Utilities).

Overview

Base URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/

Main Tools:

  • ESearch: Search and retrieve PMIDs
  • EFetch: Retrieve full records
  • ESummary: Retrieve document summaries
  • ELink: Find related articles
  • EInfo: Database statistics

An API key is optional but recommended:

  • Higher rate limits (10 requests/sec vs 3 requests/sec)
  • More reliable throughput for scripted access
  • Identification of your project to NCBI

Get API key: https://www.ncbi.nlm.nih.gov/account/

ESearch - Search PubMed

Retrieve PMIDs for a query.

Endpoint: /esearch.fcgi

Parameters:

  • db: Database (pubmed)
  • term: Search query
  • retmax: Maximum results (default 20, max 10000)
  • retstart: Starting position (for pagination)
  • sort: Sort order (relevance, pub_date, author)
  • api_key: Your API key (optional but recommended)

Example URL:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?
  db=pubmed&
  term=diabetes+AND+treatment&
  retmax=100&
  retmode=json&
  api_key=YOUR_API_KEY

Response:

{
  "esearchresult": {
    "count": "250000",
    "retmax": "100",
    "idlist": ["12345678", "12345679", ...]
  }
}
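The request and response above can be handled with the Python standard library alone. The sketch below builds the ESearch URL from the documented parameters and parses a response of the shape shown. The helper names `esearch_url` and `parse_esearch` are hypothetical, and no network call is made here.

```python
import json
from urllib.parse import urlencode

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(term, retmax=20, retstart=0, api_key=None):
    """Build an ESearch URL; the API key is added only when provided."""
    params = {"db": "pubmed", "term": term, "retmax": retmax,
              "retstart": retstart, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return BASE + "esearch.fcgi?" + urlencode(params)


def parse_esearch(body):
    """Return (total hit count, list of PMIDs) from an ESearch JSON body."""
    result = json.loads(body)["esearchresult"]
    return int(result["count"]), result["idlist"]


# Sample body mirroring the response shape shown above (placeholder PMIDs).
sample = ('{"esearchresult": {"count": "250000", "retmax": "100", '
          '"idlist": ["12345678", "12345679"]}}')

print(esearch_url("diabetes AND treatment", retmax=100))
print(parse_esearch(sample))
```

To page through a large result set, the same URL can be rebuilt with `retstart` increased by `retmax` on each request until `retstart` reaches the reported count.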

EFetch - Retrieve Records

Get full metadata for PMIDs.

Endpoint: /efetch.fcgi

Parameters:

  • db: Database (pubmed)
  • id: Comma-separated PMIDs
  • retmode: Format (xml or text; EFetch does not return JSON for PubMed)
  • rettype: Type (abstract, medline, uilist)
  • api_key: Your API key

Example URL:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
  db=pubmed&
  id=12345678,12345679&
  retmode=xml&
  api_key=YOUR_API_KEY

Response: XML with complete metadata including:

  • Title
  • Authors (with affiliations)
  • Abstract
  • Journal
  • Publication date
  • DOI
  • PMID, PMCID
  • MeSH terms
  • Keywords
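A few of these fields can be pulled out of the XML with `xml.etree.ElementTree`. The sketch below parses a trimmed, hand-written sample rather than a real response: the PMID, DOI, and names are placeholders, and real records contain many more elements. The function name `parse_articles` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the PubMed XML layout; all values are placeholders.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345678</PMID>
      <Article>
        <Journal><Title>Journal Name</Title></Journal>
        <ArticleTitle>Title Here</ArticleTitle>
        <AuthorList>
          <Author><LastName>Smith</LastName><ForeName>John A</ForeName></Author>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">12345678</ArticleId>
        <ArticleId IdType="doi">10.1000/example</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_articles(xml_text):
    """Extract PMID, title, journal, authors, and DOI from EFetch XML."""
    records = []
    for art in ET.fromstring(xml_text).iter("PubmedArticle"):
        authors = [
            f"{a.findtext('LastName')}, {a.findtext('ForeName')}"
            for a in art.findall(".//AuthorList/Author")
            if a.findtext("LastName") and a.findtext("ForeName")
        ]
        records.append({
            "pmid": art.findtext("MedlineCitation/PMID"),
            "title": art.findtext(".//ArticleTitle"),
            "journal": art.findtext(".//Journal/Title"),
            "authors": authors,
            "doi": art.findtext(".//ArticleId[@IdType='doi']"),
        })
    return records


print(parse_articles(SAMPLE))
```

Titles that contain inline markup (italics, subscripts) would need extra handling, since `findtext` returns only the text before the first child element.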

ESummary - Get Summaries

Lighter-weight alternative to EFetch.

Example:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?
  db=pubmed&
  id=12345678&
  retmode=json&
  api_key=YOUR_API_KEY

Returns: Key metadata without full abstract and details.

ELink - Find Related Articles

Find related articles or links to other databases.

Example:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?
  dbfrom=pubmed&
  db=pubmed&
  id=12345678&
  linkname=pubmed_pubmed_citedin

Link types:

  • pubmed_pubmed: Related articles
  • pubmed_pubmed_citedin: Papers citing this article
  • pubmed_pmc: PMC full-text versions
  • pubmed_protein: Related protein records

Rate Limiting

Without API key:

  • 3 requests per second
  • Requests may be blocked if the limit is exceeded

With API key:

  • 10 requests per second
  • Better for programmatic access

Best practice:

import time
time.sleep(0.34)  # ~3 requests/second
# or
time.sleep(0.11)  # ~10 requests/second with API key
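A fixed sleep always waits the full interval, even when the previous request was itself slow. A small throttle that sleeps only for the remaining time is one alternative. The class below is a sketch: the name `Throttle` and its injectable `clock` and `sleep` arguments are assumptions added to make the logic testable.

```python
import time


class Throttle:
    """Space out calls so that at most `rate` happen per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(rate=3)  # use rate=10 with an API key
for pmid in ["12345678", "12345679"]:
    throttle.wait()
    # the E-utilities request for `pmid` would go here
```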

API Key Usage

Get API key:

  1. Create NCBI account: https://www.ncbi.nlm.nih.gov/account/
  2. Settings → API Key Management
  3. Create new API key
  4. Copy key

Use in requests:

&api_key=YOUR_API_KEY_HERE

Store securely:

# In environment variable
export NCBI_API_KEY="your_key_here"

# In script
import os
api_key = os.getenv('NCBI_API_KEY')

Search Strategies

For systematic reviews and meta-analyses:

# 1. Identify key concepts
Concept 1: Diabetes
Concept 2: Treatment
Concept 3: Outcomes

# 2. Find MeSH terms and synonyms
Concept 1: "Diabetes Mellitus"[MeSH] OR diabetes OR diabetic
Concept 2: "Drug Therapy"[MeSH] OR treatment OR therapy OR medication
Concept 3: "Treatment Outcome"[MeSH] OR outcome OR efficacy OR effectiveness

# 3. Combine with AND
("Diabetes Mellitus"[MeSH] OR diabetes OR diabetic) 
  AND ("Drug Therapy"[MeSH] OR treatment OR therapy OR medication)
  AND ("Treatment Outcome"[MeSH] OR outcome OR efficacy OR effectiveness)

# 4. Add filters
AND 2015:2024[Publication Date]
AND ("Clinical Trial"[Publication Type] OR "Randomized Controlled Trial"[Publication Type])
AND English[Language]
AND humans[MeSH Terms]
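The concept-block pattern above (OR within a concept, AND between concepts, then filters) is mechanical enough to generate. The helpers below are a minimal sketch; `or_block` and `build_query` are hypothetical names, and terms are passed through verbatim, so quoting and field tags remain the caller's responsibility.

```python
def or_block(terms):
    """Join synonyms for one concept with OR, wrapped in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(concepts, filters=()):
    """AND together one OR-block per concept, followed by any filters."""
    parts = [or_block(terms) for terms in concepts]
    parts.extend(filters)
    return " AND ".join(parts)


concepts = [
    ['"Diabetes Mellitus"[MeSH]', "diabetes", "diabetic"],
    ['"Drug Therapy"[MeSH]', "treatment", "therapy", "medication"],
]
filters = ["2015:2024[Publication Date]", "English[Language]"]
print(build_query(concepts, filters))
```

Keeping concepts as lists makes it easy to record exactly which synonyms were used when documenting the search strategy.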

Finding Clinical Trials

# Specific disease + clinical trials
"Alzheimer Disease"[MeSH] 
  AND ("Clinical Trial"[Publication Type] 
       OR "Randomized Controlled Trial"[Publication Type])
  AND 2020:2024[Publication Date]

# Specific drug trials
"Metformin"[MeSH] 
  AND "Diabetes Mellitus, Type 2"[MeSH]
  AND "Randomized Controlled Trial"[Publication Type]

Finding Reviews

# Systematic reviews on topic
"CRISPR-Cas Systems"[MeSH] 
  AND ("Systematic Review"[Publication Type] OR "Meta-Analysis"[Publication Type])

# Reviews in high-impact journals
cancer immunotherapy 
  AND "Review"[Publication Type]
  AND ("Nature"[Journal] OR "Science"[Journal] OR "Cell"[Journal])

Finding Recent Papers

# Papers from last year
"machine learning"[Title/Abstract] 
  AND "drug discovery"[Title/Abstract]
  AND 2024[Publication Date]

# Recent papers in specific journal
"CRISPR"[Title/Abstract] 
  AND "Nature"[Journal]
  AND 2023:2024[Publication Date]

Author Tracking

# Specific author's recent work
"Doudna JA"[Author] AND 2020:2024[Publication Date]

# Author + topic
"Church GM"[Author] AND "synthetic biology"[Title/Abstract]

High-Quality Evidence

# Meta-analyses and systematic reviews
(diabetes OR "diabetes mellitus") 
  AND (treatment OR therapy)
  AND ("Meta-Analysis"[Publication Type] OR "Systematic Review"[Publication Type])

# RCTs only
cancer immunotherapy 
  AND "Randomized Controlled Trial"[Publication Type]
  AND 2020:2024[Publication Date]

Script Integration

search_pubmed.py Usage

Basic search:

python scripts/search_pubmed.py "diabetes treatment"

With MeSH terms:

python scripts/search_pubmed.py \
  --query '"Diabetes Mellitus"[MeSH] AND "Drug Therapy"[MeSH]'

Date range filter:

python scripts/search_pubmed.py "CRISPR" \
  --date-start 2020-01-01 \
  --date-end 2024-12-31 \
  --limit 200

Publication type filter:

python scripts/search_pubmed.py "cancer immunotherapy" \
  --publication-types "Clinical Trial,Randomized Controlled Trial" \
  --limit 100

Export to BibTeX:

python scripts/search_pubmed.py "Alzheimer's disease" \
  --limit 100 \
  --format bibtex \
  --output alzheimers.bib

Complex query from file:

# Save complex query in query.txt
cat > query.txt << 'EOF'
("Diabetes Mellitus, Type 2"[MeSH] OR "diabetes"[Title/Abstract])
AND ("Metformin"[MeSH] OR "metformin"[Title/Abstract])
AND "Randomized Controlled Trial"[Publication Type]
AND 2015:2024[Publication Date]
AND English[Language]
EOF

# Run search
python scripts/search_pubmed.py --query-file query.txt --limit 500

Batch Searches

# Search multiple topics
TOPICS=("diabetes treatment" "cancer immunotherapy" "CRISPR gene editing")

for topic in "${TOPICS[@]}"; do
  python scripts/search_pubmed.py "$topic" \
    --limit 100 \
    --output "${topic// /_}.json"
  sleep 1
done

Extract Metadata

# Search returns PMIDs
python scripts/search_pubmed.py "topic" --output results.json

# Extract full metadata
python scripts/extract_metadata.py \
  --input results.json \
  --output references.bib

Tips and Best Practices

Search Construction

  1. Start with MeSH terms:

    • Use MeSH Browser to find correct terms
    • More precise than keyword search
    • Captures indexed papers on a topic regardless of the terminology authors used
  2. Include text word variants:

    # Better coverage
    ("Diabetes Mellitus"[MeSH] OR diabetes OR diabetic)
  3. Use field tags appropriately:

    • [MeSH] for standardized concepts
    • [Title/Abstract] for specific terms
    • [Author] for known authors
    • [Journal] for specific venues
  4. Build incrementally:

    # Step 1: Basic search
    diabetes
    
    # Step 2: Add specificity
    "Diabetes Mellitus, Type 2"[MeSH]
    
    # Step 3: Add treatment
    "Diabetes Mellitus, Type 2"[MeSH] AND "Metformin"[MeSH]
    
    # Step 4: Add study type
    "Diabetes Mellitus, Type 2"[MeSH] AND "Metformin"[MeSH] 
      AND "Clinical Trial"[Publication Type]
    
    # Step 5: Add date range
    ... AND 2020:2024[Publication Date]

Optimizing Results

  1. Too many results: Add filters

    • Restrict publication type
    • Narrow date range
    • Add more specific MeSH terms
    • Use Major Topic: [MeSH Major Topic]
  2. Too few results: Broaden search

    • Remove restrictive filters
    • Use OR for synonyms
    • Expand date range
    • Use MeSH explosion (default)
  3. Irrelevant results: Refine terms

    • Use more specific MeSH terms
    • Add exclusions with NOT
    • Use Title field instead of all fields
    • Add MeSH subheadings

Quality Control

  1. Document search strategy:

    • Save exact query string
    • Record search date
    • Note number of results
    • Save filters used
  2. Export systematically:

    • Use consistent file naming
    • Export to JSON for flexibility
    • Convert to BibTeX as needed
    • Keep original search results
  3. Validate retrieved citations:

    python scripts/validate_citations.py pubmed_results.bib
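Documenting the search strategy can be as simple as writing one JSON record per search. The sketch below captures the four items listed above. The function `search_record` and its field names are assumptions, and the result count shown is a placeholder rather than a real figure.

```python
import json
from datetime import date


def search_record(query, result_count, filters=(), searched_on=None):
    """Bundle the details needed to reproduce a search later."""
    return {
        "query": query,
        "searched_on": (searched_on or date.today()).isoformat(),
        "result_count": result_count,
        "filters": list(filters),
    }


record = search_record(
    '"Diabetes Mellitus, Type 2"[MeSH] AND "Metformin"[MeSH]',
    result_count=100,  # placeholder count
    filters=["English[Language]"],
    searched_on=date(2024, 5, 1),
)
print(json.dumps(record, indent=2))
```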

Staying Current

  1. Set up search alerts:

    • PubMed → Save search
    • Receive email updates
    • Daily, weekly, or monthly
  2. Track specific journals:

    "Nature"[Journal] AND CRISPR[Title]
  3. Follow key authors:

    "Church GM"[Author]

Common Issues and Solutions

Issue: MeSH Term Not Found

Solution:

  • Check spelling
  • Use MeSH Browser
  • Try related terms
  • Use text word search as fallback

Issue: Zero Results

Solution:

  • Remove filters
  • Check query syntax
  • Use OR for broader search
  • Try synonyms

Issue: Poor Quality Results

Solution:

  • Add publication type filters
  • Restrict to recent years
  • Use MeSH Major Topic
  • Filter by journal quality

Issue: Duplicates from Different Sources

Solution:

python scripts/format_bibtex.py results.bib \
  --deduplicate \
  --output clean.bib
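For intuition, deduplication usually comes down to choosing a comparison key. The sketch below is not the implementation of format_bibtex.py, which is not shown here. It assumes entries are dicts, prefers the DOI as the key, and falls back to a normalized title plus year. The names `dedupe_key` and `deduplicate` are hypothetical.

```python
import re


def dedupe_key(entry):
    """Prefer the DOI; otherwise fall back to normalized title and year."""
    doi = entry.get("doi", "").strip().lower()
    if doi:
        return ("doi", doi)
    title = re.sub(r"[^a-z0-9]+", " ", entry.get("title", "").lower()).strip()
    return ("title", title, entry.get("year", ""))


def deduplicate(entries):
    """Keep the first entry seen for each key, preserving input order."""
    seen = set()
    unique = []
    for entry in entries:
        key = dedupe_key(entry)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


entries = [
    {"title": "Title Here", "year": "2024", "doi": "10.1000/Example"},
    {"title": "Title here.", "year": "2024", "doi": "10.1000/example"},
    {"title": "Another Title", "year": "1995"},
    {"title": "Another  title!", "year": "1995"},
]
print(len(deduplicate(entries)))
```

One limitation of this keying: a record that carries a DOI and a copy of the same paper without one will not be matched, so a title-based second pass may still be needed.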

Issue: API Rate Limiting

Solution:

  • Get API key (increases limit to 10/sec)
  • Add delays in scripts
  • Process in batches
  • Use off-peak hours

Summary

PubMed provides authoritative biomedical literature search:

  • Curated content: MeSH indexing, quality control
  • Precise search: Field tags, MeSH terms, filters
  • Programmatic access: E-utilities API
  • Free access: No subscription required
  • Comprehensive: 35M+ citations, daily updates

Key strategies:

  • Use MeSH terms for precise searching
  • Combine with text words for comprehensive coverage
  • Apply appropriate field tags
  • Filter by publication type and date
  • Use E-utilities API for automation
  • Document search strategy for reproducibility

For broader coverage across disciplines, complement with Google Scholar.
