SDS Web Search via Perplexity API - Implementation Plan
Overview
Add a feature to search the web for Safety Data Sheets (SDS) using the Perplexity Search API when no SDS is found in our internal library. This lets users find and attach SDS documents from manufacturer websites and public SDS databases.
User Flow
Product Details Screen → SDS Missing
↓
"Search for SDS" button clicked
↓
Step 1: Search INTERNAL library first (company + global SDS)
↓
If found → Display results → User attaches SDS → Done
↓
If NOT found → Show "Search Web for SDS" option
↓
User clicks "Search Web for SDS"
↓
Step 2: Check cache table (chemiq_sds_web_search_cache)
↓
If cached results exist (< 7 days old) → Return cached results
↓
If NO cache → Call Perplexity API → Save results to cache
↓
Display results with PDF links, source URLs, titles
↓
User previews/selects an SDS
↓
System downloads PDF and uploads to our S3 → creates SDS record
↓
SDS attached to product
↓
Clean up cache for this product (optional - results served their purpose)
↓
Parse job created in chemiq_sds_parse_queue
↓
Background service picks up job → LLM parses SDS
↓
Hazard info, composition, sections populated
Cost Optimization Strategy
1. Internal Search First
Always search internal library before calling Perplexity:
- Company-mapped SDS documents
- Global SDS repository
- Only show "Search Web" if internal search returns 0 results
2. Web Search Result Caching
Cache Perplexity results to avoid repeated API calls:
- Cache Key: company_id + product_name + manufacturer (normalized)
- Cache TTL: 7 days (SDS sources don't change frequently)
- Cleanup: Delete cache entry when user imports an SDS
- Table: chemiq_sds_web_search_cache
3. Cache Benefits
- User returns to product screen → Shows cached web results instantly
- User searches same product multiple times → No additional API cost
- Multiple users in same company search same product → Share cache
Perplexity Search API Overview
Endpoint: POST https://api.perplexity.ai/search
Authentication: Authorization: Bearer <PERPLEXITY_API_KEY>
Key Parameters:
| Parameter | Type | Description |
|---|---|---|
| query | string | Search query (e.g., "Clorox Disinfecting Wipes SDS PDF safety data sheet") |
| max_results | int (1-20) | Number of results |
| search_domain_filter | string[] | Limit to known SDS databases |
| country | string | Geographic filter (US) |
Response:
{
"results": [
{
"title": "Clorox Disinfecting Wipes Safety Data Sheet",
"url": "https://www.thecloroxcompany.com/wp-content/uploads/2024/sds-wipes.pdf",
"snippet": "SAFETY DATA SHEET - Product: Clorox Disinfecting Wipes...",
"date": "2024-01-15"
}
]
}
Pricing: Per-request (no token-based pricing)
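For reference, a minimal request sketch against the endpoint described above. httpx is an assumption (any async HTTP client works), and the example query and domain list are illustrative; field names follow the parameter table and response example.

import asyncio
import os

import httpx

async def perplexity_search_example() -> None:
    # Bearer token from the environment (PERPLEXITY_API_KEY per Phase 2)
    headers = {"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"}
    payload = {
        "query": "Clorox Disinfecting Wipes SDS PDF safety data sheet",
        "max_results": 10,
        "country": "US",
        "search_domain_filter": ["msdsonline.com", "chemicalsafety.com"],
    }
    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.post(
            "https://api.perplexity.ai/search", json=payload, headers=headers
        )
        response.raise_for_status()
        for result in response.json().get("results", []):
            # Each result carries title, url, snippet, date (see the response example above)
            print(result["title"], result["url"])

if __name__ == "__main__":
    asyncio.run(perplexity_search_example())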
Implementation Steps
Phase 1: Backend - Perplexity Integration Service
1.1 Create Perplexity Client Service
File: tellus-ehs-hazcom-service/app/services/external/perplexity_client.py
from typing import List, Optional

class PerplexityClient:
    """Client for the Perplexity Search API."""

    BASE_URL = "https://api.perplexity.ai/search"

    # Known SDS databases to prioritize
    SDS_DOMAINS = [
        "msdsonline.com",
        "chemicalsafety.com",
        "sds.chemtel.net",
        "ehs.stanford.edu",
        "msdsdigital.com",
        "sdsmanager.com",
        "hazard.com",
        # Manufacturer sites often have SDSs
    ]

    async def search_sds(
        self,
        product_name: str,
        manufacturer: str,
        barcode: Optional[str] = None,
        max_results: int = 10
    ) -> List[WebSDSResult]:
        """Search for SDS documents on the web."""
        # Build optimized search query
        query = self._build_sds_query(product_name, manufacturer, barcode)
        payload = {
            "query": query,
            "max_results": max_results,
            "country": "US",
            "search_domain_filter": self.SDS_DOMAINS,  # Optional: focus on SDS sites
        }
        response = await self._make_request(payload)
        return self._parse_results(response)

    def _build_sds_query(self, product_name: str, manufacturer: str, barcode: Optional[str]) -> str:
        """Build an optimized search query for SDS lookups."""
        parts = [product_name, manufacturer, "SDS", "safety data sheet", "PDF"]
        if barcode:
            parts.insert(0, barcode)
        return " ".join(parts)
1.2 Create Web SDS Search Service
File: tellus-ehs-hazcom-service/app/services/chemiq/web_sds_search_service.py
class WebSDSSearchService:
    """Service for searching SDS documents on the web"""

    async def search_web_sds(
        self,
        product_name: str,
        manufacturer: str,
        barcode: Optional[str] = None
    ) -> List[WebSDSSearchResult]:
        """
        Search for SDS on the web via Perplexity
        Returns list of potential SDS documents with URLs
        """

    async def import_sds_from_url(
        self,
        url: str,
        product_name: str,
        manufacturer: str,
        company_id: UUID,
        user_id: UUID
    ) -> SDSDocument:
        """
        Download SDS PDF from URL and import into our system
        1. Download PDF from URL
        2. Validate it's a PDF
        3. Extract basic metadata
        4. Upload to S3
        5. Create SDS record
        6. Queue for parsing
        """
1.3 Add API Endpoint
File: tellus-ehs-hazcom-service/app/api/v1/chemiq/sds.py
@router.post("/search-web", response_model=WebSDSSearchResponse)
async def search_web_for_sds(
request: WebSDSSearchRequest,
ctx: UserContext = Depends(get_user_context),
db: Session = Depends(get_db)
):
"""
Search the web for SDS documents using Perplexity API
- Searches public SDS databases and manufacturer sites
- Returns URLs to potential SDS PDFs
- User can preview and select to import
"""
@router.post("/import-from-url", response_model=SDSDocumentResponse)
async def import_sds_from_url(
request: ImportSDSFromURLRequest,
ctx: UserContext = Depends(get_user_context),
db: Session = Depends(get_db)
):
"""
Import an SDS document from a URL
- Downloads PDF from provided URL
- Validates and deduplicates
- Uploads to S3 and creates SDS record
- Attaches to product/inventory
"""
1.4 Create Schemas
File: tellus-ehs-hazcom-service/app/schemas/chemiq/web_sds_search.py
from datetime import date
from typing import List, Optional
from uuid import UUID

from pydantic import BaseModel

class WebSDSSearchRequest(BaseModel):
    product_name: str
    manufacturer: str
    barcode_upc: Optional[str] = None

class WebSDSSearchResult(BaseModel):
    title: str
    url: str
    snippet: str
    source_domain: str
    date: Optional[str] = None
    is_pdf: bool  # True if URL ends with .pdf
    confidence_score: float  # Our calculated relevance

class WebSDSSearchResponse(BaseModel):
    results: List[WebSDSSearchResult]
    total: int
    search_query_used: str

class ImportSDSFromURLRequest(BaseModel):
    url: str
    product_name: str
    manufacturer: str
    revision_date: Optional[date] = None
    attach_to_chemical_id: Optional[UUID] = None
    attach_to_company_product_id: Optional[UUID] = None
Phase 2: Configuration & Environment
2.1 Add Environment Variables
File: .env
# Perplexity API
PERPLEXITY_API_KEY=pplx-xxxxxxxxxxxx
PERPLEXITY_ENABLED=true
2.2 Update Config
File: tellus-ehs-hazcom-service/app/core/config.py
# Perplexity API (for web SDS search)
PERPLEXITY_API_KEY: Optional[str] = None
PERPLEXITY_ENABLED: bool = False
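A sketch of how these settings could sit in the existing Settings class, assuming config.py uses pydantic-settings; the class name and the feature-guard comment are illustrative.

from typing import Optional

from pydantic_settings import BaseSettings  # on pydantic v1, BaseSettings comes from pydantic

class Settings(BaseSettings):
    # ...existing settings...

    # Perplexity API (for web SDS search)
    PERPLEXITY_API_KEY: Optional[str] = None
    PERPLEXITY_ENABLED: bool = False

settings = Settings()

# Feature guard used by the API layer / service:
# enabled = settings.PERPLEXITY_ENABLED and settings.PERPLEXITY_API_KEY is not None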
Phase 3: Frontend - Web SDS Search UI
3.1 Add API Function
File: tellus-ehs-hazcom-ui/src/services/api/chemiq.api.ts
export interface WebSDSSearchResult {
  title: string;
  url: string;
  snippet: string;
  source_domain: string;
  date?: string;
  is_pdf: boolean;
  confidence_score: number;
}

export interface WebSDSSearchResponse {
  results: WebSDSSearchResult[];
  total: number;
  search_query_used: string;
}

export async function searchWebForSDS(
  token: string,
  userId: string,
  companyId: string,
  productName: string,
  manufacturer: string,
  barcodeUpc?: string
): Promise<WebSDSSearchResponse> {
  return apiClient.post<WebSDSSearchResponse>(
    '/api/v1/chemiq/sds/search-web',
    { product_name: productName, manufacturer, barcode_upc: barcodeUpc },
    { headers: { ...authHeaders(token, userId, companyId) } }
  );
}

export async function importSDSFromURL(
  token: string,
  userId: string,
  companyId: string,
  request: ImportSDSFromURLRequest
): Promise<SDSDocumentResponse> {
  return apiClient.post<SDSDocumentResponse>(
    '/api/v1/chemiq/sds/import-from-url',
    request,
    { headers: { ...authHeaders(token, userId, companyId) } }
  );
}
3.2 Create WebSDSSearchModal Component
File: tellus-ehs-hazcom-ui/src/pages/chemiq/inventory/components/WebSDSSearchModal.tsx
interface WebSDSSearchModalProps {
  isOpen: boolean;
  onClose: () => void;
  productName: string;
  manufacturer: string;
  barcodeUpc?: string;
  chemicalId?: string;
  companyProductId?: string;
  onSDSImported: (sdsId: string) => void;
}

export const WebSDSSearchModal: React.FC<WebSDSSearchModalProps> = ({...}) => {
  // State for search results, loading, selected result
  // Display:
  // 1. Search status and query used
  // 2. List of results with:
  //    - Title (linked to URL)
  //    - Source domain badge
  //    - Snippet preview
  //    - PDF indicator
  //    - Confidence score
  //    - "Preview" button (opens URL in new tab)
  //    - "Import & Attach" button
  // 3. Import progress when user selects one
};
3.3 Update SDSInfoCard
File: tellus-ehs-hazcom-ui/src/pages/chemiq/inventory/components/SDSInfoCard.tsx
Add "Search Web for SDS" button in the SDS Missing state:
// In SDS Missing state section
<div className="flex items-center justify-center gap-3">
  <button className="btn-secondary px-4 py-2 flex items-center gap-2">
    <Search className="w-4 h-4" />
    Search Library
  </button>
  <button
    onClick={() => setShowWebSearchModal(true)}
    className="btn-secondary px-4 py-2 flex items-center gap-2"
  >
    <Globe className="w-4 h-4" />
    Search Web for SDS
  </button>
  <button className="btn-primary px-4 py-2 flex items-center gap-2">
    <Upload className="w-4 h-4" />
    Upload SDS
  </button>
</div>

{/* Web Search Modal */}
<WebSDSSearchModal
  isOpen={showWebSearchModal}
  onClose={() => setShowWebSearchModal(false)}
  productName={chemical.product_name}
  manufacturer={chemical.manufacturer}
  barcodeUpc={chemical.barcode_upc}
  chemicalId={chemical.chemical_id}
  onSDSImported={handleSDSImported}
/>
3.4 Update SDSSearchSection (Add Web Search Tab)
File: tellus-ehs-hazcom-ui/src/pages/chemiq/inventory/components/SDSSearchSection.tsx
Add a second tab or section for "Web Search" when internal search returns no results:
{searchResults.length === 0 && hasSearched && (
  <div className="mt-4 p-4 bg-blue-50 rounded-lg">
    <p className="text-sm text-blue-800 mb-3">
      No SDS found in library. Try searching the web:
    </p>
    <button
      onClick={() => setShowWebSearch(true)}
      className="btn-secondary flex items-center gap-2"
    >
      <Globe className="w-4 h-4" />
      Search Web for SDS
    </button>
  </div>
)}
Phase 4: PDF Download & Import Logic
4.1 PDF Download Utility
File: tellus-ehs-hazcom-service/app/utils/pdf_downloader.py
from typing import Tuple

async def download_pdf_from_url(
    url: str,
    max_size_mb: int = 20,
    timeout_seconds: int = 30
) -> Tuple[bytes, str, int]:
    """
    Download PDF from URL
    Returns: (pdf_bytes, content_type, file_size)
    Raises: HTTPException on validation failure
    """
    # 1. Validate URL format
    # 2. Make HEAD request to check content type and size
    # 3. Download with size limit
    # 4. Validate it's actually a PDF (check magic bytes)
    # 5. Return bytes
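A sketch of the utility implementing the five steps above, assuming httpx; the error responses and streaming size check are illustrative, not final behavior.

import httpx
from fastapi import HTTPException

async def download_pdf_from_url(
    url: str,
    max_size_mb: int = 20,
    timeout_seconds: int = 30,
) -> Tuple[bytes, str, int]:
    # 1. Validate URL format (HTTPS only, per Security Considerations)
    if not url.lower().startswith("https://"):
        raise HTTPException(status_code=400, detail="Only HTTPS URLs are allowed")

    max_bytes = max_size_mb * 1024 * 1024
    async with httpx.AsyncClient(timeout=timeout_seconds, follow_redirects=True) as client:
        # 2. HEAD request to check declared content type / size (servers may omit these)
        head = await client.head(url)
        declared_size = int(head.headers.get("content-length", 0))
        if declared_size > max_bytes:
            raise HTTPException(status_code=400, detail="File exceeds size limit")

        # 3. Download with a hard size limit while streaming
        chunks: list[bytes] = []
        total = 0
        async with client.stream("GET", url) as response:
            response.raise_for_status()
            content_type = response.headers.get("content-type", "application/octet-stream")
            async for chunk in response.aiter_bytes():
                total += len(chunk)
                if total > max_bytes:
                    raise HTTPException(status_code=400, detail="File exceeds size limit")
                chunks.append(chunk)

    pdf_bytes = b"".join(chunks)

    # 4. Validate it's actually a PDF (magic bytes)
    if not pdf_bytes.startswith(b"%PDF-"):
        raise HTTPException(status_code=400, detail="URL did not return a PDF document")

    # 5. Return bytes, content type, and actual size
    return pdf_bytes, content_type, total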
4.2 SDS Import Flow
URL → Download PDF → Validate PDF → Calculate SHA256 hash
→ Check for duplicates (by hash)
→ If duplicate: return existing SDS
→ If new: Upload to S3 → Create SDS record
→ Create company mapping
→ Attach to chemical/product (if specified)
→ Create parse job in chemiq_sds_parse_queue (priority=8, high)
→ Background service parses → populates hazard_info, composition, sections
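A sketch of this flow as a service method (shown as it would sit on WebSDSSearchService). The SDSDocument field names, the storage helper, the dedup query, and the explicit db parameter are assumptions about existing models and wiring; only the overall sequence comes from the plan.

import hashlib

async def import_sds_from_url(self, url, product_name, manufacturer, company_id, user_id, db):
    # 1-2. Download and validate the PDF (see 4.1)
    pdf_bytes, content_type, file_size = await download_pdf_from_url(url)

    # 3. Deduplicate by content hash
    sha256 = hashlib.sha256(pdf_bytes).hexdigest()
    existing = db.query(SDSDocument).filter(SDSDocument.file_hash == sha256).first()
    if existing:
        return existing

    # 4. Upload to S3 (storage helper and key layout are assumptions)
    s3_key = f"sds/{company_id}/{sha256}.pdf"
    await self.storage.upload(key=s3_key, data=pdf_bytes, content_type=content_type)

    # 5. Create the SDS record (field names are assumptions about the existing model)
    sds = SDSDocument(
        product_name=product_name,
        manufacturer=manufacturer,
        file_hash=sha256,
        file_size=file_size,
        s3_key=s3_key,
        uploaded_by_user_id=user_id,
    )
    db.add(sds)
    db.flush()  # assumes sds_id is populated on flush

    # 6. Queue for background parsing (see 4.3; that helper commits)
    await self._queue_for_parsing(sds.sds_id, db)
    return sds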
4.3 Create Parse Job After Import
File: tellus-ehs-hazcom-service/app/services/chemiq/web_sds_search_service.py
async def _queue_for_parsing(self, sds_id: UUID, db: Session) -> None:
    """
    Queue the imported SDS for background parsing.
    Creates a job in chemiq_sds_parse_queue with high priority
    since this is a user-initiated import.
    """
    from datetime import datetime, timezone
    from uuid import uuid4

    parse_job = SDSParseJob(
        job_id=uuid4(),
        sds_id=sds_id,
        job_status='pending',
        priority=8,  # High priority for user-initiated imports
        parse_sections=list(range(1, 17)),  # All 16 sections
        retry_count=0,
        created_at=datetime.now(timezone.utc)
    )
    db.add(parse_job)
    db.commit()
Phase 5: Web Search Cache Table
5.1 Create Cache Table
Migration: add_chemiq_sds_web_search_cache.py
CREATE TABLE chemiq_sds_web_search_cache (
    cache_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    company_id UUID NOT NULL REFERENCES core_data_companies(company_id) ON DELETE CASCADE,

    -- Search criteria (normalized for matching)
    product_name_normalized VARCHAR(255) NOT NULL,
    manufacturer_normalized VARCHAR(255) NOT NULL,
    barcode_upc VARCHAR(100),

    -- Search metadata
    search_query_used TEXT NOT NULL,
    searched_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),

    -- Cached results (JSONB array)
    results JSONB NOT NULL,
    result_count INTEGER NOT NULL DEFAULT 0,

    -- Tracking
    created_by_user_id UUID REFERENCES core_data_users(user_id),
    times_accessed INTEGER NOT NULL DEFAULT 1,
    last_accessed_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),

    -- Cleanup tracking
    sds_imported BOOLEAN NOT NULL DEFAULT FALSE,
    imported_sds_id UUID REFERENCES chemiq_sds_documents(sds_id),

    UNIQUE(company_id, product_name_normalized, manufacturer_normalized)
);

CREATE INDEX idx_sds_web_cache_company ON chemiq_sds_web_search_cache(company_id);
CREATE INDEX idx_sds_web_cache_searched_at ON chemiq_sds_web_search_cache(searched_at);
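A sketch of the matching SQLAlchemy model for app/db/models/chemiq_sds_web_cache.py, assuming the service uses a declarative Base; the class name SDSWebSearchCache is a placeholder. The index=True flags correspond to the two CREATE INDEX statements above.

import uuid

from sqlalchemy import (Boolean, Column, DateTime, ForeignKey, Integer,
                        String, Text, UniqueConstraint, func)
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import declarative_base

Base = declarative_base()  # in practice, reuse the service's existing Base

class SDSWebSearchCache(Base):
    __tablename__ = "chemiq_sds_web_search_cache"
    __table_args__ = (
        UniqueConstraint("company_id", "product_name_normalized", "manufacturer_normalized"),
    )

    cache_id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    company_id = Column(UUID(as_uuid=True),
                        ForeignKey("core_data_companies.company_id", ondelete="CASCADE"),
                        nullable=False, index=True)

    product_name_normalized = Column(String(255), nullable=False)
    manufacturer_normalized = Column(String(255), nullable=False)
    barcode_upc = Column(String(100))

    search_query_used = Column(Text, nullable=False)
    searched_at = Column(DateTime(timezone=True), nullable=False, server_default=func.now(), index=True)

    results = Column(JSONB, nullable=False)
    result_count = Column(Integer, nullable=False, default=0)

    created_by_user_id = Column(UUID(as_uuid=True), ForeignKey("core_data_users.user_id"))
    times_accessed = Column(Integer, nullable=False, default=1)
    last_accessed_at = Column(DateTime(timezone=True), nullable=False, server_default=func.now())

    sds_imported = Column(Boolean, nullable=False, default=False)
    imported_sds_id = Column(UUID(as_uuid=True), ForeignKey("chemiq_sds_documents.sds_id"))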
5.2 Cache Schema for Results
class CachedWebSDSResult(BaseModel):
    title: str
    url: str
    snippet: str
    source_domain: str
    date: Optional[str]
    is_pdf: bool
    confidence_score: float

# Stored in results JSONB column as array
5.3 Cache Service Logic
File: tellus-ehs-hazcom-service/app/services/chemiq/web_sds_search_service.py
class WebSDSSearchService:
    CACHE_TTL_DAYS = 7

    async def search_web_sds(
        self,
        company_id: UUID,
        product_name: str,
        manufacturer: str,
        barcode: Optional[str] = None,
        user_id: Optional[UUID] = None
    ) -> WebSDSSearchResponse:
        """
        Search for SDS on the web with caching.
        1. Check cache first
        2. If cache hit and fresh → return cached results
        3. If cache miss → call Perplexity → save to cache
        """
        # Normalize for cache lookup
        product_normalized = self._normalize(product_name)
        manufacturer_normalized = self._normalize(manufacturer)

        # Check cache
        cached = await self._get_cached_results(
            company_id, product_normalized, manufacturer_normalized
        )
        if cached and self._is_cache_fresh(cached):
            # Update access tracking
            await self._update_cache_access(cached.cache_id)
            return self._cached_to_response(cached)

        # Cache miss or stale - call Perplexity
        results = await self.perplexity_client.search_sds(
            product_name, manufacturer, barcode
        )

        # Save to cache
        await self._save_to_cache(
            company_id=company_id,
            product_normalized=product_normalized,
            manufacturer_normalized=manufacturer_normalized,
            barcode=barcode,
            results=results,
            user_id=user_id
        )
        return results

    def _normalize(self, text: str) -> str:
        """Normalize text for cache matching."""
        return text.lower().strip()

    def _is_cache_fresh(self, cached) -> bool:
        """Check if cache entry is still valid."""
        age = datetime.now(timezone.utc) - cached.searched_at
        return age.days < self.CACHE_TTL_DAYS

    async def mark_cache_used(self, company_id: UUID, sds_id: UUID) -> None:
        """Mark cache as used when SDS is imported."""
        # Update cache entry to mark sds_imported = True
        # Optional: delete old cache entries
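The _get_cached_results, _update_cache_access, and _save_to_cache helpers referenced above are not specified in this plan. A sketch follows, assuming the SDSWebSearchCache model from 5.1 (placeholder name), a SQLAlchemy Session on self.db, and pydantic v2 result models; the helpers are declared async only to match the calls above, and in practice the actual Perplexity query string should be passed into _save_to_cache rather than rebuilt.

from datetime import datetime, timezone

async def _get_cached_results(self, company_id, product_normalized, manufacturer_normalized):
    """Fetch the cache row for this company/product/manufacturer, if any."""
    return (
        self.db.query(SDSWebSearchCache)
        .filter(
            SDSWebSearchCache.company_id == company_id,
            SDSWebSearchCache.product_name_normalized == product_normalized,
            SDSWebSearchCache.manufacturer_normalized == manufacturer_normalized,
        )
        .first()
    )

async def _update_cache_access(self, cache_id) -> None:
    """Bump access tracking on a cache hit."""
    row = self.db.get(SDSWebSearchCache, cache_id)
    row.times_accessed += 1
    row.last_accessed_at = datetime.now(timezone.utc)
    self.db.commit()

async def _save_to_cache(self, company_id, product_normalized, manufacturer_normalized,
                         barcode, results, user_id) -> None:
    """Insert or refresh the cache row with the latest Perplexity results."""
    row = await self._get_cached_results(company_id, product_normalized, manufacturer_normalized)
    if row is None:
        row = SDSWebSearchCache(
            company_id=company_id,
            product_name_normalized=product_normalized,
            manufacturer_normalized=manufacturer_normalized,
            barcode_upc=barcode,
            created_by_user_id=user_id,
        )
        self.db.add(row)
    row.results = [r.model_dump() for r in results]  # pydantic v2; use .dict() on v1
    row.result_count = len(row.results)
    # Placeholder: pass through the actual query string used for the Perplexity call
    row.search_query_used = f"{product_normalized} {manufacturer_normalized} SDS"
    row.searched_at = datetime.now(timezone.utc)
    self.db.commit()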
5.4 Cache Cleanup (Background Job)
Add to background service scheduler:
@scheduler.scheduled_job('cron', day='*', hour=3, minute=0)
def cleanup_stale_sds_web_cache():
    """
    Daily cleanup of stale web search cache.
    - Delete entries older than 30 days
    - Delete entries where sds_imported = True and older than 7 days
    """
Phase 6: Rate Limiting
- Limit web searches per company: 50/day
- Limit per user: 20/day
- Track counters in Redis or the database (a Redis-based sketch follows below)
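A minimal sketch of the Redis option, assuming redis-py; the key naming, TTL handling, and helper name are illustrative.

from datetime import date

import redis

COMPANY_DAILY_LIMIT = 50
USER_DAILY_LIMIT = 20

def check_web_search_rate_limit(r: redis.Redis, company_id: str, user_id: str) -> bool:
    """Return True if this search is allowed; increments daily counters."""
    today = date.today().isoformat()
    company_key = f"sds_web_search:{today}:company:{company_id}"
    user_key = f"sds_web_search:{today}:user:{user_id}"

    pipe = r.pipeline()
    pipe.incr(company_key)
    pipe.expire(company_key, 86400)  # counters roll over daily
    pipe.incr(user_key)
    pipe.expire(user_key, 86400)
    company_count, _, user_count, _ = pipe.execute()

    return company_count <= COMPANY_DAILY_LIMIT and user_count <= USER_DAILY_LIMIT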
File Changes Summary
New Files to Create:
Backend (tellus-ehs-hazcom-service):
- app/services/external/__init__.py
- app/services/external/perplexity_client.py - Perplexity API client
- app/services/chemiq/web_sds_search_service.py - Web search + cache orchestration
- app/schemas/chemiq/web_sds_search.py - Request/response schemas
- app/utils/pdf_downloader.py - PDF download utility
- app/db/models/chemiq_sds_web_cache.py - Cache table model
- alembic/versions/xxx_add_chemiq_sds_web_search_cache.py - Migration
Background Service (tellus-ehs-background-service):
- app/jobs/cleanup_sds_web_cache.py - Daily cache cleanup job
Frontend (tellus-ehs-hazcom-ui):
- src/pages/chemiq/inventory/components/WebSDSSearchModal.tsx - Modal for web search
- src/types/webSdsSearch.ts - TypeScript types
Files to Modify:
Backend:
- app/api/v1/chemiq/sds.py - Add new endpoints
- app/core/config.py - Add Perplexity config
- .env - Add API key
Frontend:
- src/services/api/chemiq.api.ts - Add API functions
- src/services/api/index.ts - Export new functions
- src/pages/chemiq/inventory/components/SDSInfoCard.tsx - Add web search button
- src/pages/chemiq/inventory/components/SDSSearchSection.tsx - Add web search fallback
Security Considerations
- URL Validation: Only allow HTTPS URLs, validate against known patterns
- PDF Validation: Verify magic bytes, scan for malicious content
- Size Limits: Max 20MB per PDF download
- Rate Limiting: Prevent API abuse
- Domain Allowlist: Consider limiting to trusted SDS databases
Testing Checklist
Internal Search First:
- Internal library search runs before web search option appears
- "Search Web" button only shows when internal search returns 0 results
Caching:
- First web search calls Perplexity and saves to cache
- Second search for same product returns cached results (no API call)
- Cache expires after 7 days and triggers fresh API call
- Cache is marked as used when SDS is imported
- Background cleanup job removes stale cache entries
Perplexity Integration:
- Perplexity API integration works with valid API key
- Search returns relevant SDS results
- Domain filtering focuses on SDS databases
PDF Import:
- PDF download handles various URL formats
- Duplicate detection by hash works
- S3 upload and SDS record creation work
Parse Job:
- Parse job created in chemiq_sds_parse_queue after import
- Background service picks up and parses imported SDS
- Hazard info, composition, sections populated after parsing
Frontend:
- Frontend modal displays results correctly
- Import flow attaches SDS to product
- Error handling for failed downloads
- Loading states during search and import
Rate Limiting:
- Rate limiting works (50/day per company, 20/day per user)