This document describes the new import features for populating the terms database. The system now supports two additional import methods alongside the existing OCR and PDF text extraction:
- CSV File Import - Import terms from CSV files
- Web Scraping - Extract terms from websites automatically
CSV files must have the following structure:
Required Columns:
- `term_en` - English term (required)
- `term_ar` - Arabic term (required)
Optional Columns:
- `resource_page_id` - ID of the resource page (optional)
- `confidence_level` - Confidence level (0.0 to 1.0, optional)
- `x`, `y`, `width`, `height` - Coordinates (optional)
- `status` - Term status (`unverified`, `accepted`, or `rejected`; optional)
- `rejection_reason` - Reason for rejection (optional)
- `corrections` - Correction notes (optional)
- `source_url` - Source URL (optional)
- `source_type` - Source type (optional)
A template CSV file can be downloaded from the import page. The template includes example data with all optional columns.
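The column rules above can be sketched as a small validator. This is a language-neutral illustration in Python, not the system's actual validation code; the function name and error-message wording are assumptions.

```python
import csv
import io

REQUIRED = {"term_en", "term_ar"}

def validate_csv(text):
    """Return (valid_rows, errors) for a CSV payload with the columns above.

    Illustrative sketch: the real importer's messages and rules may differ.
    """
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        return [], ["missing required column(s): " + ", ".join(sorted(missing))]
    rows, errors = [], []
    for lineno, row in enumerate(reader, start=2):  # line 1 is the header
        # Both required columns must be non-empty on every row.
        if not (row["term_en"] or "").strip() or not (row["term_ar"] or "").strip():
            errors.append(f"row {lineno}: term_en and term_ar must be non-empty")
            continue
        # confidence_level, when present, must be a number in [0.0, 1.0].
        level = (row.get("confidence_level") or "").strip()
        if level:
            try:
                ok = 0.0 <= float(level) <= 1.0
            except ValueError:
                ok = False
            if not ok:
                errors.append(f"row {lineno}: confidence_level must be a number in [0.0, 1.0]")
                continue
        rows.append(row)
    return rows, errors
```

Running a file through a check like this before uploading catches the most common rejection causes (missing columns, empty required fields, out-of-range confidence values).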
- Skip Duplicate Terms: When enabled, the system will skip terms that already exist in the database (based on English or Arabic term match).
- Auto Save to Database: When enabled, imported terms are automatically saved to the database.
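The "Skip Duplicate Terms" check matches on either the English or the Arabic side. A minimal sketch of that logic (case-insensitive matching on the English side is an assumption here; the system may compare exactly):

```python
def is_duplicate(term, existing_en, existing_ar):
    """True if either side of the term already exists in the database.

    existing_en: set of lowercased English terms already stored.
    existing_ar: set of Arabic terms already stored.
    """
    return (term["term_en"].strip().lower() in existing_en
            or term["term_ar"].strip() in existing_ar)
```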
Navigate to Admin → Import → Import Terms to access the full import wizard with step-by-step guidance.
Click the "Import CSV" button in the Terms table header for quick imports without leaving the terms list.
The web scraper supports multiple extraction methods:
- Auto Detect - Automatically tries all methods to find terms
- Extract from Tables - Extracts terms from HTML tables (ideal for glossaries)
- Extract from Lists - Extracts terms from ordered/unordered/definition lists
- Extract from Glossary Sections - Looks for glossary-specific HTML structures
- Extract Keywords - Extracts capitalized terms and looks for nearby Arabic translations
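To illustrate the table-based method, here is a minimal Python sketch that pulls (English, Arabic) pairs from the first two `<td>` cells of each table row. The production scraper uses Symfony DomCrawler (PHP), so this is an illustration of the idea, not the actual implementation.

```python
from html.parser import HTMLParser

class TablePairExtractor(HTMLParser):
    """Collect the first two <td> cells of every <tr> as a term pair."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._row = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":          # header rows (<th>) are skipped
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and len(self._row) >= 2:
            self.pairs.append((self._row[0].strip(), self._row[1].strip()))

def extract_table_pairs(html):
    parser = TablePairExtractor()
    parser.feed(html)
    return [(en, ar) for en, ar in parser.pairs if en and ar]
```

A glossary page laid out as a two-column table yields one pair per body row, which is why the table method is the most reliable choice when the markup allows it.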
- Enter the full URL of the webpage containing terms
- Use the "Test URL" button to verify accessibility before scraping
- The scraper uses a standard browser user-agent to avoid blocking
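Sending a browser-like user-agent can be sketched as below. The exact user-agent string the scraper sends is an implementation detail; the value here is only a typical example.

```python
import urllib.request

# A typical desktop-browser UA string; the scraper's actual value may differ.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/120.0 Safari/537.36")

def build_request(url):
    """Build a request that presents a browser user-agent instead of the
    default Python client string, which many sites block."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})

# Fetching would then be: urllib.request.urlopen(build_request(url), timeout=10)
```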
- Table Selector: Custom CSS selector for tables (e.g., `table.glossary`, `#terms-table`)
- List Selector: Custom CSS selector for lists (e.g., `ul.terms`, `#glossary-list`)
- Skip Duplicate Terms: Skip terms that already exist in the database
- Auto Save to Database: Automatically save scraped terms to database
- Fetch Content: The scraper downloads the webpage HTML content
- Parse HTML: Uses Symfony DomCrawler to parse and navigate the HTML
- Extract Terms: Applies the selected extraction method to find term pairs
- Validate & Save: Validates extracted terms and saves them to database
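The validation step has to reject pairs where the scraper picked up the wrong cells, such as two English strings. A minimal sketch of such a check (the actual rules are not documented here, so this is an assumption):

```python
import re

# Arabic block of the Basic Multilingual Plane.
ARABIC = re.compile(r"[\u0600-\u06FF]")
LATIN = re.compile(r"[A-Za-z]")

def is_valid_pair(term_en, term_ar):
    """Keep a pair only if the English side contains Latin letters and
    the Arabic side contains Arabic-script characters."""
    return bool(term_en and term_ar
                and LATIN.search(term_en)
                and ARABIC.search(term_ar))
```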
Sample CSV content:

```csv
term_en,term_ar,confidence_level,status
Artificial Intelligence,الذكاء الاصطناعي,0.9,unverified
Machine Learning,التعلم الآلي,0.8,unverified
```

- Navigate to the Import Terms page
- Select Web Scraping as import method
- Enter the URL, e.g. `https://example.com/glossary`
- Select the Auto Detect extraction method
- Click Start Import
- Go to Terms list page
- Click Import CSV button in table header
- Select CSV file
- Configure options
- Click Import
The system provides detailed error reporting:
- Row-by-row error messages
- Validation errors (missing required columns, invalid data)
- Duplicate detection warnings
- Import summary with success/failure counts
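The CSV error reporting above can be sketched as a summary builder. The field names and message format here are illustrative assumptions, not the system's actual report schema.

```python
def summarize(results):
    """results: list of (row_number, error_message_or_None) pairs.

    Returns a summary mirroring the success/failure counts and
    row-by-row messages shown on the import page (illustrative shape).
    """
    failures = [(n, e) for n, e in results if e]
    return {
        "total": len(results),
        "imported": len(results) - len(failures),
        "failed": len(failures),
        "messages": [f"row {n}: {e}" for n, e in failures],
    }
```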
- URL accessibility errors
- HTML parsing errors
- No terms found warnings
- Network timeout handling
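The scraping failure modes above fall into a few categories. A sketch of how fetch exceptions might be classified (the category names are illustrative, not the system's actual error codes):

```python
import socket
import urllib.error

def classify_fetch_error(exc):
    """Map a fetch exception to a coarse error category.

    Note: HTTPError subclasses URLError, so it must be checked first.
    """
    if isinstance(exc, socket.timeout):
        return "network_timeout"
    if isinstance(exc, urllib.error.HTTPError):
        return "url_inaccessible"   # server responded with 4xx/5xx
    if isinstance(exc, urllib.error.URLError):
        return "network_error"      # DNS failure, connection refused, ...
    return "unknown"
```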
Two new fields have been added to the terms table:
- `source_url` (string, nullable) - URL where the term was sourced from
- `source_type` (string, nullable) - Type of source (`csv_import`, `web_scrape`, `pdf_ocr`, etc.)
- Always use the template for the correct column structure
- Validate data before importing (no empty required fields)
- Use consistent formatting for optional fields
- Consider enabling "Skip Duplicates" for large imports
- Test URLs before full scraping
- Use specific selectors for better accuracy
- Start with "Auto Detect" and refine if needed
- Review scraped terms before bulk imports
- CSV Import Fails
  - Check file encoding (UTF-8 recommended)
  - Verify required columns exist
  - Ensure the file size is under 10MB
- Web Scraping Returns No Terms
  - Test URL accessibility first
  - Try different extraction methods
  - Check whether the page requires JavaScript (the scraper does not execute JS)
  - Verify the page contains English-Arabic term pairs
- Permission Errors
  - Ensure the storage directory is writable
  - Check network permissions for web scraping
- File Uploads: CSV files are validated for type and size
- URL Validation: Web scraping URLs are validated before fetching
- Content Sanitization: All imported data is sanitized and validated
- Rate Limiting: Consider implementing rate limiting for web scraping
- Large CSV Files: Process in chunks for files with thousands of rows
- Web Scraping: Use specific selectors for faster extraction
- Database: Index frequently searched columns (term_en, term_ar)
The import features can be extended with API endpoints for:
- Programmatic CSV imports
- Scheduled web scraping
- Third-party integration
For issues with import features:
- Check the error messages in the import summary
- Review the application logs
- Verify file formats and URL accessibility
- Contact system administrator for persistent issues