Import Features Documentation

Overview

This document describes the import features added to the system for populating the database with terms. In addition to the existing OCR and PDF text extraction, the system now supports two further import methods:

CSV File Import - Import terms from CSV files
Web Scraping - Extract terms from websites automatically

CSV Import Features

File Format Requirements

CSV files must have the following structure:

Required Columns:

term_en - English term (required)
term_ar - Arabic term (required)

Optional Columns:

resource_page_id - ID of the resource page (optional)
confidence_level - Confidence level (0.0 to 1.0, optional)
x, y, width, height - Coordinates (optional)
status - Term status (unverified, accepted, rejected, optional)
rejection_reason - Reason for rejection (optional)
corrections - Correction notes (optional)
source_url - Source URL (optional)
source_type - Source type (optional)
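
The column rules above can be checked row by row before anything is imported. The following is a minimal sketch, not the system's own validator: Python is used for illustration only, and the function name `validate_row` and the error wording are assumptions. Only the column names, the 0.0 to 1.0 range, and the three status values come from the list above.

```python
import csv
import io

REQUIRED_COLUMNS = ("term_en", "term_ar")
ALLOWED_STATUSES = {"unverified", "accepted", "rejected"}


def validate_row(row):
    """Return a list of problems found in one CSV row (empty list means valid)."""
    errors = []
    for column in REQUIRED_COLUMNS:
        if not (row.get(column) or "").strip():
            errors.append(f"{column} is required")

    # confidence_level is optional, but must be a number in [0.0, 1.0] when present
    confidence = (row.get("confidence_level") or "").strip()
    if confidence:
        try:
            value = float(confidence)
        except ValueError:
            errors.append("confidence_level must be a number")
        else:
            if not 0.0 <= value <= 1.0:
                errors.append("confidence_level must be between 0.0 and 1.0")

    # status is optional, but must be one of the documented values when present
    status = (row.get("status") or "").strip()
    if status and status not in ALLOWED_STATUSES:
        errors.append(f"unknown status: {status}")
    return errors


sample = io.StringIO(
    "term_en,term_ar,confidence_level,status\n"
    "Artificial Intelligence,الذكاء الاصطناعي,0.9,unverified\n"
    "Machine Learning,,1.5,pending\n"
)
# start=2 because line 1 of the file is the header
for line_number, row in enumerate(csv.DictReader(sample), start=2):
    print(line_number, validate_row(row))
```

The first data row yields an empty list; the second reports the missing Arabic term, the out-of-range confidence level, and the unknown status.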

CSV Template

A template CSV file can be downloaded from the import page. The template includes example data with all optional columns.

Import Options

Skip Duplicate Terms: When enabled, the system skips terms that already exist in the database. A term counts as a duplicate if either its English or its Arabic text matches an existing term.
Auto Save to Database: When enabled, imported terms are automatically saved to the database.
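
As an illustration of the Skip Duplicate Terms option, the sketch below filters a batch against terms that already exist, treating a match on either the English or the Arabic text as a duplicate. It is a Python sketch under stated assumptions, not the system's implementation: the function name `filter_duplicates` is hypothetical, and the case-insensitive English comparison and the de-duplication within the incoming batch are assumptions that this document does not confirm.

```python
def filter_duplicates(incoming, existing):
    """Split incoming terms into (kept, skipped) using English-or-Arabic matching."""
    # Assumption: English is compared case-insensitively, Arabic exactly.
    seen_en = {term["term_en"].strip().casefold() for term in existing}
    seen_ar = {term["term_ar"].strip() for term in existing}

    kept, skipped = [], []
    for term in incoming:
        english = term["term_en"].strip().casefold()
        arabic = term["term_ar"].strip()
        if english in seen_en or arabic in seen_ar:
            skipped.append(term)
            continue
        kept.append(term)
        # Assumption: later rows in the same batch are also checked against this one.
        seen_en.add(english)
        seen_ar.add(arabic)
    return kept, skipped


existing = [{"term_en": "Machine Learning", "term_ar": "التعلم الآلي"}]
incoming = [
    {"term_en": "machine learning", "term_ar": "تعلم الآلة"},
    {"term_en": "Neural Network", "term_ar": "شبكة عصبية"},
]
kept, skipped = filter_duplicates(incoming, existing)
print(len(kept), len(skipped))
```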

Import Methods

Method 1: Import Page

Navigate to Admin → Import → Import Terms to access the full import wizard with step-by-step guidance.

Method 2: Quick Import from Terms Table

Click the "Import CSV" button in the Terms table header for quick imports without leaving the terms list.

Web Scraping Features

Supported Extraction Methods

The web scraper supports multiple extraction methods:

Auto Detect - Automatically tries all methods to find terms
Extract from Tables - Extracts terms from HTML tables (ideal for glossaries)
Extract from Lists - Extracts terms from ordered/unordered/definition lists
Extract from Glossary Sections - Looks for glossary-specific HTML structures
Extract Keywords - Extracts capitalized terms and looks for nearby Arabic translations
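
The Extract Keywords method can be pictured as a pattern match: a capitalized English phrase followed closely by Arabic text. The sketch below is one simple way to express that heuristic in Python; the regular expressions, the allowed separator characters, and the function name `extract_keyword_pairs` are assumptions made for illustration, and the scraper's actual rules may differ.

```python
import re

# One or more capitalized words, e.g. "Machine Learning"
ENGLISH = r"\b[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*"
# One or more words made of characters from the basic Arabic Unicode block
ARABIC = r"[\u0600-\u06FF]+(?:\s+[\u0600-\u06FF]+)*"
# Assumption: "nearby" means separated by at most five spaces, colons, brackets, or hyphens
SEPARATOR = r"[\s:()\-]{1,5}"

PAIR = re.compile(f"({ENGLISH}){SEPARATOR}({ARABIC})")


def extract_keyword_pairs(text):
    """Return (english, arabic) tuples for capitalized terms followed by Arabic text."""
    return PAIR.findall(text)


text = "Machine Learning (التعلم الآلي) is a core topic. Neural Network: شبكة عصبية"
for english, arabic in extract_keyword_pairs(text):
    print(english, "->", arabic)
```

A heuristic like this produces false positives (any capitalized word next to Arabic text matches), which is one reason to review scraped terms before saving them.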

URL Configuration

Enter the full URL of the webpage containing terms
Use the "Test URL" button to verify accessibility before scraping
The scraper sends a standard browser User-Agent header to reduce the chance of being blocked
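
A URL accessibility test of this kind can be sketched with Python's standard library. This is an illustration only: the User-Agent string, the 10 second timeout, and the function names `build_request` and `check_url` are assumptions, not values taken from the system.

```python
import urllib.error
import urllib.request

# Hypothetical User-Agent value; the string the scraper really sends is not documented here.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url):
    """Create a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_USER_AGENT})


def check_url(url, timeout=10):
    """Return (ok, detail) describing whether the URL responded successfully."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 403 or 404
        return False, f"HTTP {error.code}"
    except (OSError, ValueError) as error:
        # OSError covers DNS failures, refused connections, and timeouts;
        # ValueError covers strings that are not URLs at all
        return False, str(error)
```

For example, `check_url("https://example.com/glossary")` returns a pair such as `(True, "HTTP 200")` when the page is reachable.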

Scraping Configuration

Custom Selectors

Table Selector: Custom CSS selector for tables (e.g., table.glossary, #terms-table)
List Selector: Custom CSS selector for lists (e.g., ul.terms, #glossary-list)

Options

Skip Duplicate Terms: Skip terms that already exist in the database
Auto Save to Database: Automatically save scraped terms to database

How It Works

Fetch Content: The scraper downloads the webpage HTML content
Parse HTML: Uses Symfony DomCrawler to parse and navigate the HTML
Extract Terms: Applies the selected extraction method to find term pairs
Validate & Save: Validates extracted terms and saves them to database
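
The parse, extract, and validate steps can be sketched end to end on a small HTML string. The system itself uses Symfony DomCrawler; the Python version below relies on the standard library's `html.parser` instead and is an illustration only. It skips the fetch step, and the class `TableRowParser`, the helpers `is_arabic` and `extract_pairs`, and the rule "first cell English, second cell Arabic" are all assumptions.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell texts of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def is_arabic(text):
    """True if the text contains at least one character from the basic Arabic block."""
    return any("\u0600" <= ch <= "\u06ff" for ch in text)


def extract_pairs(html):
    """Parse the HTML, then keep rows that look like an English-Arabic pair."""
    parser = TableRowParser()
    parser.feed(html)
    pairs = []
    for row in parser.rows:
        # Validation: non-empty English cell without Arabic, Arabic in the second cell
        if len(row) >= 2 and row[0] and not is_arabic(row[0]) and is_arabic(row[1]):
            pairs.append({"term_en": row[0], "term_ar": row[1]})
    return pairs


html = """
<table class="glossary">
  <tr><th>English</th><th>Arabic</th></tr>
  <tr><td>Artificial Intelligence</td><td>الذكاء الاصطناعي</td></tr>
  <tr><td>Machine Learning</td><td>التعلم الآلي</td></tr>
</table>
"""
print(extract_pairs(html))
```

The header row is dropped by the validation rule because its second cell contains no Arabic text.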

Usage Examples

Example 1: Importing from CSV

Sample CSV content:
term_en,term_ar,confidence_level,status
Artificial Intelligence,الذكاء الاصطناعي,0.9,unverified
Machine Learning,التعلم الآلي,0.8,unverified

Example 2: Web Scraping a Glossary Page

Navigate to Import Terms page
Select Web Scraping as import method
Enter URL: https://example.com/glossary
Select Auto Detect extraction method
Click Start Import

Example 3: Quick CSV Import

Go to Terms list page
Click Import CSV button in table header
Select CSV file
Configure options
Click Import

Error Handling

CSV Import Errors

The system provides detailed error reporting:

Row-by-row error messages
Validation errors (missing required columns, invalid data)
Duplicate detection warnings
Import summary with success/failure counts
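
An import summary of this shape can be sketched as follows. This is a Python illustration, not the system's report format: the function name `import_csv`, the summary keys, and the message wording are assumptions, and nothing is written to a database.

```python
import csv
import io


def import_csv(text, existing_en=()):
    """Walk the rows of a CSV string and build a summary with row-level errors."""
    summary = {"imported": 0, "skipped_duplicates": 0, "failed": 0, "errors": []}
    reader = csv.DictReader(io.StringIO(text))

    # A missing required column invalidates the whole file, so stop early
    missing = [c for c in ("term_en", "term_ar") if c not in (reader.fieldnames or [])]
    if missing:
        summary["errors"].append(f"missing required columns: {', '.join(missing)}")
        return summary

    seen = {term.casefold() for term in existing_en}
    for row in reader:
        english = (row["term_en"] or "").strip()
        arabic = (row["term_ar"] or "").strip()
        if not english or not arabic:
            summary["failed"] += 1
            # reader.line_num is the line of the file that was just read
            summary["errors"].append(
                f"row {reader.line_num}: term_en and term_ar are required"
            )
        elif english.casefold() in seen:
            summary["skipped_duplicates"] += 1
        else:
            seen.add(english.casefold())
            summary["imported"] += 1
    return summary


print(import_csv("term_en,term_ar\nAI,ذكاء\n,فارغ\nAI,ذكاء\n"))
```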

Web Scraping Errors

URL accessibility errors
HTML parsing errors
No terms found warnings
Network timeout handling

Database Schema Updates

Two new fields have been added to the terms table:

source_url (string, nullable) - URL where the term was sourced from
source_type (string, nullable) - Type of source (csv_import, web_scrape, pdf_ocr, etc.)
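
The effect of the two nullable fields can be tried out with an in-memory SQLite database. This sketch is for illustration only: the production database engine and migration tooling are not described in this document, and the `terms` table here is reduced to a few columns.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE terms ("
    "id INTEGER PRIMARY KEY, term_en TEXT NOT NULL, term_ar TEXT NOT NULL)"
)

# Add the two new columns; without NOT NULL they accept NULL values
for column in ("source_url", "source_type"):
    connection.execute(f"ALTER TABLE terms ADD COLUMN {column} TEXT")

# A scraped term records where it came from
connection.execute(
    "INSERT INTO terms (term_en, term_ar, source_url, source_type) VALUES (?, ?, ?, ?)",
    ("Artificial Intelligence", "الذكاء الاصطناعي", "https://example.com/glossary", "web_scrape"),
)
# A term without source information leaves both new fields NULL
connection.execute(
    "INSERT INTO terms (term_en, term_ar) VALUES (?, ?)",
    ("Machine Learning", "التعلم الآلي"),
)

rows = connection.execute(
    "SELECT term_en, source_url, source_type FROM terms ORDER BY id"
).fetchall()
print(rows)
```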

Best Practices

For CSV Imports

Always use the template for the correct column structure
Validate data before importing (no empty required fields)
Use consistent formatting for optional fields
Consider enabling "Skip Duplicates" for large imports

For Web Scraping

Test URLs before full scraping
Use specific selectors for better accuracy
Start with "Auto Detect" and refine if needed
Review scraped terms before bulk imports

Troubleshooting

Common Issues

CSV Import Fails
- Check file encoding (UTF-8 recommended)
- Verify required columns exist
- Ensure file size is under 10MB

Web Scraping Returns No Terms
- Test URL accessibility first
- Try different extraction methods
- Check if page requires JavaScript (scraper doesn't execute JS)
- Verify the page contains English-Arabic term pairs

Permission Errors
- Ensure storage directory is writable
- Check network permissions for web scraping
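
The three checks listed under CSV Import Fails (encoding, required columns, file size) can be run locally before uploading. The sketch below is a Python illustration; the function name `precheck_csv` is hypothetical, and reading "10MB" as 10 × 1024 × 1024 bytes is an assumption.

```python
import csv
import io

MAX_CSV_BYTES = 10 * 1024 * 1024  # assumption: "10MB" read as 10 MiB
REQUIRED_COLUMNS = ("term_en", "term_ar")


def precheck_csv(data):
    """Return a list of problems found in the raw bytes of a CSV file."""
    problems = []
    if len(data) >= MAX_CSV_BYTES:
        problems.append("file is not under the size limit")

    try:
        text = data.decode("utf-8-sig")  # utf-8-sig also drops a leading BOM
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
        return problems

    # Read only the header line and compare it with the required columns
    header = [name.strip() for name in next(csv.reader(io.StringIO(text)), [])]
    for column in REQUIRED_COLUMNS:
        if column not in header:
            problems.append(f"missing required column: {column}")
    return problems


print(precheck_csv("term_en,term_ar\nAI,ذكاء\n".encode("utf-8")))
print(precheck_csv("term_en,term_ar\n".encode("utf-16")))
```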

Security Considerations

File Uploads: CSV files are validated for type and size
URL Validation: Web scraping URLs are validated before fetching
Content Sanitization: All imported data is sanitized and validated
Rate Limiting: Consider implementing rate limiting for web scraping
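
This document does not spell out what URL validation covers. As one plausible reading, the Python sketch below accepts only http and https URLs and rejects hosts that are literal private, loopback, or link-local addresses. The function name `validate_scrape_url` and every rule in it are assumptions, not a description of the system's actual checks.

```python
import ipaddress
from urllib.parse import urlsplit


def validate_scrape_url(url):
    """Return None if the URL looks safe to fetch, otherwise a reason string."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return "URL could not be parsed"

    if parts.scheme not in ("http", "https"):
        return "only http and https URLs are allowed"

    host = parts.hostname
    if not host:
        return "URL has no host"
    if host == "localhost":
        return "local addresses are not allowed"

    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # The host is a DNS name. A fuller check would resolve it and
        # apply the same address test to every resolved address.
        return None

    if address.is_private or address.is_loopback or address.is_link_local:
        return "private or local addresses are not allowed"
    return None


for candidate in ("https://example.com/glossary", "file:///etc/passwd", "http://127.0.0.1/admin"):
    print(candidate, "->", validate_scrape_url(candidate))
```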

Performance Tips

Large CSV Files: Process in chunks for files with thousands of rows
Web Scraping: Use specific selectors for faster extraction
Database: Index frequently searched columns (term_en, term_ar)
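
Chunked processing of a large CSV file can be sketched with a small generator that hands out rows in fixed-size batches, so only one batch is held in memory at a time. The helper name `iter_chunks` and the chunk size are illustrative assumptions.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Build a small in-memory CSV with five data rows for demonstration
data = "term_en,term_ar\n" + "".join(f"Term {i},مصطلح {i}\n" for i in range(5))
reader = csv.DictReader(io.StringIO(data))

chunk_sizes = []
for chunk in iter_chunks(reader, 2):
    # A real import would validate and save each chunk here, e.g. in one transaction
    chunk_sizes.append(len(chunk))
print(chunk_sizes)
```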

API Integration

The import features can be extended with API endpoints for:

Programmatic CSV imports
Scheduled web scraping
Third-party integration

Support

For issues with import features:

Check the error messages in the import summary
Review the application logs
Verify file formats and URL accessibility
Contact system administrator for persistent issues
