A comprehensive web application for scraping Twitter (X) data with an intuitive UI, database integration, and advanced scraping features.
- Features
- Project Structure
- Installation
- Usage
- API Reference
- Rate Limit Management
- Database Schema
- Technology Stack
- Troubleshooting
- Contributing
- Search Tweets: Find tweets containing specific keywords
- Hashtag Tweets: Scrape tweets containing specific hashtags
- User Tweets: Collect tweets from specific Twitter users, including their tweets, replies, media, and likes
- Date Range Search: Search for tweets within a specific time period
- Job Management System: Track and monitor all scraping jobs
- Database Integration: Save all scraped tweets to a MySQL database for permanent storage
- Rate Limit Handling: Built-in rate limit tracking to avoid hitting Twitter API limits
- Pagination Support: Automatically paginates through results to collect the requested number of tweets
- Full Tweet Metadata: Captures comprehensive tweet data including:
  - Reply counts
  - Retweet counts
  - Bookmark counts
  - Hashtags
  - Creation timestamps
  - User information
- Modern UI: Clean, responsive interface built with Next.js and Tailwind CSS
- Job Dashboard: View all scrape jobs and their results
- Rate Limit Indicators: Visual indicators for API rate limits
twitter-scraper/
├── twitter-scraper-app/          # Next.js frontend application
│   ├── public/                   # Static assets
│   ├── src/                      # Application source code
│   │   ├── app/                  # Pages and routes
│   │   │   ├── api/              # API routes
│   │   │   │   ├── jobs/         # Job management API
│   │   │   │   ├── scrape/       # Scraping API endpoints
│   │   │   ├── date-range/       # Date Range search page
│   │   │   ├── hashtag/          # Hashtag search page
│   │   │   ├── jobs/             # Jobs overview page
│   │   │   ├── search/           # General search page
│   │   │   ├── user/             # User tweets page
│   │   ├── components/           # React components
│   │   └── utils/                # Utility functions
├── initialize_db.py              # Database initialization script
├── scraper_api.py                # Python API bridge for frontend
├── db_interface.py               # Database interface functions
├── tweet_scraper_service.py      # Core Twitter scraping logic
└── .env                          # Environment variables
- Python 3.8 or higher
- Node.js 16.x or higher
- MySQL database
git clone <repository-url>
cd twitter-scraper

# Create a virtual environment
python -m venv venv
# Activate the virtual environment
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
# Install dependencies
pip install mysql-connector-python python-dotenv twikit

cd twitter-scraper-app
npm install

Create a .env file in the project root directory (twitter-scraper/, not twitter-scraper-app/):
# Twitter Credentials
TWITTER_USERNAME=your_username
TWITTER_EMAIL=your_email
TWITTER_PASSWORD=your_password
# Database Configuration
DB_HOST=localhost
DB_USER=root
DB_PASSWORD=your_db_password
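
Before initializing the database, it can help to confirm that every setting is filled in. The following is a minimal sketch, not part of the project: `missing_settings` is a hypothetical helper, and it assumes the values have already been loaded into a mapping such as `os.environ` (for example by python-dotenv).

```python
from typing import List, Mapping

# Keys taken from the sample .env file above.
REQUIRED_SETTINGS = (
    "TWITTER_USERNAME",
    "TWITTER_EMAIL",
    "TWITTER_PASSWORD",
    "DB_HOST",
    "DB_USER",
    "DB_PASSWORD",
)


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the required keys that are absent or blank in `env`.

    Hypothetical helper for illustration; the project itself may
    validate its configuration differently.
    """
    return [key for key in REQUIRED_SETTINGS if not env.get(key, "").strip()]
```

Calling `missing_settings(os.environ)` after loading the file returns an empty list when everything is set, or the names of the settings that still need values.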
python initialize_db.py

cd twitter-scraper-app
npm run dev

The application will be available at http://localhost:3000
- Navigate to the "Search" page
- Enter your search query
- Select search type (Latest, Top, or Media)
- Choose the number of tweets to retrieve (1-100)
- Click "Start Scraping"
- Navigate to the "Hashtags" page
- Enter the hashtag (without the # symbol)
- Select search type (Latest or Top)
- Choose the number of tweets to retrieve
- Click "Start Scraping"
- Navigate to the "User Tweets" page
- Enter the username (without the @ symbol)
- Select tweet type (Tweets, Replies, Media, or Likes)
- Choose the number of tweets to retrieve
- Click "Start Scraping"
- Navigate to the "Date Range" page
- Enter your search query
- Select start and end dates
- Choose the number of tweets to retrieve
- Click "Start Scraping"
- Navigate to the "Jobs" page
- Browse the list of all scraping jobs
- Click "View Details" to see job details and scraped tweets
Initiates a scraping job.
Request Body:
{
  "type": "SEARCH_TWEETS",
  "params": {
    "query": "example search",
    "searchType": "Latest",
    "count": 30
  }
}

Types:
- SEARCH_TWEETS: General search
- HASHTAG_TOP_TWEETS: Hashtag search (top tweets)
- HASHTAG_LATEST_TWEETS: Hashtag search (latest tweets)
- USER_TWEETS: User tweets
- DATE_RANGE_TWEETS: Date range search
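
As an illustration of the request body above, here is a minimal Python sketch that builds (but does not send) a SEARCH_TWEETS request. The URL is an assumption inferred from the `api/scrape/` directory in the project structure and the default dev server address, and `build_search_request` is a hypothetical helper. Only the SEARCH_TWEETS parameters are documented here, so the other job types are listed but not constructed.

```python
import json
import urllib.request

# Job types listed in this README.
JOB_TYPES = {
    "SEARCH_TWEETS",
    "HASHTAG_TOP_TWEETS",
    "HASHTAG_LATEST_TWEETS",
    "USER_TWEETS",
    "DATE_RANGE_TWEETS",
}

# Assumption: route path inferred from the api/scrape/ directory.
SCRAPE_URL = "http://localhost:3000/api/scrape"


def build_search_request(
    query: str, search_type: str = "Latest", count: int = 30
) -> urllib.request.Request:
    """Build a POST request carrying a SEARCH_TWEETS job (hypothetical helper)."""
    if search_type not in ("Latest", "Top", "Media"):
        raise ValueError(f"unsupported search type: {search_type}")
    if not 1 <= count <= 100:
        # The search page accepts 1-100 tweets per job.
        raise ValueError("count must be between 1 and 100")
    payload = {
        "type": "SEARCH_TWEETS",
        "params": {"query": query, "searchType": search_type, "count": count},
    }
    return urllib.request.Request(
        SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the job, provided the dev server is running and the route matches the assumed path.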
Response:
{
  "success": true,
  "result": {
    "jobId": 123,
    "tweetCount": 30
  },
  "rateLimitInfo": {
    "endpoint": "SearchTimeline",
    "limit": 50,
    "resetMinutes": 15
  }
}

Gets all jobs or a specific job's details.
Query Parameters:
- jobId (optional): Get details for a specific job
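
To illustrate the optional parameter, the sketch below builds the URL for either case. The base URL is an assumption inferred from the `api/jobs/` directory in the project structure, and `jobs_url` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: route path inferred from the api/jobs/ directory.
JOBS_URL = "http://localhost:3000/api/jobs"


def jobs_url(job_id: Optional[int] = None) -> str:
    """Return the URL for all jobs, or for one job when job_id is given."""
    if job_id is None:
        return JOBS_URL
    return JOBS_URL + "?" + urlencode({"jobId": job_id})
```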
Response for all jobs:
{
  "success": true,
  "jobs": [
    {
      "job_id": 123,
      "job_type": "SEARCH_TWEETS",
      "query": "example",
      "parameters": {},
      "start_time": "2023-07-10T12:00:00Z",
      "end_time": "2023-07-10T12:01:30Z",
      "status": "COMPLETED",
      "tweet_count": 30,
      "created_at": "2023-07-10T12:00:00Z"
    }
  ]
}

Response for specific job:
{
  "success": true,
  "job": {
    "job_id": 123,
    "job_type": "SEARCH_TWEETS",
    "query": "example",
    "parameters": {},
    "start_time": "2023-07-10T12:00:00Z",
    "end_time": "2023-07-10T12:01:30Z",
    "status": "COMPLETED",
    "tweet_count": 30,
    "created_at": "2023-07-10T12:00:00Z"
  },
  "tweets": [
    {
      "id": "tweet_id",
      "user_name": "username",
      "user_id": "user_id",
      "text": "Tweet content",
      "created_at": "2023-07-09T10:00:00Z",
      "reply_count": 5,
      "retweet_count": 10,
      "bookmark_count": 2,
      "hashtags": ["example", "tweet"]
    }
  ]
}

The application tracks rate limits on the client to avoid exceeding Twitter API limits:
- Automatic Tracking: Records API usage in localStorage
- Visual Indicators: Shows remaining requests and time until reset
- Form Disabling: Automatically disables forms when rate limits are reached
- Reset Countdown: Displays countdown timer until rate limits reset
| Function | Endpoint | Limit (per 15 min) |
|---|---|---|
| Search Tweets | SearchTimeline | 50 |
| Get User Tweets | UserTweets | 50 |
| Get User Replies | UserTweetsAndReplies | 50 |
| Get User Media | UserMedia | 500 |
| Get User Likes | Likes | 500 |
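
The tracking behavior described above can be sketched as a per-endpoint counter over a 15-minute window. This is an illustration in Python rather than the app's actual code: the application stores its counters in the browser's localStorage, and both the `RateLimitTracker` class and the fixed-window reset rule are assumptions. The limits come from the table above.

```python
import time
from typing import Dict, Mapping, Optional, Tuple

WINDOW_SECONDS = 15 * 60

# Limits per 15-minute window, copied from the table above.
LIMITS = {
    "SearchTimeline": 50,
    "UserTweets": 50,
    "UserTweetsAndReplies": 50,
    "UserMedia": 500,
    "Likes": 500,
}


class RateLimitTracker:
    """In-memory fixed-window counter (hypothetical, for illustration)."""

    def __init__(self, limits: Mapping[str, int] = LIMITS,
                 window: float = WINDOW_SECONDS) -> None:
        self._limits = dict(limits)
        self._window = window
        # endpoint -> (window start time, requests used in that window)
        self._windows: Dict[str, Tuple[float, int]] = {}

    def _active(self, endpoint: str, now: float) -> Optional[Tuple[float, int]]:
        """Return the current window, or None if there is none or it expired."""
        entry = self._windows.get(endpoint)
        if entry is None or now - entry[0] >= self._window:
            return None
        return entry

    def remaining(self, endpoint: str, now: Optional[float] = None) -> int:
        now = time.time() if now is None else now
        entry = self._active(endpoint, now)
        return self._limits[endpoint] - (entry[1] if entry else 0)

    def record(self, endpoint: str, now: Optional[float] = None) -> bool:
        """Count one request; return False if the limit is already reached."""
        now = time.time() if now is None else now
        start, used = self._active(endpoint, now) or (now, 0)
        if used >= self._limits[endpoint]:
            return False
        self._windows[endpoint] = (start, used + 1)
        return True

    def seconds_until_reset(self, endpoint: str,
                            now: Optional[float] = None) -> float:
        now = time.time() if now is None else now
        entry = self._active(endpoint, now)
        return 0.0 if entry is None else entry[0] + self._window - now
```

A form would be disabled whenever `remaining` returns 0, and `seconds_until_reset` would drive the countdown timer.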
CREATE TABLE scraping_jobs (
  job_id INT AUTO_INCREMENT PRIMARY KEY,
  job_type VARCHAR(50) NOT NULL,
  query VARCHAR(255) NOT NULL,
  parameters JSON,
  start_time DATETIME NOT NULL,
  end_time DATETIME,
  status VARCHAR(20) NOT NULL,
  tweet_count INT DEFAULT 0,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE tweets (
  id VARCHAR(255) PRIMARY KEY,
  job_id INT,
  user_name VARCHAR(255),
  user_id VARCHAR(255),
  text TEXT,
  created_at DATETIME,
  reply_count INT DEFAULT 0,
  retweet_count INT DEFAULT 0,
  bookmark_count INT DEFAULT 0,
  hashtags JSON,
  raw_data JSON,
  indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  FOREIGN KEY (job_id) REFERENCES scraping_jobs(job_id)
);

- Next.js: React framework for the UI
- Tailwind CSS: Utility-first CSS framework
- Axios: HTTP client for API requests
- React DatePicker: For date selection
- TypeScript: For type safety
- Python: Core scraping functionality
- twikit: Twitter scraping library
- MySQL: Database for storing tweets and jobs
- mysql-connector-python: Database connection
- python-dotenv: Environment variables
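
To show how the Python side of this stack might hand a scraped tweet to MySQL, the sketch below converts one tweet dictionary into the parameter tuple for a parameterized INSERT against the `tweets` table from the database schema. It is an assumption about how such a write could look, not the code in `db_interface.py`; `tweet_to_row` is a hypothetical helper, and the tweet is assumed to have the shape shown in the job details response. The `%s` placeholders are the style used by mysql-connector-python.

```python
import json
from datetime import datetime
from typing import Any, Dict, Tuple

INSERT_TWEET_SQL = (
    "INSERT INTO tweets "
    "(id, job_id, user_name, user_id, text, created_at, "
    "reply_count, retweet_count, bookmark_count, hashtags, raw_data) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
)


def tweet_to_row(tweet: Dict[str, Any], job_id: int) -> Tuple[Any, ...]:
    """Map a tweet dict to INSERT parameters (hypothetical helper)."""
    created = tweet.get("created_at")
    # Assumes ISO timestamps such as "2023-07-09T10:00:00Z"; the DATETIME
    # column takes a datetime object rather than the raw string.
    created_dt = (
        datetime.strptime(created, "%Y-%m-%dT%H:%M:%SZ") if created else None
    )
    return (
        tweet["id"],
        job_id,
        tweet.get("user_name"),
        tweet.get("user_id"),
        tweet.get("text"),
        created_dt,
        int(tweet.get("reply_count") or 0),
        int(tweet.get("retweet_count") or 0),
        int(tweet.get("bookmark_count") or 0),
        json.dumps(tweet.get("hashtags") or []),  # JSON column
        json.dumps(tweet),                        # raw_data JSON column
    )
```

With an open mysql-connector-python cursor, `cursor.execute(INSERT_TWEET_SQL, tweet_to_row(tweet, job_id))` would perform the write. Passing values as parameters, rather than formatting them into the SQL string, keeps tweet text from being interpreted as SQL.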
If you encounter authentication errors:
- Check your Twitter credentials in the .env file
- Delete the cookies.json file (if it exists) to force re-authentication
- Ensure your Twitter account is not locked or requiring additional verification
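
The cookie reset step can also be scripted. This is a small sketch, assuming the saved session lives in a file named cookies.json in the directory the scraper runs from; `clear_saved_session` is a hypothetical helper.

```python
from pathlib import Path


def clear_saved_session(cookie_file: str = "cookies.json") -> bool:
    """Delete the saved cookie file; return True if a file was removed."""
    path = Path(cookie_file)
    if path.is_file():
        path.unlink()
        return True
    return False
```

The next scraping job should then log in again with the credentials from the .env file.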
If database connection fails:
- Verify MySQL is running
- Check database credentials in the .env file
- Run initialize_db.py again to create the database and tables
If hitting rate limits:
- Wait for the rate limit to reset (15 minutes)
- Reduce the number of requests by lowering the tweet count
- Space out your scraping jobs
Common installation issues:
- MySQL Connector Error: Ensure you have the proper MySQL development libraries installed
# Ubuntu/Debian
sudo apt-get install python3-dev default-libmysqlclient-dev build-essential
# macOS
brew install mysql-client
- Node.js Errors: Make sure you're using a compatible Node.js version (16.x or higher)
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
For any questions or support, please open an issue in the GitHub repository.