Services Documentation
WaterCrawl consists of several services working together in a Docker Compose environment. Here's a detailed overview of each service:
Core Services
App (Django Application)
- Image: watercrawl/watercrawl:v0.7.1
- Purpose: Main application server
- Tech Stack: Django with Gunicorn
- Default Port: 9000 (internal)
- Dependencies: PostgreSQL, Redis
- Key Features:
  - REST API endpoints
  - User authentication
  - Crawl job management
  - Plugin system
  - Data processing
- Command: gunicorn -b 0.0.0.0:9000 -w 2 watercrawl.wsgi:application --access-logfile - --error-logfile - --timeout 60
Frontend
- Image: watercrawl/frontend:v0.7.1
- Purpose: Web interface
- Tech Stack: React/Vite
- Dependencies: App (Core API)
- Key Features:
  - User interface
  - Interactive dashboard
  - Job management interface
- Environment Variables:
  - VITE_API_BASE_URL: API endpoint URL
- Command: npm run serve
Nginx
- Image: nginx:alpine
- Purpose: Web server and reverse proxy
- Default Port: 80 (configurable via NGINX_PORT)
- Dependencies: App, Frontend, MinIO
- Volumes:
  - ./nginx/nginx.conf:/etc/nginx/conf.d/default.conf.template
  - ./nginx/entrypoint.sh:/entrypoint.sh
- Command: Runs an entrypoint script that configures and starts Nginx
Celery (Task Queue)
- Image: Same as App (watercrawl/watercrawl:v0.7.1)
- Purpose: Background task processing
- Dependencies: Redis, App
- Key Features:
  - Asynchronous task execution
  - Crawl job scheduling
  - Data processing tasks
  - Plugin execution
- Command: celery -A watercrawl worker -l info -S django
Celery Beat (Scheduler)
- Image: Same as App (watercrawl/watercrawl:v0.7.1)
- Purpose: Periodic task scheduler
- Dependencies: Redis, App, Celery
- Key Features:
  - Schedule periodic tasks
  - Run recurring jobs
- Command: celery -A watercrawl beat -l info -S django
Supporting Services
PostgreSQL
- Image: postgres:17.2-alpine3.21
- Purpose: Main database
- Default Port: 5432 (internal)
- Volumes:
  - ./volumes/postgres-db:/var/lib/postgresql/data
- Environment Variables:
  - POSTGRES_PASSWORD: Database password
  - POSTGRES_USER: Database username
  - POSTGRES_DB: Database name
- Health Check: Ensures the database is ready before dependent services start
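The readiness idea behind the health check, waiting until the database accepts connections before anything that depends on it starts, can be sketched as a small retry loop. This is only an illustration and not the check defined in WaterCrawl's compose file: the wait_for helper is hypothetical, and the commented usage assumes the database is reachable under the host name postgres and that PostgreSQL's pg_isready client utility is installed.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_for MAX_ATTEMPTS COMMAND [ARGS...]
wait_for() {
  max_attempts="$1"
  shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1  # give up: the command never succeeded
    fi
    attempt=$((attempt + 1))
    sleep 1     # pause one second between attempts
  done
  return 0
}

# Example (assumes the host name "postgres" and that pg_isready is available):
# wait_for 30 pg_isready -h postgres -p 5432 -U "$POSTGRES_USER" -d "$POSTGRES_DB"
```

The helper returns 0 as soon as the command succeeds and 1 once the attempt limit is reached, so it can gate a startup script with a plain `wait_for ... || exit 1`.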
Redis
- Image: redis:latest
- Purpose: Cache and message broker
- Used For:
  - Celery task queue
  - Django cache
  - Rate limiting
  - Locks
MinIO
- Image: minio/minio:RELEASE.2024-11-07T00-52-20Z
- Purpose: Object storage (S3-compatible)
- Volumes:
  - ./volumes/minio-data:/data
- Environment Variables:
  - MINIO_BROWSER_REDIRECT_URL: URL for MinIO console
  - MINIO_SERVER_URL: URL for MinIO server
  - MINIO_ROOT_USER: MinIO username (same as MINIO_ACCESS_KEY)
  - MINIO_ROOT_PASSWORD: MinIO password (same as MINIO_SECRET_KEY)
- Command: server /data --console-address ":9001"
Playwright
- Image: watercrawl/playwright:1.1
- Purpose: Headless browser for JavaScript rendering
- Default Port: 8000 (internal)
- Environment Variables:
  - AUTH_API_KEY: API key for authentication
  - PORT: Service port
  - HOST: Service host
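Because the supporting services are configured through environment variables, a quick pre-flight check can catch a missing value before the stack is started. The sketch below is an illustration under stated assumptions: missing_env is a hypothetical helper, and it presumes the variables are already set in the current shell (for example by sourcing an env file), which this page does not specify.

```shell
#!/bin/sh
# Hypothetical helper: print the names (space-separated) of any listed
# variables that are unset or empty in the current shell.
missing_env() {
  missing=""
  for name in "$@"; do
    eval "value=\${$name:-}"   # indirect lookup of the variable named by $name
    if [ -z "$value" ]; then
      missing="${missing}${missing:+ }${name}"
    fi
  done
  printf '%s\n' "$missing"
}

# Example: check variables listed above for PostgreSQL, MinIO, and Playwright.
# missing_env POSTGRES_USER POSTGRES_PASSWORD POSTGRES_DB \
#   MINIO_ROOT_USER MINIO_ROOT_PASSWORD AUTH_API_KEY
```

An empty line of output means every listed variable has a value; otherwise the output names the ones that still need to be set.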
Service Interaction
The services interact as follows:
- User Flow:
  - Users access the application through Nginx (port 80)
  - Nginx routes requests to Frontend or App based on the URL path
  - API requests are sent to the App
  - Static assets are served by Frontend
- Crawl Job Flow:
  - App receives crawl/search requests from users
  - App enqueues jobs to Celery via Redis
  - Celery processes jobs using Scrapy or Playwright as needed
  - Results are stored in PostgreSQL and file assets in MinIO
  - Users can monitor job status through the Frontend
- Storage Flow:
  - Media files are stored in MinIO
  - MinIO provides an S3-compatible API for file operations
  - Nginx proxies MinIO requests for simplified access
Scaling Considerations
When scaling the application, consider:
- Celery Workers: Add more workers for increased crawl job throughput
- PostgreSQL: Consider using a managed database service for production
- Redis: May need to scale for high-volume queues
- Storage: MinIO can be configured for cluster mode or replaced with S3
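For the Celery worker point, Docker Compose can run several replicas of a single service through the --scale flag of docker compose up. A minimal sketch follows. The service name celery is an assumption (check the actual name in your compose file), and scale_service with its DRY_RUN switch is a hypothetical wrapper added only so the command can be previewed before it is run.

```shell
#!/bin/sh
# Hypothetical wrapper around "docker compose up --scale".
# Set DRY_RUN=1 to print the command instead of executing it.
scale_service() {
  service="$1"    # compose service name (assumed here, e.g. "celery")
  replicas="$2"   # desired number of containers for that service
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "docker compose up -d --no-deps --scale ${service}=${replicas} ${service}"
  else
    docker compose up -d --no-deps --scale "${service}=${replicas}" "${service}"
  fi
}

# Example: preview scaling the worker service to 3 replicas.
# DRY_RUN=1 scale_service celery 3
```

Only the worker should be scaled this way. Running more than one Celery Beat scheduler for the same schedule generally leads to duplicate periodic tasks, so that service is best kept at a single replica.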
Monitoring
Monitor your services using:
# Check service status
docker compose ps
# View logs for all services
docker compose logs
# View logs for a specific service
docker compose logs app
# Follow logs in real-time
docker compose logs -f
# View resource usage
docker stats