
API Overview

Welcome to the WaterCrawl API documentation. This guide will help you understand and integrate with our API.

Client Libraries

Install the Python client from PyPI:

pip install watercrawl-py

Then import the client in your code:

from watercrawl import WaterCrawlAPIClient

Authentication

All API requests require authentication using your API key, sent in the Authorization header. The Python client sets this header for you:

from watercrawl import WaterCrawlAPIClient

# The client handles authentication automatically
client = WaterCrawlAPIClient('your_api_key')

Status Values

Crawl requests can have the following status values:

  • new: Crawl request created but not started
  • running: Crawl is in progress
  • finished: Crawl completed successfully
  • cancelling: Crawl is being cancelled
  • canceled: Crawl was cancelled
  • failed: Crawl failed due to an error
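When polling a crawl request, the descriptions above imply that three statuses are final (finished, canceled, failed) while the rest mean the crawl is still in flight. A small helper like the following can decide when to stop polling; the grouping itself is derived from the list above, not from a separate API field:

```python
# Status values reported for crawl requests, grouped by whether the
# crawl has reached a final state (see the list above).
TERMINAL_STATUSES = {"finished", "canceled", "failed"}
ACTIVE_STATUSES = {"new", "running", "cancelling"}

def is_terminal(status: str) -> bool:
    """Return True when a crawl request has reached a final state."""
    return status in TERMINAL_STATUSES

print(is_terminal("running"))   # False: keep polling
print(is_terminal("finished"))  # True: stop polling and fetch results
```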

API Endpoints

  1. Scrape URL: Scrape a single URL
  2. Create Crawl Request: Start a new crawl
  3. List Crawl Requests: Get all crawls
  4. Get Crawl Request: Get crawl details
  5. Cancel Crawl Request: Stop a crawl
  6. Monitor Crawl Status: Track progress
  7. List Crawl Results: Get results

Best Practices

  1. Rate Limiting: Implement appropriate rate limiting in your client applications to avoid overwhelming the target websites.
  2. Resource Management: Use the page_limit and max_depth options to control the scope of your crawls.
  3. Error Handling: Always check the status of your crawl requests and implement proper error handling.
  4. Content Extraction: Use exclude_tags and include_tags to precisely target the content you need.
  5. Domain Restrictions: Use allowed_domains to prevent the crawler from accessing unintended domains.
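The practices above can be combined in one place when configuring a crawl. The option names in the dict below (page_limit, max_depth, include_tags, exclude_tags, allowed_domains) come from the list above, but the exact request shape that groups them is an assumption for illustration; the rate limiter is a minimal client-side sketch, not part of the API:

```python
import time

# Hypothetical grouping of the crawl options named in the practices
# above; the actual request shape accepted by the API may differ.
crawl_options = {
    "spider_options": {
        "page_limit": 100,                    # cap total pages crawled
        "max_depth": 2,                       # cap link-following depth
        "allowed_domains": ["example.com"],   # stay on intended domains
    },
    "page_options": {
        "include_tags": ["article", "main"],  # target wanted content
        "exclude_tags": ["nav", "footer"],    # drop chrome and menus
    },
}

class RateLimiter:
    """Minimal client-side limiter: at most `rate` calls per second."""
    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate
        self.last_call = 0.0

    def wait(self):
        """Sleep just long enough to respect the configured rate."""
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()

limiter = RateLimiter(rate=5)  # allow at most 5 requests per second
limiter.wait()                 # call before each API request
```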