Create Crawl Request

Start a new crawl request.

Endpoint: POST /api/v1/core/crawl-requests/

Request Examples

from watercrawl import WaterCrawlAPIClient

# Initialize client
client = WaterCrawlAPIClient('your_api_key')

# Start a new crawl
crawl = client.create_crawl_request(
    url="https://example.com",
    spider_options={
        "max_depth": 2,
        "page_limit": 100,
        "allowed_domains": ["example.com"],
        "exclude_paths": ["/private/*"],
        "include_paths": ["/blog/*"]
    },
    page_options={
        "exclude_tags": ["nav", "footer", "aside"],
        "include_tags": ["article", "main"],
        "wait_time": 100,
        "include_html": False,
        "only_main_content": True,
        "include_links": False
    }
)

print(f"Crawl started with ID: {crawl['uuid']}")
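For readers who are not using the Python client, the same call can be sketched as a plain HTTP request. This is only an illustration: the host, the API-key header name, and the shape of the JSON body (mirrored from the `options` object in the response example on this page) are assumptions, not confirmed details of the API. The request is built but not sent.

```python
import json
import urllib.request

# Assumptions (not stated on this page): the host and the API-key header name.
# Replace both with the values for your deployment.
BASE_URL = "https://watercrawl.example"  # hypothetical host
API_KEY_HEADER = "X-API-Key"             # hypothetical header name


def build_crawl_request(api_key, url, spider_options=None, page_options=None):
    """Build (but do not send) a POST request for the crawl-requests endpoint."""
    # Assumed body shape: it mirrors the "options" object of the response example.
    payload = {
        "url": url,
        "options": {
            "spider_options": spider_options or {},
            "page_options": page_options or {},
        },
    }
    return urllib.request.Request(
        BASE_URL + "/api/v1/core/crawl-requests/",
        data=json.dumps(payload).encode("utf-8"),
        headers={API_KEY_HEADER: api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_crawl_request(
    "your_api_key",
    "https://example.com",
    spider_options={"max_depth": 2, "page_limit": 100},
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it once the host and header name are replaced with real values.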

Options

Spider Options

| Option | Type | Description |
| --- | --- | --- |
| max_depth | integer | Maximum depth to crawl (default: 1) |
| page_limit | integer | Maximum number of pages to crawl (default: 1) |
| allowed_domains | array | List of domains to crawl (supports star patterns such as *.example.com) |
| exclude_paths | array | URL patterns to exclude (supports star patterns such as blog/*) |
| include_paths | array | URL patterns to include (supports star patterns such as blog/*) |
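To get a feel for how star patterns select URLs, the sketch below matches paths locally with Python's `fnmatch` module. It is an approximation for experimenting with patterns, not the crawler's actual matching logic; in particular, the rule that an exclude match takes precedence over an include match is an assumption, and `path_allowed` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def path_allowed(path, include_paths, exclude_paths):
    """Local illustration of star-pattern filtering (not the service's code).

    Assumed precedence: a matching exclude pattern always rejects the path;
    otherwise, if include patterns are given, at least one must match.
    """
    if any(fnmatchcase(path, pattern) for pattern in exclude_paths):
        return False
    if include_paths:
        return any(fnmatchcase(path, pattern) for pattern in include_paths)
    return True


include = ["/blog/*"]
exclude = ["/private/*"]

for candidate in ["/blog/post-1", "/private/report", "/about"]:
    print(candidate, path_allowed(candidate, include, exclude))

# Domain patterns can be explored the same way.
print(fnmatchcase("docs.example.com", "*.example.com"))
```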

Page Options

| Option | Type | Description |
| --- | --- | --- |
| exclude_tags | array | HTML tags to exclude from content |
| include_tags | array | HTML tags to include in content |
| wait_time | integer | Time to wait before extracting content |
| include_html | boolean | Include HTML in the extracted content |
| only_main_content | boolean | Extract only the main content |
| include_links | boolean | Include links in the extracted content |
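Because every option has a documented type, a small client-side check can catch mistakes before a request is sent. The sketch below encodes the types from the option tables on this page; `check_options` is a hypothetical helper, not part of the `watercrawl` package, and the API may apply different or additional validation.

```python
# Types as listed in the Spider Options and Page Options tables.
SPIDER_OPTION_TYPES = {
    "max_depth": int,
    "page_limit": int,
    "allowed_domains": list,
    "exclude_paths": list,
    "include_paths": list,
}
PAGE_OPTION_TYPES = {
    "exclude_tags": list,
    "include_tags": list,
    "wait_time": int,
    "include_html": bool,
    "only_main_content": bool,
    "include_links": bool,
}


def check_options(options, expected_types):
    """Return a list of human-readable problems; an empty list means no issues."""
    problems = []
    for name, value in options.items():
        expected = expected_types.get(name)
        if expected is None:
            problems.append(f"unknown option: {name}")
        # bool is a subclass of int in Python, so reject it for integer options.
        elif not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        ):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


print(check_options({"max_depth": 2, "page_limit": "100"}, SPIDER_OPTION_TYPES))
print(check_options({"wait_time": 100, "include_html": False}, PAGE_OPTION_TYPES))
```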

Response Examples

{
    'uuid': '123e4567-e89b-12d3-a456-426614174000',
    'url': 'https://example.com',
    'status': 'new',
    'options': {
        'spider_options': {
            'max_depth': 2,
            'page_limit': 100,
            'allowed_domains': ['example.com'],
            'exclude_paths': ['/private/*'],
            'include_paths': ['/blog/*']
        },
        'page_options': {
            'exclude_tags': ['nav', 'footer', 'aside'],
            'include_tags': ['article', 'main'],
            'wait_time': 100,
            'include_html': False,
            'only_main_content': True,
            'include_links': False
        }
    },
    'created_at': '2024-01-01T00:00:00Z',
    'updated_at': '2024-01-01T00:00:00Z',
    'number_of_documents': '0'
}
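The response is a plain dictionary, so its fields can be read directly. The sketch below works on an abbreviated copy of the example response and converts the timestamp and document count into native Python types; `summarize_crawl` is a hypothetical helper. The example shows `number_of_documents` as a string, so the code converts it with `int()`, which also works if the value arrives as an integer.

```python
from datetime import datetime

# Abbreviated copy of the example response on this page.
crawl = {
    "uuid": "123e4567-e89b-12d3-a456-426614174000",
    "url": "https://example.com",
    "status": "new",
    "created_at": "2024-01-01T00:00:00Z",
    "updated_at": "2024-01-01T00:00:00Z",
    "number_of_documents": "0",
}


def summarize_crawl(response):
    """Pick out the commonly used fields and convert them to native types."""
    # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
    created = datetime.fromisoformat(response["created_at"].replace("Z", "+00:00"))
    return {
        "uuid": response["uuid"],
        "status": response["status"],
        "created_at": created,
        "documents": int(response["number_of_documents"]),
    }


summary = summarize_crawl(crawl)
print(summary["uuid"], summary["status"], summary["documents"])
```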