Retrieve profiles websites

Retrieve Profiles & Websites DAG¶

The retrieve_profiles_websites DAG is responsible for retrieving profiles and their associated websites from a Webserver API. It processes the retrieved data by either inserting new records or updating existing ones in the Crawlserver MongoDB database. Additionally, it creates jobs for both profiles and websites based on available tasks.

Parameters¶

The DAG accepts two parameters: - school_identifier (int): School identifier for a profile. - group_school_identifier (int): Group school identifier for a profile.

alt text

Tasks¶

The DAG consists of 4 PythonOperator tasks:

alt text

1. `validate_params`¶

Validates input parameters using validate_input_params_profile. This task ensures that either school_identifier or group_school_identifier is provided, but not both, before any processing begins.

2. `get_auth_token`¶

Retrieves an authentication token from the Webserver. This token is stored in XCom so that it can be used by subsequent tasks to authenticate API calls.

3. `process_profiles_and_jobs`¶

Calls the API to fetch profiles data.
Processes each profile by either updating an existing record or inserting a new one into the childs_profile collection.
For newly inserted profiles, it creates associated jobs (e.g., for profile crawling) based on available tasks from the core_tasks collection.

4. `process_pages_and_jobs`¶

Retrieves website (page) data from the API (using data pushed by the profiles task).
Inserts or updates website records in the website collection.
For new pwebsites, creates jobs (e.g., for publication crawling) based on available tasks.

Cleaning XCom¶

Once the DAG execution is successfully completed, all XCom entries related to the DAG run are cleaned up to maintain a tidy state.

Retrieve profiles websites