BaseSocialOperator¶
Overview¶
The BaseSocialOperator is a foundational Airflow operator designed for crawling and processing social media data.
It abstracts common tasks such as driver setup, job execution management, and cleanup logic.
This operator standardizes workflows across platforms and ensures that authentication, job state updates, and error handling are consistently managed.
It works in conjunction with the BaseDAGSocialMediaProfiles DAG, which handles DAG-level operations such as context logging and XCom cleanup.
Workflow¶
The BaseSocialOperator typically follows this process:
- Extract parameters and context from Airflow's execution context.
- Initialize the web driver (e.g., Selenium) and authentication credentials.
- Update job state to "running".
- Execute job-specific crawling (via the crawl(job) method).
- Update job state to "pending" if successful, or "failed" upon error.
- Retrieve and update cookies if authentication was used.
- Close the driver session and perform cleanup.
- Finalize the task by cleaning up DAG mappings and writing logs.
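The job state transitions in the steps above can be sketched in plain Python. This is a minimal illustration, not the operator's actual code: the states dictionary is a hypothetical in-memory stand-in for wherever job state is persisted, jobs are assumed to be dicts with an "id" key, and the sketch assumes a failed job does not abort the remaining jobs.

```python
from typing import Callable, Dict, List


def process_jobs(
    jobs: List[dict],
    crawl: Callable[[dict], None],
    states: Dict[str, str],
) -> None:
    """Run each job and record its state transition.

    `states` is a hypothetical in-memory stand-in for the real job store.
    """
    for job in jobs:
        job_id = job["id"]  # assumed job shape: a dict with an "id" key
        states[job_id] = "running"
        try:
            crawl(job)
        except Exception as exc:
            # Assumption: a failing job is recorded and the loop continues.
            print(f"job {job_id} failed: {exc}")
            states[job_id] = "failed"
        else:
            # "pending" marks the job as eligible for a later re-run.
            states[job_id] = "pending"


def demo_crawl(job: dict) -> None:
    """Illustrative crawl function that fails for jobs flagged as broken."""
    if job.get("broken"):
        raise RuntimeError("simulated crawl error")


job_states: Dict[str, str] = {}
process_jobs([{"id": "a"}, {"id": "b", "broken": True}], demo_crawl, job_states)
```

In a real subclass, crawl would be the method implemented by the platform-specific operator rather than a function passed in as an argument.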
Key Components¶
1. Parameter Parsing¶
- Parameters are accessed through context["params"] and may include:
    - jobs: The list of crawl jobs to process.
    - media_id, used_tool, platform, etc.
- The operator uses this metadata to determine the logic for crawling, authentication, and job processing.
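As a rough sketch of the parsing step, the helper below splits context["params"] into the job list and the remaining metadata. The helper name parse_params, the example values, and the shape of each job are assumptions made for illustration. The real operator may read these keys differently.

```python
from typing import Any, Dict, List, Tuple


def parse_params(context: Dict[str, Any]) -> Tuple[List[dict], Dict[str, Any]]:
    """Split context["params"] into the job list and the remaining metadata."""
    params = dict(context.get("params") or {})
    jobs = params.pop("jobs", [])
    if not isinstance(jobs, list):
        raise ValueError("params['jobs'] must be a list of crawl jobs")
    return jobs, params


# Stand-in for the Airflow context; all values are illustrative only.
example_context = {
    "params": {
        "jobs": [{"id": "job-1"}],
        "media_id": 42,
        "used_tool": "selenium",
        "platform": "example",
    }
}
jobs, metadata = parse_params(example_context)
```

Validating the type of jobs up front means a malformed trigger fails before any driver or account is allocated.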
2. Driver Initialization¶
The _initialize_driver() method handles:
- Choosing between API-based or Selenium-based crawling.
- Authenticating via the DAuthenticatorHook, which retrieves account details, cookies, and credentials.
- Creating a driver session via DriverManage if Selenium is required.

If any step fails, a DriverInitException is raised to prevent faulty execution.
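The pattern of wrapping every setup failure in a single exception type might look like the sketch below. The interfaces of DAuthenticatorHook and DriverManage are not shown in this document, so they are replaced here by hypothetical callables (authenticate and create_driver). The "selenium" value for used_tool and the locally defined DriverInitException are likewise assumptions.

```python
class DriverInitException(Exception):
    """Local stand-in for the exception named above."""


def initialize_driver(used_tool, authenticate, create_driver):
    """Return (account, driver); driver is None for API-based crawling.

    `authenticate` and `create_driver` are hypothetical callables standing
    in for DAuthenticatorHook and DriverManage.
    """
    try:
        account = authenticate()
        # Assumption: used_tool == "selenium" means a browser session is needed.
        driver = create_driver(account) if used_tool == "selenium" else None
    except Exception as exc:
        # Surface every setup failure as one exception type.
        raise DriverInitException(f"driver setup failed: {exc}") from exc
    return account, driver


# API-based crawling: authentication runs, but no driver is created.
account, driver = initialize_driver(
    "api", lambda: {"user": "demo"}, lambda acc: object()
)
```

Chaining with "from exc" keeps the original traceback, so the task log still shows whether authentication or driver creation failed.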
3. Job Processing¶
The _process_jobs() method is responsible for:
- Iterating through each job.
- Setting job state to "running" before execution.
- Calling the crawl(job) method (to be implemented in subclasses).
- On success:
    - Sets job state to "pending" for future re-execution.
- On failure:
    - Logs the error.
    - Marks job state as "failed" for monitoring.
4. Finalization and Cleanup¶
After jobs are processed:
- If login was performed:
    - Cookies are retrieved and optionally updated.
- The _close() method is called to terminate the driver session safely.
- The finish_task() method:
    - Updates stored cookies (if needed).
    - Deletes DAG run mappings from the authentication system.
    - Writes final logs.
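One way to make sure the cleanup steps above run even when job processing raises is nested try/finally blocks. The sketch below is an assumption about how such ordering could be enforced, not a description of the operator's actual control flow. The three callables are hypothetical stand-ins for _process_jobs(), _close(), and finish_task().

```python
from typing import Callable, List


def run_with_cleanup(
    process_jobs: Callable[[], None],
    close: Callable[[], None],
    finish_task: Callable[[], None],
) -> None:
    """Run the jobs, then always close the driver and finalize the task."""
    try:
        process_jobs()
    finally:
        try:
            close()
        finally:
            # Runs even if close() itself raises.
            finish_task()


calls: List[str] = []


def failing_jobs() -> None:
    calls.append("process_jobs")
    raise RuntimeError("simulated failure")


try:
    run_with_cleanup(
        failing_jobs,
        lambda: calls.append("close"),
        lambda: calls.append("finish_task"),
    )
except RuntimeError:
    # The original error still reaches the caller after cleanup.
    calls.append("error propagated")
```

With this ordering the driver session is released before the DAG run mappings are deleted, and the original exception is still what marks the task as failed.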
Integration with Base DAG¶
The BaseDAGSocialMediaProfiles class provides DAG-level utilities for social crawling pipelines:
print_context(ds=None, **kwargs)¶
- Logs all parameters passed into the DAG.
- Pushes account information to XCom using the DAG ID as the key.
- Helps with debugging and tracing.
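A callable with this signature could be sketched as follows. The "ti" and "dag" entries are part of the standard Airflow task context, and xcom_push(key=..., value=...) is a real TaskInstance method. The FakeTaskInstance class, the SimpleNamespace DAG, and the assumption that account data lives under params["account"] exist only so the sketch runs outside Airflow.

```python
from pprint import pformat
from types import SimpleNamespace


class FakeTaskInstance:
    """Minimal stand-in for Airflow's TaskInstance, for running outside Airflow."""

    def __init__(self):
        self.pushed = {}

    def xcom_push(self, key, value):
        self.pushed[key] = value


def print_context(ds=None, **kwargs):
    """Log the DAG parameters and push account info to XCom under the DAG ID."""
    params = kwargs.get("params", {})
    print(pformat(params))
    # Assumption: account data is stored under params["account"].
    kwargs["ti"].xcom_push(key=kwargs["dag"].dag_id, value=params.get("account"))


ti = FakeTaskInstance()
print_context(
    ds="2024-01-01",
    ti=ti,
    dag=SimpleNamespace(dag_id="demo_dag"),
    params={"account": {"user": "demo"}, "platform": "example"},
)
```

Using the DAG ID as the XCom key lets downstream tasks in the same DAG look up the account without sharing a separate key name.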
cleanup_xcom(context)¶
- Cleans up old XCom entries associated with the DAG.
- Ensures clean runs for each DAG execution cycle.