Crawl social media profiles

The Driver plugin is responsible for managing drivers and establishing the crawl process for social media platforms.

Directory Structure

.
├── driver_plugin
│   ├── manage
│   │   └── driver_manage.py
│   │
│   ├── operators
│   │   └── social_media_profiles
│   │       ├── newcomment.py
│   │       ├── friend.py
│   │       ├── profile.py
│   │       ├── profile_replies.py
│   │       ├── publication.py
│   │       └── stories.py

Operators

Operators represent tasks in a DAG, encapsulating the logic for each step of the workflow.

1. `ProfileOperator`

Responsible for collecting and updating profile information.

Workflow:

Check Pseudo Existence:
- Verifies that the profile exists using the driver's `pseudo_exists` method.
- Updates the API quota if using an API-based tool.
- If the profile exists:
  - Checks whether it is private and updates the profile document accordingly.
  - Toggles the associated jobs (enables or disables them) based on existence.
- If the profile doesn't exist:
  - Deletes the related jobs.

Crawl Profile:
- Retrieves user info using the driver's `get_user_info` method.
- Updates the profile in MongoDB with details such as description, name, followers, and followings.
- Prepares media (cover and profile pictures) for the profile.
- Updates the API quota if using an API-based tool.

Error Handling:
- Raises `ProfileCrawlException` or `ToggleProfileJobError` if an error occurs during crawling or job toggling.
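
The two phases above can be sketched in plain Python. This is only an illustration under stated assumptions: `StubDriver`, `run_profile_job`, the dict-based profile document, the `is_private` and `enabled` keys, and the return shapes of `pseudo_exists` and `get_user_info` are hypothetical; only the method and exception names come from the description above.

```python
class ProfileCrawlException(Exception):
    """Crawl failure; the name comes from the text, the body is assumed."""


class StubDriver:
    """Hypothetical driver stand-in backed by a dict of known profiles."""

    def __init__(self, profiles):
        self.profiles = profiles

    def pseudo_exists(self, pseudo):
        return pseudo in self.profiles

    def get_user_info(self, pseudo):
        return self.profiles[pseudo]


def run_profile_job(driver, pseudo, profile_doc, jobs):
    """Check existence, adjust jobs, then copy crawled fields into profile_doc."""
    if not driver.pseudo_exists(pseudo):
        jobs.clear()  # profile is gone: delete its related jobs
        return False

    for job in jobs:
        job["enabled"] = True  # profile exists: keep its jobs enabled

    try:
        info = driver.get_user_info(pseudo)
    except Exception as exc:
        raise ProfileCrawlException(f"could not crawl {pseudo}") from exc

    # Assumed key: the real driver may expose privacy differently.
    profile_doc["is_private"] = bool(info.get("is_private", False))
    for field in ("description", "name", "followers", "followings"):
        if field in info:
            profile_doc[field] = info[field]
    return True
```

A real operator would write to MongoDB and update the API quota at the points where this sketch mutates in-memory structures.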

2. `FriendOperator`

Collects followers and followings for a given profile.

Workflow:

- Extracts job parameters such as `pseudo`, `profile_id`, and the crawl limits (`nb_followers`, `nb_followings`).
- Uses the driver's `get_followers_followings` method to retrieve followers and followings.
- For each follower/following:
  - Inserts the friend into MongoDB using `FriendHook`.
  - Prepares media (profile picture) for the friend.
- Logs the number of followers and followings crawled.
- Raises `FriendCrawlException` if an error occurs during crawling.
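
A minimal sketch of this loop follows. The signature and return shape of `get_followers_followings` (two lists of dicts), the `InMemoryFriendHook` class, its `insert` method, and the `relation` field are assumptions made for illustration; the real `FriendHook` writes to MongoDB.

```python
class FriendCrawlException(Exception):
    """Crawl failure; the name comes from the text, the body is assumed."""


class InMemoryFriendHook:
    """Hypothetical stand-in for FriendHook that keeps documents in a list."""

    def __init__(self):
        self.docs = []

    def insert(self, doc):
        self.docs.append(doc)


def crawl_friends(driver, hook, pseudo, profile_id, nb_followers, nb_followings):
    """Fetch followers and followings, store each one, and return both counts."""
    try:
        followers, followings = driver.get_followers_followings(
            pseudo, nb_followers, nb_followings
        )
    except Exception as exc:
        raise FriendCrawlException(f"friend crawl failed for {pseudo}") from exc

    for relation, friends in (("follower", followers), ("following", followings)):
        for friend in friends:
            # Tag each friend with the owning profile and the relation type.
            hook.insert({**friend, "profile_id": profile_id, "relation": relation})
    return len(followers), len(followings)
```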

3. `PublicationOperator`

Collects posts and associated comments for a given profile.

Workflow:

- Extracts job parameters such as `pseudo`, `profile_id`, `last_date`, and the crawl limits (`nb_publications`, `nb_comments`).
- Retrieves publications and comments using the driver's `get_publications_from_search` method.
- For each publication:
  - Saves the publication to MongoDB using `PublicationHook`.
  - Prepares media (images and videos) for the publication.
  - Processes the associated comments using `CommentHook`, saving them to MongoDB and preparing their media (author picture, comment images).
  - Prepares a `new_comment_job` for recent publications.
- Updates the API quota if using an API-based tool.
- Enforces a time limit of 16,000 seconds (roughly 4.4 hours) to prevent excessive runtime.
- Logs the number of publications crawled.
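
The time limit can be illustrated with a simple elapsed-time check inside the per-publication loop. This is a sketch, not the operator's actual code: the function name, the `save` callback, and the choice to stop quietly rather than raise when the budget runs out are assumptions; only the 16,000-second figure comes from the description above.

```python
import time

TIME_LIMIT_SECONDS = 16_000  # value stated in the workflow description


def save_within_time_limit(publications, save, limit=TIME_LIMIT_SECONDS,
                           clock=time.monotonic):
    """Save publications one by one until the time budget is exhausted."""
    start = clock()
    saved = 0
    for publication in publications:
        # Check the budget before each item so one slow item cannot be skipped over.
        if clock() - start > limit:
            break
        save(publication)
        saved += 1
    return saved
```

Passing the clock as a parameter keeps the loop testable without waiting; `time.monotonic` is used as the default because it is not affected by system clock adjustments.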

4. `StoriesOperator`

Collects stories for a given profile.

Workflow:

- Extracts job parameters such as `pseudo` and `profile_id`.
- Retrieves stories using the driver's `get_stories` method.
- For each story:
  - Marks the story with the profile's pseudo as the author.
  - Saves the story to MongoDB using `PublicationHook`.
  - Prepares media (images and videos) for the story.
- Logs errors if crawling fails.
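
As a rough sketch of the author tagging and error logging, assuming `get_stories` takes the pseudo and returns an iterable of dicts. The `save` callback, the `author` and `profile_id` keys, and the choice to log and return zero instead of re-raising are illustrative assumptions; the description only states that errors are logged.

```python
import logging

logger = logging.getLogger("stories_sketch")


def crawl_stories(driver, save, pseudo, profile_id):
    """Fetch stories, mark the profile as their author, and save each one."""
    try:
        stories = list(driver.get_stories(pseudo))
    except Exception:
        # Log the failure with its traceback; re-raising is left out by assumption.
        logger.exception("story crawl failed for %s", pseudo)
        return 0

    for story in stories:
        save({**story, "author": pseudo, "profile_id": profile_id})
    return len(stories)
```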

5. `CommentOperator`

Collects new comments for a given post.

Workflow:

- Extracts job parameters such as `pseudo`, `profile_id`, `publication_id`, `publication_url`, `last_date`, and `crawl_limit`.
- Retrieves comments using the driver's `get_comments_from_post` method, filtering by the `since` timestamp.
- For each comment:
  - Saves the comment to MongoDB using `CommentHook`.
  - Prepares media (author picture, comment images) for the comment.
- Updates the API quota if using an API-based tool.
- Logs the number of comments crawled.
- Raises an exception if crawling fails.
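
The effect of the `since` filter combined with `crawl_limit` can be sketched as below. In the real operator the filtering may well happen inside the driver; the function name, the `date` key, the exclusive comparison, and the newest-first ordering are assumptions made for illustration.

```python
from datetime import datetime, timezone


def select_new_comments(comments, since, crawl_limit):
    """Keep comments strictly newer than `since`, newest first, up to the limit."""
    fresh = [comment for comment in comments if comment["date"] > since]
    fresh.sort(key=lambda comment: comment["date"], reverse=True)
    return fresh[:crawl_limit]


# Example inputs: only comments posted after 2 January 2024 qualify.
since = datetime(2024, 1, 2, tzinfo=timezone.utc)
comments = [
    {"id": 1, "date": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "date": datetime(2024, 1, 3, tzinfo=timezone.utc)},
    {"id": 3, "date": datetime(2024, 1, 4, tzinfo=timezone.utc)},
]
new_comments = select_new_comments(comments, since, crawl_limit=10)
```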

6. `ProfileRepliesOperator`

Collects comments made by a profile on various publications.

Workflow:

- Extracts job parameters such as `pseudo`, `profile_id`, `last_date`, and `crawl_limit`.
- Retrieves the profile's comments using the driver's `get_profiles_replies` method, filtering by the `since` timestamp.
- For each comment:
  - Associates the comment with the publication URL and marks the author as the profile's pseudo.
  - Saves the comment to MongoDB in the `profile_reply` collection, skipping duplicates.
  - Prepares media (images and videos) for the comment.
- Updates the API quota if using an API-based tool.
- Enforces a time limit of 16,000 seconds (roughly 4.4 hours) to prevent excessive runtime.
- Logs the number of publications crawled.
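
Duplicate skipping can be sketched with an in-memory list standing in for the `profile_reply` collection. The `comment_id` key used to detect duplicates, the `author` field name, and the function name are assumptions; the description does not say which field identifies a reply, and a MongoDB-backed version would more likely rely on a unique index or an upsert.

```python
def save_replies(replies, collection, pseudo, key="comment_id"):
    """Append replies not yet in `collection` and return how many were added."""
    seen = {doc[key] for doc in collection}
    inserted = 0
    for reply in replies:
        if reply[key] in seen:
            continue  # already stored: skip the duplicate
        collection.append({**reply, "author": pseudo})
        seen.add(reply[key])
        inserted += 1
    return inserted
```

Tracking identifiers in a set keeps each duplicate check at constant time, and it also catches duplicates that appear within the same batch.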
