Crawl social media profiles
Social Media Driver plugin
The Driver plugin is responsible for managing drivers and establishing the crawl process for social media platforms.
Directory Structure
.
├── driver_plugin
│   ├── manage
│   │   └── driver_manage.py
│   ├── operators
│   │   └── social_media_profiles
│   │       ├── newcomment.py
│   │       ├── friend.py
│   │       ├── profile.py
│   │       ├── profile_replies.py
│   │       ├── publication.py
│   │       └── stories.py
Operators
Operators represent tasks in a DAG, encapsulating the logic for each step of the workflow.
1. ProfileOperator
Responsible for collecting and updating profile information.
Workflow:
- Check Pseudo Existence:
    - Verifies that the profile exists using the driver's `pseudo_exists` method.
    - Updates API quota if using an API-based tool.
    - If the profile exists:
        - Checks if it's private and updates the profile document accordingly.
        - Toggles associated jobs (enables/disables) based on existence.
    - If the profile doesn't exist:
        - Deletes related jobs.
- Crawl Profile:
    - Retrieves user info using the driver's `get_user_info` method.
    - Updates the profile in MongoDB with details like description, name, followers, and followings.
    - Prepares media (cover and profile pictures) for the profile.
    - Updates API quota if using an API-based tool.
- Error Handling:
    - Raises `ProfileCrawlException` or `ToggleProfileJobError` if errors occur during crawling or job toggling.
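The existence check and crawl steps above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: `FakeDriver`, `run_profile_job`, and the plain dictionaries standing in for the MongoDB profile and job collections are all hypothetical, and the real `pseudo_exists` and `get_user_info` signatures may differ.

```python
from dataclasses import dataclass, field


class ProfileCrawlException(Exception):
    """Raised when crawling a profile fails."""


@dataclass
class FakeDriver:
    """Hypothetical stand-in for a platform driver."""
    known: dict = field(default_factory=dict)

    def pseudo_exists(self, pseudo: str) -> bool:
        return pseudo in self.known

    def get_user_info(self, pseudo: str) -> dict:
        return self.known[pseudo]


def run_profile_job(driver, pseudo, profiles, jobs):
    """Check existence, then crawl. `profiles` and `jobs` are plain dicts
    used here in place of MongoDB collections (an assumption of this sketch)."""
    if not driver.pseudo_exists(pseudo):
        jobs.pop(pseudo, None)  # profile is gone: delete its related jobs
        return False
    try:
        info = driver.get_user_info(pseudo)
    except Exception as exc:
        raise ProfileCrawlException(f"could not crawl {pseudo}") from exc
    profiles[pseudo] = {
        "name": info.get("name"),
        "description": info.get("description"),
        "followers": info.get("followers"),
        "followings": info.get("followings"),
        "is_private": info.get("is_private", False),
    }
    jobs[pseudo] = {"enabled": True}  # profile exists: keep its jobs enabled
    return True
```

Calling `run_profile_job` with a pseudo the driver does not know removes its entry from `jobs` and returns `False`; a known pseudo is written to `profiles` and its jobs stay enabled.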
2. FriendOperator
Collects followers and followings for a given profile.
Workflow:
- Extracts job parameters like `pseudo`, `profile_id`, and crawl limits (`nb_followers`, `nb_followings`).
- Uses the driver's `get_followers_followings` method to retrieve followers and followings.
- For each follower/following:
    - Inserts the friend into MongoDB using `FriendHook`.
    - Prepares media (profile picture) for the friend.
- Logs the number of followers and followings crawled.
- Raises `FriendCrawlException` if an error occurs during crawling.
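As a rough illustration of this workflow, the sketch below uses a hypothetical `StubDriver` and a plain list in place of `FriendHook`. The argument order and the two-list return shape of `get_followers_followings` are assumptions made for the example, not the documented driver interface.

```python
import logging

logger = logging.getLogger("friend_operator_sketch")


class FriendCrawlException(Exception):
    """Raised when collecting followers or followings fails."""


class StubDriver:
    """Hypothetical driver; the real signature and return shape may differ."""

    def get_followers_followings(self, pseudo, nb_followers, nb_followings):
        followers = [{"pseudo": f"follower_{i}"} for i in range(nb_followers)]
        followings = [{"pseudo": f"following_{i}"} for i in range(nb_followings)]
        return followers, followings


def crawl_friends(driver, params, store):
    """Collect friends for one profile; `store` is a list standing in for FriendHook."""
    try:
        followers, followings = driver.get_followers_followings(
            params["pseudo"], params["nb_followers"], params["nb_followings"]
        )
    except Exception as exc:
        raise FriendCrawlException(
            f"friend crawl failed for {params['pseudo']}"
        ) from exc

    # Tag each friend with the owning profile and the kind of relation.
    for relation, friends in (("follower", followers), ("following", followings)):
        for friend in friends:
            store.append(
                {**friend, "profile_id": params["profile_id"], "relation": relation}
            )

    logger.info(
        "crawled %d followers and %d followings", len(followers), len(followings)
    )
    return len(followers), len(followings)
```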
3. PublicationOperator
Collects posts and associated comments for a given profile.
Workflow:
- Extracts job parameters like `pseudo`, `profile_id`, `last_date`, and crawl limits (`nb_publications`, `nb_comments`).
- Retrieves publications and comments using the driver's `get_publications_from_search` method.
- For each publication:
    - Saves the publication to MongoDB using `PublicationHook`.
    - Prepares media (images and videos) for the publication.
    - Processes associated comments using `CommentHook`, saving them to MongoDB and preparing media (author picture, comment images).
    - Prepares a `new_comment_job` for recent publications.
- Updates API quota if using an API-based tool.
- Enforces a time limit (16,000 seconds) to prevent excessive runtime.
- Logs the number of publications crawled.
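One way to enforce a runtime budget like the 16,000-second limit is to check elapsed time before handling each item. The sketch below is an assumption about how such a guard could look; `process_with_time_limit` and its `clock` parameter are hypothetical names, and the actual operator may stop differently (for example by raising instead of returning early).

```python
import time

TIME_LIMIT_SECONDS = 16_000  # value quoted in the workflow above


def process_with_time_limit(publications, handle, limit=TIME_LIMIT_SECONDS,
                            clock=time.monotonic):
    """Handle publications in order and stop once `limit` seconds have elapsed.

    `clock` is injectable so the guard can be exercised without waiting.
    """
    started = clock()
    processed = 0
    for publication in publications:
        if clock() - started > limit:
            break  # budget exhausted: leave the remainder for a later run
        handle(publication)
        processed += 1
    return processed
```

A monotonic clock is used rather than wall-clock time so that system clock adjustments during a long crawl do not shorten or extend the budget.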
4. StoriesOperator
Collects stories for a given profile.
Workflow:
- Extracts job parameters like `pseudo` and `profile_id`.
- Retrieves stories using the driver's `get_stories` method.
- For each story:
    - Marks the story with the profile's pseudo as the author.
    - Saves the story to MongoDB using `PublicationHook`.
    - Prepares media (images and videos) for the story.
- Logs errors if crawling fails.
5. CommentOperator
Collects new comments for a given post.
Workflow:
- Extracts job parameters like `pseudo`, `profile_id`, `publication_id`, `publication_url`, `last_date`, and `crawl_limit`.
- Retrieves comments using the driver's `get_comments_from_post` method, filtering by the `since` timestamp.
- For each comment:
    - Saves the comment to MongoDB using `CommentHook`.
    - Prepares media (author picture, comment images) for the comment.
- Updates API quota if using an API-based tool.
- Logs the number of comments crawled.
- Raises an exception if crawling fails.
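The `since` filter and `crawl_limit` can be pictured with a small helper like the one below. It is only a sketch: `select_new_comments`, the `date` field name, and the choice to keep the oldest new comments first are assumptions, since the filtering actually happens inside the driver's `get_comments_from_post` method.

```python
from datetime import datetime, timezone


def select_new_comments(comments, since, crawl_limit):
    """Return at most `crawl_limit` comments posted strictly after `since`,
    ordered from oldest to newest."""
    fresh = [comment for comment in comments if comment["date"] > since]
    fresh.sort(key=lambda comment: comment["date"])
    return fresh[:crawl_limit]


def day(n):
    """Small helper for building timezone-aware example dates."""
    return datetime(2024, 1, n, tzinfo=timezone.utc)


example = [
    {"id": 1, "date": day(1)},
    {"id": 2, "date": day(5)},
    {"id": 3, "date": day(3)},
    {"id": 4, "date": day(9)},
]
selected = select_new_comments(example, since=day(2), crawl_limit=2)
```

Here `selected` holds the comments with ids 3 and 2, in that order: comment 1 predates `since`, and comment 4 is cut off by the limit.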
6. ProfileRepliesOperator
Collects comments made by a profile on various publications.
Workflow:
- Extracts job parameters like `pseudo`, `profile_id`, `last_date`, and `crawl_limit`.
- Retrieves profile comments using the driver's `get_profiles_replies` method, filtering by the `since` timestamp.
- For each comment:
    - Associates the comment with the publication URL and marks the author as the profile's pseudo.
    - Saves the comment to MongoDB in the `profile_reply` collection, skipping duplicates.
    - Prepares media (images and videos) for the comment.
- Updates API quota if using an API-based tool.
- Enforces a time limit (16,000 seconds) to prevent excessive runtime.
- Logs the number of comments crawled.
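The "skipping duplicates" step can be illustrated with an in-memory stand-in for the `profile_reply` collection. In this sketch the collection is a dictionary keyed by a comment `id` field; both the key choice and the `save_replies` helper are hypothetical, and the real operator may detect duplicates in MongoDB by other means.

```python
def save_replies(replies, pseudo, collection):
    """Store replies not yet present in `collection` and return how many were added.

    `collection` is a dict keyed by comment id, used here in place of the
    MongoDB `profile_reply` collection (an assumption of this sketch).
    """
    inserted = 0
    for reply in replies:
        key = reply["id"]
        if key in collection:
            continue  # already stored: skip the duplicate
        # Mark the profile's pseudo as the author before saving.
        collection[key] = {**reply, "author": pseudo}
        inserted += 1
    return inserted
```

Running the same batch twice adds nothing the second time, which is the property that makes repeated crawls of the same profile safe.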