About SortedFor.Me
Why did I do this? Because I was sick of being bombarded with political content, trolls, and ads. My goal was to clean up my feed so I could find more cat memes, really.
Data Ingestion Process
I've crafted SortedFor.Me to not just fetch data, but to polish it, shaping it into something meaningful and enjoyable:
- Data Collection (`reddit_utils.py`): The mission kicks off on Reddit.
  - By leveraging OAuth, I securely authenticate and retrieve an access token to explore Reddit's depths.
  - The hunt revolves around my favorite subreddits, where posts are selected based on both predefined and dynamically user-defined key phrases. Naturally, many revolve around cats because, well, who doesn't love them?
  - Variety is the spice of content, so I scour various listing types like new, rising, and top to ensure a diverse array of posts.
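As a rough sketch of this step (standard library only; the function names and User-Agent string are illustrative, not the actual `reddit_utils.py`), app-only OAuth against Reddit's token endpoint and building a listing URL might look like:

```python
import json
import urllib.parse
import urllib.request
from base64 import b64encode

USER_AGENT = "SortedForMe/0.1 (demo)"  # illustrative; Reddit requires a descriptive User-Agent

def get_access_token(client_id: str, client_secret: str) -> str:
    """App-only OAuth: exchange app credentials for a bearer token."""
    creds = b64encode(f"{client_id}:{client_secret}".encode()).decode()
    req = urllib.request.Request(
        "https://www.reddit.com/api/v1/access_token",
        data=urllib.parse.urlencode({"grant_type": "client_credentials"}).encode(),
        headers={"Authorization": f"Basic {creds}", "User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]

def listing_url(subreddit: str, listing: str = "new", limit: int = 25) -> str:
    """Build the OAuth endpoint URL for a listing (new, rising, top, ...)."""
    query = urllib.parse.urlencode({"limit": limit})
    return f"https://oauth.reddit.com/r/{subreddit}/{listing}?{query}"
```

Requests against `oauth.reddit.com` then carry the token in an `Authorization: bearer <token>` header.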
- Playing Nice with the API (`reddit_utils.py` & `cache_utils.py`): Navigating Reddit's ecosystem requires finesse:
  - Rate limiting is my ally, keeping my exploration within Reddit's boundaries to avoid that dreaded digital boot.
  - A retry mechanism is in place to gracefully dance around rate limit errors, ensuring smooth and continuous operation.
  - Efficient caching is orchestrated via `cache_utils.py`, which logs results and queries existing data to avoid unnecessary repeated API requests, conserving both time and resources.
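The retry-plus-cache idea above can be sketched like this (a minimal illustration, not the actual code; the exception name and in-memory cache are assumptions, and a real cache would persist to disk):

```python
import time
from functools import wraps

class RateLimited(Exception):
    """Raised when the API answers with HTTP 429 (too many requests)."""

def with_retries(max_attempts: int = 4, base_delay: float = 1.0):
    """Retry a call with exponential backoff when we get rate-limited."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except RateLimited:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts; surface the error
                    time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        return wrapper
    return decorator

_cache: dict = {}  # illustrative in-memory stand-in for cache_utils.py

def cached_fetch(key, fetch_fn):
    """Return a cached result if this query was already made."""
    if key not in _cache:
        _cache[key] = fetch_fn()
    return _cache[key]
```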
- Exclusion Filters (`reddit_utils.py`): Filtering out the noise, with precision.
  - Exclusion phrases help carve out a feed devoid of politics and ads.
  - Thanks to focus-driven logic in `reddit_utils.py`, these filters are smartly applied to preserve the content's relevancy.
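In spirit, the filter is a phrase match over each post's text. A minimal sketch (the example phrases and field names are assumptions based on Reddit's post JSON, not the real exclusion list):

```python
EXCLUDE_PHRASES = ["election", "senate", "sponsored", "promo code"]  # illustrative

def is_clean(post: dict, exclude_phrases=EXCLUDE_PHRASES) -> bool:
    """True if neither title nor selftext contains an excluded phrase."""
    text = f"{post.get('title', '')} {post.get('selftext', '')}".lower()
    return not any(phrase in text for phrase in exclude_phrases)

def filter_posts(posts: list[dict], exclude_phrases=EXCLUDE_PHRASES) -> list[dict]:
    """Keep only the posts that pass the exclusion check."""
    return [p for p in posts if is_clean(p, exclude_phrases)]
```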
- Image Processing and OCR (`ocr_utils.py`): The invisible backbone.
  - Images are more than just pixels. `ocr_utils.py` meticulously examines URLs, generates hashes, and identifies potential duplicates for removal.
  - Optical Character Recognition (OCR) uncovers hidden text, which `ocr_utils.py` checks against exclude phrases, ensuring no unwanted content sneaks by.
  - Automatic detection and exclusion of broken or stale image links keep the feed crisp and relevant.
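The image side could be sketched as follows. This is an assumed implementation, not `ocr_utils.py` itself: it hashes raw bytes to catch exact duplicates, and the OCR step assumes the third-party Pillow and pytesseract packages (with Tesseract installed) are available.

```python
import hashlib

def image_fingerprint(image_bytes: bytes) -> str:
    """Hash the raw bytes so exact-duplicate images can be spotted."""
    return hashlib.sha256(image_bytes).hexdigest()

def ocr_text(image_path: str) -> str:
    """Extract visible text from an image (assumes Pillow + pytesseract)."""
    from PIL import Image   # third-party: Pillow
    import pytesseract      # third-party: wrapper around the Tesseract OCR engine
    return pytesseract.image_to_string(Image.open(image_path))

def image_is_clean(text: str, exclude_phrases) -> bool:
    """Check OCR output against the same exclusion list used for titles."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in exclude_phrases)
```

A byte-level hash only catches exact duplicates; perceptual hashing would also catch resized or re-encoded copies.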
- Duplicate Removal (`cache_utils.py` & `reddit_utils.py`): The unseen purging.
  - Whether it's posts or images, duplicates are expertly sifted out to maintain a fresh and engaging feed.
  - The unique logic embedded within these utilities ensures distinct and captivating content.
- JSON Storage: A safe harbor for data.
  - After a rigorous clean-up, the refined data is stored in JSON files, forming the backbone of your browsing pleasure at SortedFor.Me.
  - This structuring ensures only the cleanest and most relevant content is displayed, making your experience seamless and enjoyable.
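The storage step is a plain JSON round trip; a sketch (the file name is an assumption):

```python
import json
from pathlib import Path

def save_feed(posts: list[dict], path: str = "feed.json") -> None:
    """Write the cleaned posts to disk as the site's data backbone."""
    Path(path).write_text(json.dumps(posts, indent=2, ensure_ascii=False))

def load_feed(path: str = "feed.json") -> list[dict]:
    """Read the stored feed back for serving."""
    return json.loads(Path(path).read_text())
```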
Objective
The aim is simple: Enjoy browsing without the politics and ads. By using advanced text and image analysis, I keep the feed focused and enjoyable. It's all about bringing the fun back to scrolling through the web.
Next Steps
SortedFor.Me is still a work in progress (an MVP), but here's what I'm thinking about next:
- User Customization: Allowing users to directly customize data sources and exclusion phrases, giving you more control over what you want to see.
- More Data Sources: I'm envisioning integrating additional platforms beyond Reddit, tapping into other social media and news sites to enrich your feed with broader content diversity.