About SortedFor.Me

Why did I do this? Because I was sick of being bombarded with political content, trolls, and ads, so my goal was to clean things up so I could find more cat memes, really.

Data Ingestion Process

I've crafted SortedFor.Me not just to fetch data, but to polish it, shaping it into something meaningful and enjoyable:

  1. Data Collection (reddit_utils.py): The mission kicks off on Reddit (a rough sketch of these calls follows this list).
    1. By leveraging OAuth, I securely authenticate and retrieve an access token to explore Reddit's depths.
    2. The hunt revolves around my favorite subreddits, where posts are selected based on both predefined and dynamically user-defined key phrases. Naturally, many revolve around cats because, well, who doesn't love them?
    3. Variety is the spice of content. Hence, I scour various listing types like new, rising, and top to ensure a diverse array of posts.
  2. Playing Nice with the API (reddit_utils.py & cache_utils.py): Navigating Reddit’s API takes finesse (a retry-and-cache sketch follows this list):
    1. Rate limiting is my ally, helping maintain my exploration within Reddit's boundaries to avoid that dreaded digital boot.
    2. A sophisticated retry mechanism is in place to gracefully dance around rate limit errors, ensuring smooth and continuous operation.
    3. Efficient caching is orchestrated via cache_utils.py, logging results and querying existing data to avoid unnecessary repeated API requests, conserving both time and resources.
  3. Exclusion Filters (reddit_utils.py): Filtering out the noise with precision (a phrase-matching sketch follows this list).
    1. Exclusion phrases help carve out a feed devoid of politics and ads.
    2. Thanks to focus-driven logic in reddit_utils.py, these filters are smartly applied to preserve the content's relevancy.
  4. Image Processing and OCR (ocr_utils.py): The invisible backbone (a hash-and-OCR sketch follows this list).
    1. Images are more than just pixels. ocr_utils.py meticulously examines URLs, generates hashes, and identifies potential duplicates for removal.
    2. Optical Character Recognition (OCR) uncovers hidden text, while ocr_utils.py checks it against the exclusion phrases, ensuring no unwanted content sneaks by.
    3. Broken or stale image links are automatically detected and excluded, keeping the feed crisp and current.
  5. Duplicate Removal (cache_utils.py & reddit_utils.py): The unseen purging (a small dedup sketch follows this list).
    1. Whether it's posts or images, duplicates are expertly sifted out to maintain a fresh and engaging feed.
    2. Seen-post and seen-image records are checked on every run, so the same content doesn't resurface.
  6. JSON Storage: A safe harbor for data (a short save sketch follows this list).
    1. After a rigorous clean-up, the refined data is stored in JSON files, forming the backbone of your browsing pleasure at SortedFor.Me.
    2. This structuring ensures only the cleanest and most relevant content is displayed, making your experience seamless and enjoyable.
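
Here's a rough, illustrative sketch of step 1, the token fetch and listing pulls. The endpoints are Reddit's real OAuth API, but the credentials are placeholders and the actual reddit_utils.py may structure this differently:

```python
import requests

# Hypothetical placeholders -- the real app credentials live elsewhere.
CLIENT_ID = "my-client-id"
CLIENT_SECRET = "my-client-secret"
USER_AGENT = "SortedForMe/0.1 (by u/someone)"

def get_token() -> str:
    """Exchange app credentials for a short-lived OAuth access token."""
    resp = requests.post(
        "https://www.reddit.com/api/v1/access_token",
        auth=(CLIENT_ID, CLIENT_SECRET),           # HTTP Basic auth
        data={"grant_type": "client_credentials"}, # app-only OAuth flow
        headers={"User-Agent": USER_AGENT},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def fetch_listing(token: str, subreddit: str, listing: str = "new", limit: int = 50) -> list[dict]:
    """Pull one listing (new, rising, or top) from a subreddit."""
    resp = requests.get(
        f"https://oauth.reddit.com/r/{subreddit}/{listing}",
        params={"limit": limit},
        headers={"Authorization": f"bearer {token}", "User-Agent": USER_AGENT},
        timeout=10,
    )
    resp.raise_for_status()
    # Each listing child wraps the post data we actually care about.
    return [child["data"] for child in resp.json()["data"]["children"]]
```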
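
Step 2's retry-and-cache pattern, in miniature. The helper names (get_with_retry, cached_get), the backoff strategy, and the cache file are my stand-ins here, not the actual cache_utils.py API:

```python
import json
import time
from pathlib import Path

import requests

CACHE_PATH = Path("cache.json")  # hypothetical cache location

def get_with_retry(url: str, headers: dict, max_retries: int = 5) -> requests.Response:
    """Back off when Reddit answers 429 (rate limited) instead of giving up."""
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code != 429:
            return resp
        # Honor Retry-After if Reddit sends it, else back off exponentially.
        time.sleep(float(resp.headers.get("Retry-After", 2 ** attempt)))
    raise RuntimeError(f"still rate limited after {max_retries} attempts: {url}")

def cached_get(url: str, headers: dict) -> dict:
    """Serve from the JSON cache when possible to skip repeat API calls."""
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    if url in cache:
        return cache[url]
    data = get_with_retry(url, headers).json()
    cache[url] = data
    CACHE_PATH.write_text(json.dumps(cache))
    return data
```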
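
Step 3's phrase matching could be as simple as the sketch below. The real filters in reddit_utils.py may be fancier (regexes, per-subreddit rules), and both phrase lists here are invented examples:

```python
KEY_PHRASES = ["cat", "kitten", "caturday"]                    # illustrative
EXCLUDE_PHRASES = ["election", "senate", "sponsored", "promo"] # illustrative

def wanted(post: dict) -> bool:
    """Keep a post only if it matches a key phrase and no exclusion phrase."""
    text = f"{post.get('title', '')} {post.get('selftext', '')}".lower()
    if any(bad in text for bad in EXCLUDE_PHRASES):
        return False
    return any(good in text for good in KEY_PHRASES)

# Example: a Caturday post passes, a sponsored one doesn't.
assert wanted({"title": "Caturday dump", "selftext": ""})
assert not wanted({"title": "Sponsored: cat food deal", "selftext": ""})
```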
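
Step 4, sketched with Pillow, imagehash, and pytesseract as assumed dependencies; ocr_utils.py may use a different stack entirely, but the flow (download, hash, OCR, check) would look roughly like this:

```python
import io

import imagehash
import pytesseract
import requests
from PIL import Image

EXCLUDE_PHRASES = ["vote", "sponsored"]  # illustrative
seen_hashes: set[str] = set()

def image_ok(url: str) -> bool:
    """Reject broken links, near-duplicate images, and images whose text hits an exclusion phrase."""
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:                  # broken or stale link
        return False
    try:
        img = Image.open(io.BytesIO(resp.content))
        img.load()
    except OSError:                              # not a decodable image
        return False
    h = str(imagehash.phash(img))                # perceptual hash for dupes
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    text = pytesseract.image_to_string(img).lower()
    return not any(bad in text for bad in EXCLUDE_PHRASES)
```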
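
Step 5 at the post level. The cross-run bookkeeping presumably lives in cache_utils.py; within a single run, a set over Reddit post IDs is all it takes:

```python
def dedupe(posts: list[dict]) -> list[dict]:
    """Keep only the first occurrence of each Reddit post ID."""
    seen: set[str] = set()
    unique = []
    for post in posts:
        if post["id"] not in seen:
            seen.add(post["id"])
            unique.append(post)
    return unique
```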
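
And step 6, the JSON hand-off to the site. The filename and shape are assumptions; the point is just that the cleaned feed lands in a plain JSON file:

```python
import json
from pathlib import Path

def save_feed(posts: list[dict], path: str = "feed.json") -> None:
    """Write the cleaned posts as pretty-printed JSON for SortedFor.Me to serve."""
    Path(path).write_text(
        json.dumps(posts, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
```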

Objective

The aim is simple: Enjoy browsing without the politics and ads. By using advanced text and image analysis, I keep the feed focused and enjoyable. It's all about bringing the fun back to scrolling through the web.

Next Steps

SortedFor.Me is still a work in progress (an MVP), but I'm thinking the next steps might look something like: