
A Pragmatic Look at Web Scraping, Open Source, and LLM-Assisted Development

This is a tour of an educational project, forked from the models.dev project, that version-controls metadata. It's a different flavor of the original, and it serves as the running example for this post's discussion of ethical topics involving creators, AI labs, and LLM-assisted development.

Building Project MERLN


MERLN is short for Metadata Enriched RSS Linker Node. It's a whimsical name for what is essentially an RSS reader that:

  1. Collects links from multiple RSS feeds
  2. Enriches each link with metadata such as titles, descriptions, and more from Open Graph tags
  3. Redirects visitors to the original content

It's a "polite" web scraper that fetches only predefined feeds. It checks robots.txt files, collects links, enriches them with metadata, validates the schema for full-stack type safety, audits the dependency supply chain, and compiles a static webpage. It's lightweight and nothing fancy, hosted on Cloudflare. (A link to the source is at the end of this post.)
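To make the enrichment step concrete, here's a rough TypeScript sketch of what pulling Open Graph tags out of a fetched page can look like. The function names and regex-based parsing are illustrative assumptions, not the actual merln implementation.

    // Illustrative sketch of Open Graph enrichment (not the actual merln code).
    // It fetches a single page and pulls og:* meta tags out of the HTML.

    interface LinkMetadata {
      url: string;
      title?: string;
      description?: string;
      image?: string;
    }

    // Extract the content attribute of a <meta property="og:..."> tag.
    // A real parser should handle attribute order and quoting more robustly.
    function ogTag(html: string, property: string): string | undefined {
      const pattern = new RegExp(
        `<meta[^>]+property=["']og:${property}["'][^>]+content=["']([^"']*)["']`,
        "i",
      );
      return pattern.exec(html)?.[1];
    }

    async function enrichLink(url: string): Promise<LinkMetadata> {
      const res = await fetch(url, {
        headers: { "User-Agent": "merln/rss (+https://natepapes.com)" },
      });
      const html = await res.text();
      return {
        url,
        title: ogTag(html, "title"),
        description: ogTag(html, "description"),
        image: ogTag(html, "image"),
      };
    }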

Architecture

                       
                       MERLN: Metadata Enriched RSS Linker Node
                       ========================================

    [START] --> [RSS Config] --> [Feed Cache] --> [RSS Fetch] --> [robots.txt]
                  (rss.toml)      (ETag/304)       (rss.ts)         (check)
                                       |               |               |
                                       v               v               v
                                  [Cache Hit]     [HTTP Retry]    [Disallowed]
                                     (skip)        (429/503)        (flag)
                                                       |
                                                       v
                                                 [XML Parser] --> [OpenGraph]
                                                  (Atom/RSS)      (meta tags)
                                                       |               |
                                                       v               v
                                                 [TOML Files]    [Enrich Meta]
                                                 (providers/)     (og:[N])
                                                       |               |
                                                       v---------------+
                                                       |
                                  +------------+-------+-------+
                                  |            |               |
                                  v            v               v
                           [Unit Tests] [Schema Valid]  [License Audit]
                           (test.yml)   (validate.yml)  (license-check.yml)
                                  |            |               |
                                  +------------+-------+-------+
                                                       |
                                                       v
                                                [Build & Deploy]
                                                 (deploy.yml)
                                                       |
                                                       v
                                                 [Static Site]
                                                (natepapes.com)
                                                       |
                                                       v
                                                  [API JSON]
                                                  (/api.json)
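One note on the HTTP Retry box above: 429 and 503 responses are a server's way of asking a client to slow down, so the polite response is to back off and try again later. Here's a minimal sketch of that behavior; the helper name, attempt count, and delays are my assumptions, not the exact merln logic.

    // Retry only on 429 (rate limited) and 503 (temporarily unavailable),
    // honoring Retry-After when the server provides it. Illustrative sketch.
    async function fetchWithRetry(
      url: string,
      init: RequestInit,
      maxAttempts = 3,
    ): Promise<Response> {
      let res = await fetch(url, init);
      for (let attempt = 1; attempt < maxAttempts; attempt++) {
        if (res.status !== 429 && res.status !== 503) return res;
        const retryAfter = Number(res.headers.get("retry-after"));
        const delayMs =
          Number.isFinite(retryAfter) && retryAfter > 0
            ? retryAfter * 1000
            : 1000 * 2 ** (attempt - 1); // exponential backoff: 1s, 2s, 4s...
        await new Promise((resolve) => setTimeout(resolve, delayMs));
        res = await fetch(url, init);
      }
      return res;
    }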

Collecting Links & Enriching with Metadata

My ingestion script makes a single GET request to each external URL and parses only the tags that publishers add specifically to be scraped. It never crawls hidden pages; it checks robots.txt, throttles requests, caches responses using ETag and Last-Modified validators, and sets a User-Agent that content providers can easily identify.
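Here's roughly what that conditional request looks like. The cache entry shape is an assumption for the sketch, but the If-None-Match / If-Modified-Since mechanics are standard HTTP.

    interface CacheEntry {
      etag?: string;
      lastModified?: string;
      body?: string;
    }

    // Fetch a feed politely: identify ourselves and reuse the validators from
    // the previous response so an unchanged feed comes back as a cheap 304.
    async function fetchFeed(url: string, cached?: CacheEntry): Promise<CacheEntry> {
      const headers: Record<string, string> = {
        "User-Agent": "merln/rss (+https://natepapes.com)",
      };
      if (cached?.etag) headers["If-None-Match"] = cached.etag;
      if (cached?.lastModified) headers["If-Modified-Since"] = cached.lastModified;

      const res = await fetch(url, { headers });
      if (res.status === 304 && cached) return cached; // unchanged: reuse cached body

      return {
        etag: res.headers.get("etag") ?? undefined,
        lastModified: res.headers.get("last-modified") ?? undefined,
        body: await res.text(),
      };
    }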

If they see User-Agent: merln/rss (+https://natepapes.com), it's me!

My intent is clear, and RSS feeds and Open Graph tags exist to be re-shared. There's no deep crawling. Source URLs are stored verbatim in every TOML record under providers/*/content/ in the public source repository, where anyone can inspect them.
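The schema-validation step keeps those records honest end to end. Here's a sketch of what that can look like with zod; the field names are hypothetical, not merln's actual record shape.

    import { z } from "zod";

    // Hypothetical shape of one enriched link record; the field names are
    // illustrative, not merln's actual schema.
    const LinkRecord = z.object({
      url: z.string().url(),            // source URL, stored verbatim
      title: z.string(),
      description: z.string().optional(),
      published: z.string().optional(),
    });

    type LinkRecord = z.infer<typeof LinkRecord>;

    // Parsing at build time throws on malformed records, failing the pipeline
    // before anything reaches the static site.
    export function validateRecord(data: unknown): LinkRecord {
      return LinkRecord.parse(data);
    }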

But when I checked the robots.txt files for GitHub and YouTube, I found Disallow rules covering their public RSS feeds. That seemed odd to me, since RSS exists precisely for resharing and for reading content in a third-party app without all the ads and other distractions.

I'm a little torn on the ethics here. I'm not building a search engine; I'm collecting links to content I created across multiple platforms, and every link points back to the platform it came from. And it's only my content.

From my understanding, scraping those feeds anyway would violate the platforms' terms of service; to collect links to your own content, you'll have to go through the content providers' sanctioned channels. There is no law saying you have to follow robots.txt rules, but there can be consequences for breaking terms of service.

So, for MERLN, I include a robots.txt parser to check whether a feed is allowed. If it isn't, I'll have to obtain my feeds through OAuth and the platforms' public APIs, which is not the easiest path. I could also create my own manual entries… it just can't happen in a script.
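For illustration, a stripped-down version of that check might look like the sketch below. It only handles Disallow rules in the wildcard user-agent group; a real parser also deals with per-agent groups, Allow rules, and wildcards.

    // Simplified robots.txt check: does any Disallow rule in the "*" group
    // prefix-match the feed path? Illustrative only, not a full parser.
    function isFeedDisallowed(robotsTxt: string, feedPath: string): boolean {
      let inWildcardGroup = false;
      for (const rawLine of robotsTxt.split("\n")) {
        const line = rawLine.split("#")[0].trim(); // strip comments
        if (line === "") continue;
        const [field, ...rest] = line.split(":");
        const value = rest.join(":").trim();
        if (field.toLowerCase() === "user-agent") {
          inWildcardGroup = value === "*";
        } else if (inWildcardGroup && field.toLowerCase() === "disallow") {
          if (value !== "" && feedPath.startsWith(value)) return true;
        }
      }
      return false;
    }

    // Example: isFeedDisallowed(robotsTxt, "/rss") decides whether to skip a feed.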

The ethics stay on solid ground if you read the terms of service and follow the rules in the robots.txt file.

Forking an MIT-Licensed Codebase

MIT is among the most permissive licenses: you can copy, modify, and redistribute commercially, as long as the license notice remains.

The README in merln links back to the original repo, and the license is carried forward: the fork ships with the unchanged MIT license, and there's opportunity to upstream any improvements. All modifications remain open source, so anyone can audit or build upon them. With attribution intact and the license unchanged, this is unequivocally ethical.

I also include a license checker that fails the GitHub workflow when a dependency carries a non-permissive license. When that happens, I have to change my package dependencies so that the whole software supply chain stays within licenses I can honor.
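A checker like that can boil down to comparing each installed package's declared license against an allow-list and exiting non-zero on anything else, which fails the CI job. The sketch below is an assumption about how such a check could work (allow-list contents and the walk over node_modules are mine), not the exact workflow merln runs.

    import { existsSync, readdirSync, readFileSync } from "node:fs";
    import { join } from "node:path";

    // Hypothetical allow-list; anything outside it fails the build.
    const ALLOWED = new Set(["MIT", "ISC", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"]);

    function auditLicenses(nodeModules = "node_modules"): string[] {
      const violations: string[] = [];
      for (const name of readdirSync(nodeModules)) {
        const pkgJson = join(nodeModules, name, "package.json");
        // This sketch skips dot-directories and scoped package folders without
        // their own package.json.
        if (name.startsWith(".") || !existsSync(pkgJson)) continue;
        const pkg = JSON.parse(readFileSync(pkgJson, "utf8"));
        const license = typeof pkg.license === "string" ? pkg.license : "UNKNOWN";
        if (!ALLOWED.has(license)) violations.push(`${name}: ${license}`);
      }
      return violations;
    }

    const violations = auditLicenses();
    if (violations.length > 0) {
      console.error("Disallowed or unknown licenses:\n" + violations.join("\n"));
      process.exit(1); // non-zero exit fails the GitHub workflow
    }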

Using an LLM to Assist in Development

I was blown away by my first interaction with LLMs and wrote about it; HN commenters had feedback. The critical points were the opaque fair-use status of copyrighted content fed into LLMs and the shortcomings of the technology at that time.

Creators of novel works are a key ingredient in today's immensely powerful LLMs, which followed from the transformer architecture introduced in the research paper "Attention Is All You Need." As it turns out, attention alone isn't enough: you also need an enormous amount of compute and an enormous body of intellectual works.

The gray area:

  1. Opaque training data. We don't know every snippet the model ingested.
  2. Risk of verbatim regurgitation. The model might output copyrighted code.
  3. Authorship clarity. Mixing machine and human work needs documentation.

Safeguards I apply:

  1. Prompt for patterns, and use the model more as a debugging and educational tool.
  2. Keep human PR reviews to vet every line.
  3. Run license checks. Skip those checks and the ethical needle swings away from good.

Closing the Loop

Many knowledge workers are using LLMs to assist in their day-to-day work. I view the internet as humanity's braintrust.

So, ethical LLM usage… I see it as a spectrum guided by your intent. Educating yourself on fair use is the first step in forming your own stance. And ultimately, follow a mental checklist to audit every LLM-generated token you include in your professional practice and works.

github.com/papes1ns/merln

