Map the Internet (MTI)

A project that crawls the internet to build a visualization of how its domains link to one another.

How does it work?

A diagram depicting how the workers communicate with the server

The process of mapping the internet is fairly simple:

  1. The server assigns a worker a domain/list of paths to index.
  2. The worker crawls each assigned page, respecting robots.txt.
  3. The worker makes a list of all <a> tags that link to an external domain.
  4. The list is sent to the server, which compares each external domain against its database.
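
Steps 2 and 3 can be sketched with the Python standard library. This is a minimal illustration, not the project's worker code: the function names `external_domains` and `allowed_by_robots` are hypothetical, and it assumes that "external" means any http(s) link whose hostname differs from the hostname of the page being indexed (so subdomains count as external).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser


class _AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_domains(html, page_url):
    """Return the sorted hostnames that <a> tags in `html` link to,
    excluding the hostname of `page_url` itself (assumed definition)."""
    own_host = urlsplit(page_url).hostname
    collector = _AnchorCollector()
    collector.feed(html)
    found = set()
    for href in collector.hrefs:
        # Resolve relative links against the page before inspecting them.
        target = urlsplit(urljoin(page_url, href))
        if target.scheme in ("http", "https") and target.hostname:
            if target.hostname != own_host:
                found.add(target.hostname)
    return sorted(found)


def allowed_by_robots(robots_txt, user_agent, url):
    """Check `url` against the already-downloaded text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A worker would call `allowed_by_robots` before fetching each path and `external_domains` on every page it retrieves; fetching the pages themselves is left out of the sketch.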

The process operates in 3 stages:

  1. Initial Stage/Exponential Growth - The number of newly discovered domains is much greater than the number of domains already in the database.
  2. Saturation Stage - The number of newly discovered domains is lower than the number of domains already in the database.
  3. Rescanning Stage - Existing domains are periodically rescanned for new links.
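
The first two stages differ in how many of the domains a worker reports are new to the server. The sketch below shows that comparison under stated assumptions: an in-memory set stands in for the server's database, and the class name `DomainIndex` and its `merge` method are hypothetical, not part of the project.

```python
class DomainIndex:
    """In-memory stand-in for the server's domain database (assumption)."""

    def __init__(self, known=()):
        self.known = {domain.lower() for domain in known}

    def merge(self, reported):
        """Record the domains a worker reported.

        Returns (new_domains, known_before): the sorted domains that were
        not yet in the index, and how many domains the index held before
        this report was merged.
        """
        reported = {domain.lower() for domain in reported}
        new_domains = reported - self.known
        known_before = len(self.known)
        self.known |= new_domains
        return sorted(new_domains), known_before
```

Comparing `len(new_domains)` with `known_before` over time gives the two quantities that the stage descriptions refer to; where exactly "much greater" ends is not defined by the source, so no threshold is hard-coded here.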

MTI Worker Protocol (MTIWP) Specification

MTIWP is a TCP-based, binary application-layer protocol used for communication between the workers and the coordination server.

Each packet is made up of the following fields:

  • Header (19 Bytes)
    • Version (2 Bytes)
    • Worker ID (6 Bytes)
    • Timestamp (8 Bytes)
    • Method (1 Byte)
    • Payload Length (2 Bytes)
  • Payload (0-65535 Bytes)
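
The field sizes above add up to the 19-byte header (2 + 6 + 8 + 1 + 2). A minimal sketch of packing and unpacking such a packet with Python's `struct` module follows. The specification gives only field sizes, so several details here are assumptions: big-endian (network) byte order, Version and Timestamp treated as unsigned integers, Timestamp in milliseconds since the Unix epoch, and Worker ID kept as six raw bytes. The names `encode_packet` and `decode_packet` are hypothetical.

```python
import struct
import time

# Assumed layout: big-endian, no padding.
# H = Version (2), 6s = Worker ID (6), Q = Timestamp (8),
# B = Method (1), H = Payload Length (2)
HEADER = struct.Struct(">H6sQBH")
assert HEADER.size == 19


def encode_packet(version, worker_id, method, payload=b"", timestamp=None):
    """Build one packet: the 19-byte header followed by the payload."""
    if len(worker_id) != 6:
        raise ValueError("worker ID must be exactly 6 bytes")
    if len(payload) > 0xFFFF:
        raise ValueError("payload exceeds 65535 bytes")
    if timestamp is None:
        # Assumption: milliseconds since the Unix epoch.
        timestamp = int(time.time() * 1000)
    header = HEADER.pack(version, worker_id, timestamp, method, len(payload))
    return header + payload


def decode_packet(data):
    """Split one packet into (version, worker_id, timestamp, method, payload)."""
    if len(data) < HEADER.size:
        raise ValueError("truncated header")
    version, worker_id, timestamp, method, length = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return version, worker_id, timestamp, method, payload
```

Because TCP delivers a byte stream rather than discrete messages, a receiver would typically read 19 bytes first, take the payload length from the header, and then read exactly that many further bytes.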

Valid Methods:

  • ACK (0x00)
  • Ping (0x01)
  • Pong (0x02)
  • Hello (0x03)
  • Index (0x04)
  • Cancel (0x05)
  • Summary (0x06)
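
The method codes map naturally onto an enumeration. The sketch below is illustrative only: the `Method` enum and the `method_of` helper are hypothetical names, and the offset of the Method byte (16) is derived from the field sizes listed in the header layout, assuming the fields appear in the listed order with no padding.

```python
from enum import IntEnum


class Method(IntEnum):
    """Method codes as listed in the specification."""
    ACK = 0x00
    PING = 0x01
    PONG = 0x02
    HELLO = 0x03
    INDEX = 0x04
    CANCEL = 0x05
    SUMMARY = 0x06


HEADER_SIZE = 19
# Version (2) + Worker ID (6) + Timestamp (8) precede the Method byte.
METHOD_OFFSET = 2 + 6 + 8


def method_of(header):
    """Return the Method of a raw header; raises ValueError if the header
    is too short or the method byte is not one of the valid codes."""
    if len(header) < HEADER_SIZE:
        raise ValueError("truncated header")
    return Method(header[METHOD_OFFSET])
```

Rejecting unknown codes at this point means the rest of a receiver only ever handles the seven listed methods.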

Further Detail Goes Here