Map the Internet (MTI)

MTI scrapes the internet and builds a visualization of how its domains link to one another.

How does it work?

A diagram depicting how the workers communicate with the server

The process of mapping the internet is fairly simple:

  1. The server assigns a worker a domain/list of paths to index.
  2. The worker goes through each page of the domain, respecting robots.txt.
  3. The worker makes a list of all <a> tags that link to an external domain.
  4. The list is sent to the server, which compares each external domain to its database.
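
The steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual implementation. The function names, the `mti-worker` user agent, and the rule that any different hostname counts as "external" (so subdomains are treated as external) are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def allowed_paths(robots_txt, base_url, paths, user_agent="mti-worker"):
    """Step 2: drop paths that robots.txt disallows for this (hypothetical) user agent."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [p for p in paths if rules.can_fetch(user_agent, urljoin(base_url, p))]


class ExternalLinkParser(HTMLParser):
    """Step 3: collect hostnames of <a href> targets that leave the page's host."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).hostname
        self.external_domains = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL before comparing hosts.
        target = urlparse(urljoin(self.page_url, href))
        if target.scheme not in ("http", "https"):
            return  # skip mailto:, javascript:, and similar links
        # Assumption: "external" means any hostname other than the page's own.
        if target.hostname and target.hostname != self.page_host:
            self.external_domains.add(target.hostname)


def external_domains(page_url, html):
    """Return the sorted external hostnames linked from one page."""
    parser = ExternalLinkParser(page_url)
    parser.feed(html)
    return sorted(parser.external_domains)


def new_domains(reported, known):
    """Step 4 (server side): keep only domains not already in the database."""
    return sorted(set(reported) - set(known))
```

Here `known` stands in for the server's database; a real deployment would presumably query persistent storage rather than an in-memory set.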

The process operates in 3 stages:

  1. Initial Stage/Exponential Growth - The number of newly discovered domains is much greater than the number of existing domains.
  2. Saturation Stage - The number of newly discovered domains is lower than the number of existing domains.
  3. Rescanning Stage - Existing domains are periodically rescanned looking for new links.

MTI Worker Protocol (MTIWP) Specification

MTIWP is a TCP-based binary application-layer protocol used for communication between the workers and the coordination server.

The packet is made up of the following fields:

  • Header (19 Bytes)
    • Version (2 Bytes)
    • Worker ID (6 Bytes)
    • Timestamp (8 Bytes)
    • Method (1 Byte)
    • Payload Length (2 Bytes)
  • Payload (0-65535 Bytes)
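
As a rough sketch, the layout above maps onto a fixed 19-byte header followed by the payload. The example below assumes the fields appear on the wire in the order listed, that multi-byte integers use network (big-endian) byte order, that the Worker ID is an opaque 6-byte string, and that the timestamp is an unsigned 64-bit integer. None of these details are stated by the specification yet, and the function names are hypothetical.

```python
import struct

# version (2) + worker ID (6) + timestamp (8) + method (1) + payload length (2) = 19 bytes.
# "!" selects network byte order with no padding (byte order is an assumption).
HEADER = struct.Struct("!H6sQBH")


def encode_packet(version, worker_id, timestamp, method, payload=b""):
    """Build one MTIWP packet as bytes."""
    if len(worker_id) != 6:
        raise ValueError("worker ID must be exactly 6 bytes")
    if len(payload) > 0xFFFF:
        raise ValueError("payload must not exceed 65535 bytes")
    return HEADER.pack(version, worker_id, timestamp, method, len(payload)) + payload


def decode_packet(data):
    """Split bytes into (version, worker_id, timestamp, method, payload)."""
    if len(data) < HEADER.size:
        raise ValueError("packet is shorter than the 19-byte header")
    version, worker_id, timestamp, method, length = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("payload is truncated")
    return version, worker_id, timestamp, method, payload
```

Because TCP is a byte stream, a receiver would read the 19 header bytes first and then read exactly as many payload bytes as the Payload Length field announces.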

Valid Methods:

  • ACK (0x00)
  • Ping (0x01)
  • Pong (0x02)
  • Hello (0x03)
  • Index (0x04)
  • Cancel (0x05)
  • Summary (0x06)
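
One way to represent these codes in a client or server is an integer enum. The sketch below uses the values from the list above; the `Method` class and `parse_method` helper are illustrative names, and rejecting unknown bytes with an error is an assumed policy rather than something the specification defines.

```python
from enum import IntEnum


class Method(IntEnum):
    """MTIWP method codes, as listed in the specification."""
    ACK = 0x00
    PING = 0x01
    PONG = 0x02
    HELLO = 0x03
    INDEX = 0x04
    CANCEL = 0x05
    SUMMARY = 0x06


def parse_method(byte_value):
    """Map the header's 1-byte Method field to a Method member."""
    try:
        return Method(byte_value)
    except ValueError:
        # Assumption: unknown method bytes are treated as a protocol error.
        raise ValueError(f"unknown MTIWP method: {byte_value:#04x}") from None
```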

Further Detail Goes Here