Map the Internet (MTI)

A project designed to scrape the internet and create a visualization of it.

How does it work?

A diagram depicting how the workers communicate with the server

The process of mapping the internet is fairly simple:

  1. The server assigns a worker a domain/list of paths to index.
  2. The worker goes through each page of the domain, respecting robots.txt.
  3. The worker makes a list of all <a> tags that link to an external domain.
  4. The list is sent to the server, which compares each external domain to its database.
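Steps 3 and 4 can be sketched roughly as follows. This is only an illustration, not the project's actual implementation: the function names (host_of, external_domains, new_domains) are hypothetical, the host parsing is deliberately simplified (http/https only, no IPv6 literals), and subdomains such as www.example.com are treated as distinct from example.com.

```rust
use std::collections::HashSet;

/// Extracts the lowercase host from an absolute http(s) URL.
/// Relative links return None because they stay on the current domain.
/// Simplification: IPv6 literals such as "[::1]" are not handled.
fn host_of(href: &str) -> Option<String> {
    let rest = href
        .strip_prefix("https://")
        .or_else(|| href.strip_prefix("http://"))?;
    // The authority part ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    // Drop optional "user@" credentials and a trailing ":port".
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// Step 3: collect the distinct external domains linked from a page.
fn external_domains(current_domain: &str, hrefs: &[&str]) -> HashSet<String> {
    hrefs
        .iter()
        .filter_map(|href| host_of(href))
        .filter(|host| host.as_str() != current_domain)
        .collect()
}

/// Step 4: the server keeps only the domains it has not seen before.
/// The result is sorted so the output is deterministic.
fn new_domains(known: &HashSet<String>, reported: &HashSet<String>) -> Vec<String> {
    let mut fresh: Vec<String> = reported.difference(known).cloned().collect();
    fresh.sort();
    fresh
}
```

A worker would call external_domains once per page and merge the results before sending them. The in-memory HashSet stands in for the server's database here.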

The process operates in 3 stages:

  1. Initial Stage/Exponential Growth - The number of new domains is much greater than the number of existing domains.
  2. Saturation Stage - The number of new domains is lower than the number of existing domains.
  3. Rescanning Stage - Existing domains are periodically rescanned for new links.

Contact

The project's maintainer and primary contact is Luke Harding <luke@lukeh990.io>. Feel free to email with any questions.

MTI Worker Protocol (MTIWP) Specification

MTIWP is a TCP-based binary application layer protocol used for communication between the workers and the coordination server.

Packet Makeup

The packet is made up of the following fields:

  • Header
    • Version (2 Bytes)
    • Sender ID (6 Bytes)
    • Timestamp (8 Bytes)
    • Channel (2 Bytes)
    • Method (1 Byte)
    • Payload Length (2 Bytes)
  • Payload (0-65535 Bytes)

All fields are required to be in big endian format.
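As a rough sketch, the 21-byte header (2 + 6 + 8 + 2 + 1 + 2) could be encoded and decoded as shown below. The Header struct and its to_bytes/from_bytes methods are hypothetical names, and the code assumes that the fields appear on the wire in the order listed above and that the timestamp is stored as a big-endian f64.

```rust
/// Size of the fixed header: 2 + 6 + 8 + 2 + 1 + 2 bytes.
const HEADER_LEN: usize = 21;

#[derive(Debug, Clone, PartialEq)]
struct Header {
    version: u16,
    sender_id: [u8; 6],
    /// Assumption: seconds since the Unix epoch, including the fraction.
    timestamp: f64,
    channel: u16,
    method: u8,
    payload_len: u16,
}

impl Header {
    /// Serializes every multi-byte field in big endian order.
    fn to_bytes(&self) -> [u8; HEADER_LEN] {
        let mut buf = [0u8; HEADER_LEN];
        buf[0..2].copy_from_slice(&self.version.to_be_bytes());
        buf[2..8].copy_from_slice(&self.sender_id);
        buf[8..16].copy_from_slice(&self.timestamp.to_be_bytes());
        buf[16..18].copy_from_slice(&self.channel.to_be_bytes());
        buf[18] = self.method;
        buf[19..21].copy_from_slice(&self.payload_len.to_be_bytes());
        buf
    }

    /// Parses a header from exactly HEADER_LEN bytes.
    /// No validation (e.g. of the version) is performed here.
    fn from_bytes(buf: &[u8; HEADER_LEN]) -> Header {
        let mut sender_id = [0u8; 6];
        sender_id.copy_from_slice(&buf[2..8]);
        let mut ts = [0u8; 8];
        ts.copy_from_slice(&buf[8..16]);
        Header {
            version: u16::from_be_bytes([buf[0], buf[1]]),
            sender_id,
            timestamp: f64::from_be_bytes(ts),
            channel: u16::from_be_bytes([buf[16], buf[17]]),
            method: buf[18],
            payload_len: u16::from_be_bytes([buf[19], buf[20]]),
        }
    }
}
```

Using a fixed-size array for the header means a receiver can read exactly 21 bytes first and only then decide how many payload bytes to expect.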

Version

The version of the MTIWP protocol. Currently the only acceptable value is 0x0001.

Sender ID

The Sender ID is a value that identifies the sender. It is written the same way as an Ethernet MAC address (i.e. XX:XX:XX:XX:XX:XX).

The server always reserves the sender ID 00:00:00:00:00:00 for itself. The server assigns IDs to new workers in a response that uses the Provision method.

NOTE: The sender ID is not a real MAC address. It is just formatted like one.
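For illustration, a sender ID could be rendered in this notation as shown below. The names SERVER_ID, format_sender_id and is_server are hypothetical, and the use of uppercase hex digits is an assumption, since the text only shows the XX placeholder.

```rust
/// The all-zero ID that the server reserves for itself.
const SERVER_ID: [u8; 6] = [0; 6];

/// Formats the 6 raw bytes as colon-separated, two-digit hex groups.
/// Assumption: uppercase hex digits.
fn format_sender_id(id: &[u8; 6]) -> String {
    id.iter()
        .map(|byte| format!("{:02X}", byte))
        .collect::<Vec<_>>()
        .join(":")
}

/// Returns true if the ID is the one reserved for the server.
fn is_server(id: &[u8; 6]) -> bool {
    *id == SERVER_ID
}
```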

Timestamp

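The timestamp is based on the TimestampSecondsWithFrac<f64> type provided by the serde_with crate for Rust, i.e. the number of seconds since the Unix epoch, including the fractional part, as a 64-bit float.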
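A minimal sketch of how the 8-byte field could be produced and read back using only the standard library. It assumes the field holds seconds since the Unix epoch as a big-endian f64; the function names are hypothetical, and this does not go through serde itself.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Encodes a point in time as fractional seconds since the Unix epoch.
/// Times before the epoch become negative values.
fn timestamp_to_bytes(t: SystemTime) -> [u8; 8] {
    let secs = match t.duration_since(UNIX_EPOCH) {
        Ok(after) => after.as_secs_f64(),
        Err(before) => -before.duration().as_secs_f64(),
    };
    secs.to_be_bytes()
}

/// Decodes the field, returning None for NaN, infinite or
/// out-of-range values instead of panicking.
fn timestamp_from_bytes(bytes: [u8; 8]) -> Option<SystemTime> {
    let secs = f64::from_be_bytes(bytes);
    let magnitude = Duration::try_from_secs_f64(secs.abs()).ok()?;
    if secs >= 0.0 {
        UNIX_EPOCH.checked_add(magnitude)
    } else {
        UNIX_EPOCH.checked_sub(magnitude)
    }
}
```

Because an f64 has 53 bits of precision, present-day timestamps round-trip to roughly microsecond accuracy rather than to the exact nanosecond.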

Channel

Because the server and worker communicate asynchronously, the receiver needs a way to distinguish the intended receiving task while maintaining a single TCP connection. The client must maintain one worker to handle all send/receive operations.

Two channels are reserved:

  • Channel 0x00 - Client/Server Initialization
  • Channel 0x01 - Ping/Pong heartbeat cycle

All other channels are dynamically allocated. If the client is initiating a request, the channel is randomly picked from the range 0x02 - 0x8f. Likewise, if the server is initiating the request, the channel must be picked from the range 0x90 - 0xff.

The initiator is responsible for maintaining a list of in-use channels and freeing those that are no longer in use.
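The bookkeeping an initiator needs could look roughly like the sketch below. ChannelAllocator and the range constants are hypothetical names. To stay free of external crates, this sketch hands out the lowest free channel instead of a random one, so a real implementation would swap in a random choice among the free channels.

```rust
use std::collections::BTreeSet;

/// Inclusive channel ranges for requests initiated by each side.
const CLIENT_RANGE: (u16, u16) = (0x02, 0x8f);
const SERVER_RANGE: (u16, u16) = (0x90, 0xff);

/// Tracks which dynamically allocated channels are currently in use.
struct ChannelAllocator {
    range: (u16, u16),
    in_use: BTreeSet<u16>,
}

impl ChannelAllocator {
    fn new(range: (u16, u16)) -> Self {
        ChannelAllocator { range, in_use: BTreeSet::new() }
    }

    /// Reserves a free channel, or returns None if the range is exhausted.
    /// Simplification: picks the lowest free channel, not a random one.
    fn allocate(&mut self) -> Option<u16> {
        let free = (self.range.0..=self.range.1).find(|c| !self.in_use.contains(c))?;
        self.in_use.insert(free);
        Some(free)
    }

    /// Frees a channel; returns false if it was not in use.
    fn release(&mut self, channel: u16) -> bool {
        self.in_use.remove(&channel)
    }
}
```

A client would create its allocator with CLIENT_RANGE and the server with SERVER_RANGE, so the two sides can never pick the same channel for different requests.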

Methods

Valid Methods:

  • ACK (0x00)
  • Ping (0x01)
  • Pong (0x02)
  • Hello (0x03)
  • Provision (0x04)
  • Index (0x05)
  • Cancel (0x06)
  • Summary (0x07)
  • Error (0x08)
  • Channel End (0x09)
  • Goodbye (0x0A)
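The method byte maps naturally onto an enum. The sketch below uses the values from the list above. The Method type and its from_byte/as_byte helpers are hypothetical names and not necessarily how the project models it.

```rust
/// One variant per valid method, with the wire value as discriminant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum Method {
    Ack = 0x00,
    Ping = 0x01,
    Pong = 0x02,
    Hello = 0x03,
    Provision = 0x04,
    Index = 0x05,
    Cancel = 0x06,
    Summary = 0x07,
    Error = 0x08,
    ChannelEnd = 0x09,
    Goodbye = 0x0A,
}

impl Method {
    /// Decodes the method byte; unknown values yield None.
    fn from_byte(byte: u8) -> Option<Method> {
        Some(match byte {
            0x00 => Method::Ack,
            0x01 => Method::Ping,
            0x02 => Method::Pong,
            0x03 => Method::Hello,
            0x04 => Method::Provision,
            0x05 => Method::Index,
            0x06 => Method::Cancel,
            0x07 => Method::Summary,
            0x08 => Method::Error,
            0x09 => Method::ChannelEnd,
            0x0A => Method::Goodbye,
            _ => return None,
        })
    }

    /// Encodes the method as its single wire byte.
    fn as_byte(self) -> u8 {
        self as u8
    }
}
```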

ACK Method

An ACK (acknowledgement) packet indicates that the previous request was received and executed but that no data is returned.

Ping & Pong Methods

These packets keep the TCP connection between worker and server active. Only the server may send a ping. The server sends a ping approximately every second, and the client must reply with a pong within one more second or the connection will be closed.

These packets have no attached payload.
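The timing rules can be expressed as two small checks on the server side. This is only a sketch under the stated one-second values: the constant and function names are hypothetical, and how the server schedules these checks is left open.

```rust
use std::time::{Duration, Instant};

/// The server sends a ping roughly once per second.
const PING_INTERVAL: Duration = Duration::from_secs(1);
/// The client has one more second to answer with a pong.
const PONG_TIMEOUT: Duration = Duration::from_secs(1);

/// True once at least PING_INTERVAL has passed since the last ping.
fn ping_due(last_ping: Instant, now: Instant) -> bool {
    now.saturating_duration_since(last_ping) >= PING_INTERVAL
}

/// True if a ping sent at `ping_sent` has gone unanswered for too long,
/// in which case the server would close the connection.
fn pong_overdue(ping_sent: Instant, now: Instant) -> bool {
    now.saturating_duration_since(ping_sent) > PONG_TIMEOUT
}
```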

Hello Method

The hello method is sent by clients when first establishing a connection.

This packet has no attached payload.

Provision Method
Index Method
Cancel Method
Summary Method
Error Method
Channel End Method
Goodbye Method

Payload Length

The payload length is a 2-byte unsigned integer that tells the receiver how many bytes of payload data to expect. If the receiver is unable to read enough bytes to satisfy the payload length, an error is returned to the sender.
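One way a receiver could enforce this is with the standard library's read_exact, which fails with an UnexpectedEof error when the stream ends early. The read_payload function is a hypothetical helper. Turning the failure into an Error packet for the sender is not shown.

```rust
use std::io::{self, Read};

/// Reads exactly `payload_len` bytes from the stream.
/// If fewer bytes are available, read_exact returns an error of
/// kind UnexpectedEof, which the caller can report to the sender.
fn read_payload<R: Read>(reader: &mut R, payload_len: u16) -> io::Result<Vec<u8>> {
    let mut payload = vec![0u8; payload_len as usize];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}
```

Since the length is a u16, the buffer never exceeds 65535 bytes, so allocating it up front is safe.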

Initialization Process

The initialization process defines how the client establishes the connection with the server. The primary packet methods are Hello and Provision.

After the TCP stream has been initialized, the first packet will come from the client as a Hello packet on channel 0.

After processing the message, the server has two options: it can reply with either a Provision packet or an Error packet. The Provision packet contains the 6 bytes that make up the client's assigned sender ID.

The connection is now initialized. The heartbeat cycle will start and the client will wait for a command.
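On the client side, handling the server's reply to Hello might look like the sketch below. InitOutcome and handle_init_reply are hypothetical names, and the code assumes the Provision payload consists of exactly the 6 ID bytes. The method values 0x04 (Provision) and 0x08 (Error) come from the method list.

```rust
/// Possible results of the initialization exchange, seen from the client.
#[derive(Debug, PartialEq)]
enum InitOutcome {
    /// The server assigned this sender ID.
    Provisioned([u8; 6]),
    /// The server answered with an Error packet.
    Rejected,
    /// Any other method, or a Provision payload of the wrong size.
    Unexpected,
}

/// Interprets the method byte and payload of the server's reply to Hello.
fn handle_init_reply(method: u8, payload: &[u8]) -> InitOutcome {
    match (method, payload.len()) {
        // Provision: assumed to carry exactly the 6-byte sender ID.
        (0x04, 6) => {
            let mut id = [0u8; 6];
            id.copy_from_slice(payload);
            InitOutcome::Provisioned(id)
        }
        // Error: the server refused the connection.
        (0x08, _) => InitOutcome::Rejected,
        _ => InitOutcome::Unexpected,
    }
}
```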

Ping/Pong Heartbeat

WIP