diff --git a/README.md b/README.md
index 81f4f5a..3d183fa 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,39 @@
-# map-the-internet
+# Map the Internet (MTI)
 
-A project that is designed to scrape the internet to create a visualization of the internet.
\ No newline at end of file
+A project that is designed to scrape the internet to create a visualization of the internet.
+
+## How does it work?
+![A diagram depicting how the workers communicate with the server](static/diagram.png?)
+
+The process of mapping the internet is fairly simple:
+1. The server assigns a worker a domain/list of paths to index.
+2. The worker goes through each page of respecting robots.txt.
+3. The worker makes a list of all a tags which link to an external domain.
+4. The list is sent to the server which compares each external domain to its database.
+
+The process operates in 3 stages:
+1. Initial Stage/Exponential Growth - The number of new domains is much greater than the existing domains.
+2. Saturation Stage - The number of new domains is lower than the amount of existing domains.
+3. Rescanning Stage - Existing domains are periodically rescanned looking for new links.
+
+## MTI Worker Protocol (MTIWP) Specification
+MTIWP is a TCP-based binary application layer protocol used to communicate between the workers and coordination server.
+
+The packet is made up of the following fields:
+- Version (2 Bytes)
+- Worker ID (6 Bytes)
+- Timestamp (8 Bytes)
+- Method (1 Byte)
+- Payload Length (2 Bytes)
+- Payload (0-65535 Bytes)
+
+Valid Methods:
+- ACK (0x00)
+- Ping (0x01)
+- Pong (0x02)
+- Hello (0x03)
+- Index (0x04)
+- Cancel (0x05)
+- Summary (0x06)
+
+Further Detail Goes Here
diff --git a/static/diagram.png b/static/diagram.png
new file mode 100644
index 0000000..a0b0421
Binary files /dev/null and b/static/diagram.png differ
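
The MTIWP packet layout added to the README in this diff could be packed and parsed roughly as follows. This is a minimal sketch, not the project's implementation. The field sizes and method codes come from the README, but everything else is assumed: that fields appear on the wire in the order listed, that integers are unsigned and big-endian (network byte order), that the Worker ID is six opaque bytes, and that the timestamp is a plain 64-bit integer. The names `HEADER`, `METHODS`, `pack_packet`, and `unpack_packet` are hypothetical.

```python
import struct

# Assumed wire layout (the README gives sizes only, not order or endianness):
# version (2) | worker id (6) | timestamp (8) | method (1) | payload length (2)
HEADER = struct.Struct(">H6sQBH")

# Method codes as listed in the README.
METHODS = {
    "ACK": 0x00,
    "Ping": 0x01,
    "Pong": 0x02,
    "Hello": 0x03,
    "Index": 0x04,
    "Cancel": 0x05,
    "Summary": 0x06,
}


def pack_packet(version, worker_id, timestamp, method, payload=b""):
    """Serialize one packet: fixed header followed by the payload bytes."""
    if len(worker_id) != 6:
        raise ValueError("worker_id must be exactly 6 bytes")
    if len(payload) > 0xFFFF:
        raise ValueError("payload exceeds the 2-byte length field (65535 bytes)")
    return HEADER.pack(version, worker_id, timestamp, method, len(payload)) + payload


def unpack_packet(data):
    """Parse one packet from bytes; returns (version, worker_id, timestamp, method, payload)."""
    if len(data) < HEADER.size:
        raise ValueError("data shorter than the fixed header")
    version, worker_id, timestamp, method, length = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return version, worker_id, timestamp, method, payload
```

Under these assumptions the fixed header is 19 bytes, so a Ping with a 2-byte payload occupies 21 bytes. Because TCP is a byte stream, a receiver would read the fixed header first and then read exactly as many payload bytes as the length field reports.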