Web-to-Markdown
This project converts the content of any website into a Markdown bucket.
Usage
The application provides a web UI with a single text field in the middle of the page. Enter the base URL of the website you want to extract information from; the application automatically finds every link attached to that page and extracts information from those pages as well.
To extract this information, the HTML is fetched with the trafilatura library and parsed with BeautifulSoup and html2text.
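The link-discovery step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it swaps in Python's standard-library `html.parser` for BeautifulSoup so that it runs without extra packages, and it assumes that "attached links" means links on the same host as the base URL. The names `LinkCollector` and `same_site_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(base_url, html):
    """Return absolute, de-duplicated links in `html` on base_url's host.

    Restricting to the same host is an assumption made for this sketch.
    """
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" suffixes.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).netloc == base_host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


sample = (
    '<a href="/docs">Docs</a> '
    '<a href="https://other.example/x">External</a> '
    '<a href="/docs#top">Top</a>'
)
print(same_site_links("https://example.com/", sample))
```

Each URL returned this way would then be fetched and converted to Markdown in turn; the conversion itself is left to trafilatura and html2text as stated above.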
Features
- Web Interface: a simple, efficient page for entering the root URL to extract information from.
- Multiple URL scanning: information is extracted not only from the URL entered but also from every link attached to it.
- Clear API usage: endpoint names clearly reflect what they do, which makes the API easy to use.
Requirements
Python: Python 3.11 or higher is recommended. Check your version:
python --version # or python3 --version
Python Packages: Install the required libraries via pip:
pip install -r requirements.txt
Ensure your requirements.txt file contains:
fastapi
uvicorn
requests
pydantic
trafilatura
beautifulsoup4
lxml_html_clean
html2text
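As a quick sanity check after installing, a short script along these lines can confirm that the interpreter is new enough and that each dependency is importable. It is only a sketch: the helper names `missing_modules` and `python_is_recent_enough` are hypothetical, and the list uses import names rather than package names (beautifulsoup4 is imported as `bs4`).

```python
import importlib.util
import sys

# Import names for the packages listed in requirements.txt.
# beautifulsoup4 installs the module "bs4".
REQUIRED_MODULES = [
    "fastapi", "uvicorn", "requests", "pydantic",
    "trafilatura", "bs4", "lxml_html_clean", "html2text",
]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_recent_enough(version_info=sys.version_info, minimum=(3, 11)):
    """True when the (major, minor) version meets the recommended minimum."""
    return tuple(version_info[:2]) >= minimum


print("Python 3.11+:", python_is_recent_enough())
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty "Missing modules" list means every dependency was found in the current environment.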
Installation and setup
Clone the Repository (Optional):
git clone <repository_url>
cd <repository_directory>
(Skip this if you have the files directly.)
Create index.html: Make sure the index.html file is in the same directory as main.py.
Run the Application with Uvicorn:
uvicorn main:app --reload --host 0.0.0.0 --port 8000
- main:app: Adjust if your Python file or app instance has a different name.
- --reload: Enables auto-reloading on code changes (great for development).
- --host 0.0.0.0: Allows access from other devices on your network. Remove for local-only access.
- --port 8000: Change if you need a different port.
Access the Application: Open your browser:
- Locally: http://localhost:8000
- From another device: http://<your_server_ip>:8000
API Documentation
This API provides two endpoints: one for submitting a URL and one for checking the status of the resulting Markdown bucket.
- Endpoint: /submit-url/
  - Method: POST
  - Request:
    - Content-Type: text/plain
    - Form field: (Required) The base URL to extract information from.
  - Response:
    - Success (Status Code: 200 OK):
      - Content-Type: application/json
      - The response body contains the id of the requested processing task.
    - Error (Status Codes: 400, 500):
      - Content-Type: application/json
      - Body: A JSON object containing an error description:
        { "detail": "Error message describing the problem." }
- Endpoint: /check-status/
  - Method: GET
  - Request:
    - Content-Type: text/plain
    - Form fields:
      - (Required) The base URL to extract information from.
      - (Required) The id of the previously requested task.
  - Response:
    - Success (Status Code: 200 OK):
      - Content-Type: application/json
      - The response body contains the status of the request.
    - Error (Status Codes: 400, 500):
      - Content-Type: application/json
      - Body: A JSON object containing an error description:
        { "detail": "Error message describing the problem." }
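For scripted use, the two endpoints might be called along the following lines. This is a sketch under stated assumptions rather than the project's documented client: the README does not name the request fields, so the POST body is assumed to be the raw URL sent as text/plain, and the GET parameters `url` and `task_id` are hypothetical query-string names. The server address assumes the default local setup on port 8000. Only Python's standard library is used.

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8000"  # assumed local server address


def build_submit_request(target_url, base=BASE):
    """Build a POST to /submit-url/ with the URL as a text/plain body (assumed shape)."""
    return urllib.request.Request(
        f"{base}/submit-url/",
        data=target_url.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def build_status_request(target_url, task_id, base=BASE):
    """Build a GET to /check-status/.

    The query parameter names `url` and `task_id` are hypothetical.
    """
    query = urllib.parse.urlencode({"url": target_url, "task_id": task_id})
    return urllib.request.Request(f"{base}/check-status/?{query}", method="GET")


def send(request):
    """Send a request and decode the JSON response body.

    urllib raises urllib.error.HTTPError for 400 and 500 responses; the
    error body then carries the {"detail": ...} object described above.
    """
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_submit_request("https://example.com"))` would return the JSON body containing the task id, which can then be passed to `build_status_request` to poll for the status.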
Docker Deployment
Build the Docker Image
To build the Docker image, run the following command in the project directory:
docker build -t web-to-markdown .
Run the Docker Container
To run the Docker container, use the following command:
docker run -p 8000:8000 web-to-markdown
Using Docker Compose
Alternatively, you can use Docker Compose to build and run the application. Run the following command in the project directory:
docker-compose up --build
Access the Application
Once the container is running, you can access the application in your browser:
- Locally: http://localhost:8000
- From another device: http://<your_server_ip>:8000