Language Translation Dataset

Parallel corpus dataset with sentence pairs in multiple languages

About This Dataset

This dataset contains parallel sentence pairs with aligned translations in four languages: English, Spanish, French, and German. It is intended for machine translation, multilingual NLP, and cross-lingual model training.

Multiple Language Pairs

Parallel sentences in multiple languages with aligned translations for comprehensive training.

Aligned Translations

Precisely aligned sentence pairs ensuring accurate translation model training.

Training & Validation Sets

Pre-split datasets ready for immediate use in machine learning pipelines.

Ready for Translation Models

Optimized format compatible with Transformers, mBERT, and mT5 models.

Technologies

TSV, JSON, Transformers, mBERT, mT5

How It Works - Complete Guide

Using the Translation Tool:
  1. Select Languages: Choose your source language (or use "Detect language" for auto-detection) and target language from the dropdown menus.
  2. Enter Text: Type or paste any word, phrase, or sentence in the left text box.
  3. Auto-Translation: Translation runs automatically as you type, after a 500 ms debounce delay so that a request is not sent on every keystroke.
  4. View Result: The translated text appears in the right box.
  5. Swap Languages: Click the swap button (↔) to reverse the translation direction.
  6. Copy Text: Use the copy buttons to copy input or output text to clipboard.
  7. Listen: Click the speaker icon to hear the translated text (text-to-speech).
  8. Clear: Use the X button to clear the input field.
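
The 500 ms delay in step 3 is a debounce: each new keystroke cancels the pending translation, and only the last input is translated once typing pauses. The demo itself is written in JavaScript, so the following Python version is only a minimal sketch of the idea; the Debouncer class and its method names are hypothetical and not taken from the project.

```python
import threading


class Debouncer:
    """Run `callback` only after `delay` seconds pass with no newer call.

    Hypothetical helper for illustration; not part of the project.
    """

    def __init__(self, delay, callback):
        self.delay = delay
        self.callback = callback
        self._timer = None
        self._lock = threading.Lock()

    def call(self, *args):
        with self._lock:
            # A newer keystroke cancels the translation that was pending.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.delay, self.callback, args)
            self._timer.daemon = True
            self._timer.start()
```

With Debouncer(0.5, translate_fn), calling call("h"), call("he"), and call("hello") in quick succession would invoke translate_fn once, with "hello", about half a second after the last call.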

Supported Languages: English, Spanish, French, German

Three-Tier Translation System:
Tier 1: Local Dictionary

First, the system checks the local dictionary with 1,983 translation entries. This works completely offline!

  • Instant translation
  • No internet required
  • Exact phrase matching
Tier 2: Word-by-Word

If no exact match is found, the system translates word by word using the local dictionary.

  • Better coverage
  • Handles new combinations
  • Still works offline
Tier 3: API Fallback

As a last resort, the system calls the MyMemory Translation API for real-time translation.

  • Requires internet
  • Handles any text
  • Real-time translation
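
A rough sketch of what the API fallback might look like is shown here. The endpoint URL, the q and langpair query parameters, and the responseData.translatedText field reflect the public MyMemory API as commonly documented, but they are assumptions as far as this project is concerned and should be checked against the current MyMemory documentation. The function names and the two-letter language codes are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public endpoint; verify against the MyMemory documentation.
MYMEMORY_ENDPOINT = "https://api.mymemory.translated.net/get"


def build_mymemory_url(text, source, target):
    """Build the request URL, e.g. langpair 'en|es' for English to Spanish."""
    query = urlencode({"q": text, "langpair": f"{source}|{target}"})
    return f"{MYMEMORY_ENDPOINT}?{query}"


def parse_mymemory_response(body):
    """Extract the translated text from a JSON response body (assumed shape)."""
    payload = json.loads(body)
    return payload["responseData"]["translatedText"]


def translate_with_api(text, source, target, timeout=10):
    """Perform the network call; requires internet access."""
    url = build_mymemory_url(text, source, target)
    with urlopen(url, timeout=timeout) as response:
        return parse_mymemory_response(response.read().decode("utf-8"))
```

Keeping URL construction and response parsing separate from the network call makes the two pure functions easy to check offline.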

Translation Status: The footer shows which method was used (Local Dictionary, Word-by-word, or API).
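
The three tiers can be sketched as a single lookup function that returns both the translation and the status label shown in the footer. This is an illustrative sketch, not the project's translateText() implementation. It assumes a dictionary keyed by a pair code such as "en-es", assumes Tier 2 succeeds only when every word is found, and adds a hypothetical "Unavailable" status for the case where no API callable is supplied.

```python
def translate(text, pair, dictionary, api_translate=None):
    """Return (translation, status) using the three tiers in order.

    `dictionary` maps a pair code (assumed form: "en-es") to a
    {source phrase: target phrase} mapping. `api_translate` is an
    optional callable (text, pair) -> str standing in for the API tier.
    """
    entries = dictionary.get(pair, {})
    key = text.strip().lower()

    # Tier 1: exact phrase match in the local dictionary.
    if key in entries:
        return entries[key], "Local Dictionary"

    # Tier 2: word-by-word, assumed to apply only if every word is known.
    words = key.split()
    if words and all(word in entries for word in words):
        return " ".join(entries[word] for word in words), "Word-by-word"

    # Tier 3: fall back to the API when one is available.
    if api_translate is not None:
        return api_translate(text, pair), "API"

    # Hypothetical status for offline use with no match.
    return None, "Unavailable"
```

Note that word-by-word output ignores grammar and word order, which is why the exact-match tier is tried first.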

Comprehensive Offline Translation Dictionary

The local dictionary contains 1,983 translation entries covering:

Content Categories:
  • Greetings & Common Phrases
  • Numbers & Dates
  • Days of Week & Months
  • Food & Drinks
  • Family & Relationships
  • Colors & Descriptions
  • Time & Places
  • Actions & Verbs
  • Technology Terms
  • Travel & Transportation
  • Business & Education
  • And much more!
Language Pairs (6 bidirectional pairs, 12 translation directions):
  • English ↔ Spanish
  • English ↔ French
  • English ↔ German
  • Spanish ↔ French
  • Spanish ↔ German
  • French ↔ German
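
The pair count follows directly from the four supported languages: choosing two of four gives 6 unordered pairs, and counting each direction separately gives 12. A quick check:

```python
from itertools import combinations, permutations

LANGUAGES = ["English", "Spanish", "French", "German"]

# Unordered pairs (English <-> Spanish counts once).
bidirectional_pairs = list(combinations(LANGUAGES, 2))

# Ordered pairs (English -> Spanish and Spanish -> English count separately).
translation_directions = list(permutations(LANGUAGES, 2))

print(len(bidirectional_pairs))     # 6
print(len(translation_directions))  # 12
```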

File Location: data/local_dictionary.json

Format: JSON with nested dictionaries for each language pair
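
The exact key layout of the file is not documented here, so the following is only a guess at what "nested dictionaries for each language pair" could look like. The pair codes ("en-es", "es-en") and the sample entries are hypothetical; inspect data/local_dictionary.json for the real structure.

```python
import json

# Hypothetical sample: outer keys are language pairs, inner keys are
# source phrases mapped to their translations.
SAMPLE = """
{
  "en-es": {"hello": "hola", "thank you": "gracias"},
  "es-en": {"hola": "hello"}
}
"""


def load_dictionary(raw_json):
    """Parse dictionary JSON text into nested Python dicts."""
    return json.loads(raw_json)


def load_dictionary_file(path="data/local_dictionary.json"):
    """Read the dictionary from disk (path taken from the document)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_entries(dictionary):
    """Total number of phrase entries across all language pairs."""
    return sum(len(entries) for entries in dictionary.values())
```

If the real file follows this shape, count_entries(load_dictionary_file()) should match the entry count stated for the dictionary.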

Core Features:
  • Real-time Translation: Translates as you type (500ms debounce)
  • Auto Language Detection: Automatically detects source language
  • Offline Support: Works without internet using local dictionary
  • Word-by-Word Translation: Handles phrases not in exact dictionary
  • Character Counter: Shows the current count against the 5,000-character limit (for example, 0/5000) and warns when the limit is reached
  • Copy to Clipboard: Easy copy buttons for input/output
  • Text-to-Speech: Listen to translations in target language
  • Language Swap: One-click language direction reversal
Advanced Features:
  • Smart Matching: Handles punctuation and case variations
  • Status Indicators: Shows translation source (Local/API)
  • Error Handling: Graceful fallbacks if translation fails
  • Responsive Design: Works on desktop, tablet, and mobile
  • Google Translate-style UI: Familiar, user-friendly interface
  • Toast Notifications: Visual feedback for user actions
  • Loading States: Shows spinner during translation
  • Keyboard Shortcuts: Ctrl+Enter to translate
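
"Smart Matching" and the character limit can be illustrated with a small normalization step applied before dictionary lookup. How the project actually normalizes input is not specified, so this is one plausible sketch: the function names are hypothetical, and stripping only leading and trailing punctuation is an assumption.

```python
import string

MAX_CHARS = 5000  # limit stated in the feature list

# ASCII punctuation plus the inverted marks used in Spanish.
EDGE_PUNCTUATION = string.punctuation + "¿¡"


def normalize(text):
    """Lowercase, trim edge punctuation, and collapse repeated spaces."""
    stripped = text.strip().strip(EDGE_PUNCTUATION)
    return " ".join(stripped.lower().split())


def exceeds_limit(text, limit=MAX_CHARS):
    """True when the input is longer than the allowed character count."""
    return len(text) > limit
```

With this normalization, "Hello!!", "hello", and "  HELLO " all map to the same lookup key.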

Complete Dataset Structure:
File                              Format      Entries  Purpose
train.json / train.tsv            JSON / TSV  50       Training dataset with parallel sentences
validation.json / validation.tsv  JSON / TSV  5        Validation dataset for model testing
sample_data.json                  JSON        15       Sample data for preview/demo
local_dictionary.json             JSON        1,983    Comprehensive offline translation dictionary

Total Translation Entries: 2,053 (50 + 5 + 15 + 1,983)

Languages Covered: English, Spanish, French, German (4 languages, 6 bidirectional pairs, 12 translation directions)

Technology Stack:
Frontend:
  • HTML5: Semantic markup
  • CSS3: Custom styling with Google Translate-inspired design
  • JavaScript (ES6+): Vanilla JS, no frameworks
  • Bootstrap 5: Responsive grid and components
  • Font Awesome 6: Icons
Backend/Data:
  • JSON: Data storage format
  • TSV: Tab-separated values for easy processing
  • Python 3: Data processing scripts
  • MyMemory API: Free translation API fallback
  • Web Speech API: Text-to-speech functionality
Key Functions:
  • loadLocalDictionary() - Loads offline dictionary
  • loadTranslationData() - Loads dataset translations
  • translateText() - Main translation function
  • translateWordByWord() - Word-by-word translation
  • translateWithAPI() - API fallback translation
  • detectLanguage() - Auto language detection
  • handleInput() - Auto-translate on input

Tips for Best Results:
  1. Use Complete Sentences: Full sentences translate better than single words
  2. Check Status Indicator: See if translation came from local dictionary (faster) or API
  3. Offline Mode: Most common phrases work offline - no internet needed!
  4. Language Detection: Use "Detect language" if unsure of source language
  5. Character Limit: Maximum 5,000 characters per translation
  6. Copy & Paste: Easy copy buttons for both input and output
  7. Listen Feature: Use speaker icon to hear pronunciation
  8. Swap Languages: Quickly reverse translation direction
Common Use Cases:
  • Learning new languages
  • Quick phrase translation
  • Understanding foreign text
  • Travel communication
  • Language practice
  • Document translation

About This Project:

Project ID: 25

Category: Text Data

Difficulty: Advanced

Year: 2016

Technologies: TSV, JSON, Transformers, mBERT, mT5

Project Structure:
language-translation/
├── data/              # Dataset files (JSON, TSV)
├── scripts/           # Python processing scripts
├── examples/          # Usage examples
├── index.html         # Main demo page
└── Documentation/     # README, SETUP, etc.
Available Scripts:
  • process_data.py - Process and convert datasets
  • convert_format.py - Convert between TSV and JSON
  • analyze_dataset.py - Analyze dataset statistics
  • download_translation_data.py - Download from public sources
  • build_local_dictionary.py - Build local dictionary
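
To give a sense of what a TSV/JSON conversion involves, here is a minimal sketch using only the standard library. It is not the code of convert_format.py; the function names are hypothetical, and it assumes the TSV has a header row and the JSON is a list of flat objects with the same keys.

```python
import csv
import io
import json


def tsv_to_json(tsv_text):
    """Convert header-first TSV text to a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return json.dumps(list(reader), ensure_ascii=False, indent=2)


def json_to_tsv(json_text):
    """Convert a JSON array of flat objects back to TSV text."""
    rows = json.loads(json_text)
    if not rows:
        return ""
    output = io.StringIO()
    writer = csv.DictWriter(
        output,
        fieldnames=list(rows[0].keys()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return output.getvalue()
```

Passing ensure_ascii=False keeps accented characters such as "é" or "ñ" readable in the JSON output instead of escaping them.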

Created by: RSK World

Website: https://rskworld.in

Email: help@rskworld.in

Phone: +91 93305 39277


Dataset Preview

ID English Spanish French German
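
Given the column names in the preview header above, sentence pairs for any one translation direction can be pulled out of the multi-column rows. The sketch below assumes the TSV files carry a header row with exactly these column names, which is not confirmed by the document; the sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Illustrative rows only; not actual dataset content.
SAMPLE_TSV = (
    "ID\tEnglish\tSpanish\tFrench\tGerman\n"
    "1\tHello\tHola\tBonjour\tHallo\n"
    "2\tThank you\tGracias\tMerci\tDanke\n"
)


def read_rows(tsv_text):
    """Parse TSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def sentence_pairs(rows, source, target):
    """Return (source, target) tuples, skipping rows with empty cells."""
    return [
        (row[source], row[target])
        for row in rows
        if row.get(source) and row.get(target)
    ]
```

For example, sentence_pairs(read_rows(SAMPLE_TSV), "English", "German") yields the English-German pairs in row order, ready to feed into a tokenizer or training loop.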

Dataset Features

  • Parallel sentences
  • Multiple language pairs
  • Aligned translations
  • Training and validation sets
  • Ready for translation models