Day 7 - Developing a Neural Machine Translation System from Scratch Part 1



Hello guys,

This is day 7 of my #100DayOfMLCode challenge. Today, I am planning to build a neural machine translation system that translates from German to English and from English to German.

Steps involved in this project:

  1. German to English Translation Dataset
  2. Preparing the Text Data
  3. Train Neural Translation Model
  4. Evaluate Neural Translation Model

Python Environment

This project requires a Python 3 SciPy environment.
You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.
The tutorial also assumes you have NumPy and Matplotlib installed.
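To confirm the environment before going further, a quick version check can help. This is only a minimal sketch: the helper name installed_versions is my own, and it assumes the packages were installed under their usual distribution names (scipy, numpy, matplotlib, keras, tensorflow).

```python
from importlib import metadata


def installed_versions(names):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in names:
        try:
            # Reads the installed package metadata without importing the package.
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The package list below is an assumption based on the requirements listed above.
for name, version in installed_versions(
    ["scipy", "numpy", "matplotlib", "keras", "tensorflow"]
).items():
    print(name, version if version is not None else "not installed")
```

If Keras is reported as not installed, or its version is below 2.0, fix that first before moving on.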

German to English Translation Dataset

We will use a dataset of German to English terms used as the basis for flashcards for language learning.
The dataset is available from the ManyThings.org website, with examples drawn from the Tatoeba Project. It consists of German phrases and their English counterparts and is intended to be used with the Anki flashcard software.
The page provides a list of many language pairs, and I encourage you to explore other languages:
The dataset we will use in this tutorial is available for download here:
Download the dataset to your current working directory and decompress it.
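Once the file is decompressed, loading it could look like the sketch below. I am assuming here that each line holds an English phrase and its German translation separated by a tab, possibly followed by extra columns, and that the file is UTF-8 encoded. The function name load_pairs and the file name in the usage comment are placeholders of mine, so adjust them to whatever you actually downloaded.

```python
from pathlib import Path


def load_pairs(path):
    """Read a tab-separated phrase file into a list of (english, german) tuples."""
    pairs = []
    text = Path(path).read_text(encoding="utf-8")
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            # Skip blank or malformed lines that do not contain a pair.
            continue
        # Keep only the first two columns; any further columns are ignored.
        pairs.append((fields[0].strip(), fields[1].strip()))
    return pairs


# Usage (the file name is an assumption):
# pairs = load_pairs("deu.txt")
# print(len(pairs), pairs[:3])
```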

Preparing the Text Data

The next step is to prepare the text data for modeling.
Take a look at the raw data and note what you see that we might need to handle in a data cleaning operation.
For example, here are some observations I noted from reviewing the raw data:
  • There is punctuation.
  • The text contains uppercase and lowercase.
  • There are special characters in the German.
  • There are duplicate phrases in English with different translations in German.
  • The file is ordered by sentence length with very long sentences toward the end of the file.
Did you note anything else that could be important?
Let me know in the comments below.
A good text cleaning procedure may handle some or all of these observations.
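As a rough idea of what handling these observations could look like, here is a minimal sketch of a cleaning function. It is one possible choice, not the only one: it lowercases the text, removes ASCII punctuation, and collapses extra whitespace. The to_ascii flag is my own addition for the special characters. It decomposes accented letters and drops whatever is not ASCII, which turns "ü" into "u" but also deletes "ß" entirely, so it is off by default.

```python
import string
import unicodedata

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_line(line, to_ascii=False):
    """Lowercase a phrase, strip punctuation, and normalize whitespace."""
    if to_ascii:
        # Split accented letters into base letter + accent, then drop non-ASCII.
        line = unicodedata.normalize("NFD", line)
        line = line.encode("ascii", "ignore").decode("ascii")
    line = line.lower()
    line = line.translate(_PUNCT_TABLE)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return " ".join(line.split())


print(clean_line("Hi, Tom!"))
print(clean_line("Für dich."))
print(clean_line("Für dich.", to_ascii=True))
```

Whether to keep the German characters or fold them to ASCII is a design decision. Keeping them preserves meaning, while folding them gives a smaller vocabulary.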
Data preparation is divided into two subsections:
  1. Clean Text
  2. Split Text
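For the split step, a sketch could look like the following. Because the file is ordered by sentence length, I shuffle the pairs before splitting so that the test set is not made up only of the longest sentences. The function name split_pairs, the 10% test fraction, and the fixed seed are all assumptions of mine.

```python
import random


def split_pairs(pairs, test_fraction=0.1, seed=1):
    """Shuffle the pairs and return (train, test) lists."""
    shuffled = list(pairs)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Usage with a few made-up pairs:
sample = [("hi", "hallo"), ("run", "lauf"), ("wow", "potzdonner"), ("go", "geh")]
train, test = split_pairs(sample, test_fraction=0.25)
print(len(train), len(test))
```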

That's it for the first part. Thanks for reading.