Day 7 - Developing a Neural Machine Translation System from Scratch Part 1



Hello guys,

This is day 7 of my #100DaysOfMLCode challenge. Today, I am planning to build a Neural Machine Translation System that translates German to English and English to German.

Steps involved in this project:

  1. German to English Translation Dataset
  2. Preparing the Text Data
  3. Train Neural Translation Model
  4. Evaluate Neural Translation Model

Python Environment

This project requires a Python 3 SciPy environment.
You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.
The tutorial also assumes you have NumPy and Matplotlib installed.
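A quick way to confirm the environment is ready is to look up the installed version of each package. This is only a minimal sketch: the helper name installed_versions and the REQUIRED list are my own, and it reads package metadata rather than importing anything, so it does not check which Keras backend is configured.

```python
from importlib import metadata

# Packages this post assumes are installed (names as published on PyPI).
REQUIRED = ["scipy", "numpy", "matplotlib", "keras"]


def installed_versions(packages=REQUIRED):
    """Return {package: version string}, or None for a package that is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(name, version if version is not None else "NOT INSTALLED")
```

Anything reported as NOT INSTALLED needs to be installed before moving on. For Keras, also check that the reported version is 2.0 or higher.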

German to English Translation Dataset

We will use a dataset of German to English terms used as the basis for flashcards for language learning.
The dataset is available from the ManyThings.org website, with examples drawn from the Tatoeba Project. It consists of German phrases and their English counterparts and is intended to be used with the Anki flashcard software.
The page provides a list of many language pairs, and I encourage you to explore other languages:
The dataset we will use in this tutorial is available for download here:
Download the dataset to your current working directory and decompress.
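Once the file is decompressed, it can be read into a list of sentence pairs. The sketch below assumes each line holds an English phrase and its German translation separated by a tab, with English first; any extra columns are ignored. The function names parse_pairs and load_pairs are hypothetical, and the path you pass in is whatever the decompressed file is called on your machine.

```python
from pathlib import Path


def parse_pairs(text):
    """Split raw text into (english, german) tuples, one per line.

    Assumes tab-separated columns with English first (an assumption
    about the file layout); lines without a tab are skipped.
    """
    pairs = []
    for line in text.splitlines():
        columns = line.split("\t")
        if len(columns) < 2:
            continue
        pairs.append((columns[0], columns[1]))
    return pairs


def load_pairs(path):
    """Read a UTF-8 text file and parse it into sentence pairs."""
    return parse_pairs(Path(path).read_text(encoding="utf-8"))


# Tiny made-up sample in the assumed layout, just to show the shape of the result.
sample = "Hi.\tHallo!\nRun!\tLauf!\n"
print(parse_pairs(sample))
```

Reading the file as UTF-8 matters here because the German text contains characters outside plain ASCII.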

Preparing the Text Data

The next step is to prepare the text data for modeling.
Take a look at the raw data and note what you see that we might need to handle in a data cleaning operation.
For example, here are some observations I made while reviewing the raw data:
  • There is punctuation.
  • The text contains uppercase and lowercase.
  • There are special characters in the German.
  • There are duplicate phrases in English with different translations in German.
  • The file is ordered by sentence length with very long sentences toward the end of the file.
Did you note anything else that could be important?
Let me know in the comments below.
A good text cleaning procedure may handle some or all of these observations.
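To make the observations above concrete, here is one possible cleaning function. It is only a sketch of one set of choices, not the final procedure: it reduces special characters to plain ASCII, lowercases everything, strips punctuation, and drops tokens that are not purely alphabetic. The name clean_sentence is my own. Note that the ASCII step is lossy for German: an umlaut such as ü becomes u, and ß is dropped entirely.

```python
import string
import unicodedata

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_sentence(sentence):
    """Lowercase, strip punctuation, and reduce a sentence to ASCII letters."""
    # Decompose accented characters (e.g. u + combining diaeresis),
    # then drop everything that is not plain ASCII.
    normalized = unicodedata.normalize("NFD", sentence)
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
    # Lowercase and split on whitespace.
    words = ascii_text.lower().split()
    # Remove punctuation from each token.
    words = [word.translate(_PUNCT_TABLE) for word in words]
    # Keep only tokens made entirely of letters (drops numbers and empties).
    return " ".join(word for word in words if word.isalpha())


print(clean_sentence("Hi, Tom!"))
print(clean_sentence("Wie geht's?"))
```

If keeping the German special characters matters for your use case, skip the ASCII step and only lowercase and strip punctuation.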
Data preparation is divided into two subsections:
  1. Clean Text
  2. Split Text
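For the split step, one simple approach is to shuffle the cleaned pairs and hold out a fraction for testing. Shuffling matters because the file is ordered by sentence length, so an unshuffled split would put all the long sentences in one set. This is a sketch under my own assumptions: the function name split_pairs, the 10% test fraction, and the seed are illustrative values, not fixed choices for this project.

```python
import random


def split_pairs(pairs, test_fraction=0.1, seed=1):
    """Shuffle sentence pairs and split them into (train, test) lists."""
    shuffled = list(pairs)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up placeholder pairs, just to show the sizes of the two sets.
example = [("english %d" % i, "german %d" % i) for i in range(20)]
train, test = split_pairs(example)
print(len(train), len(test))
```

Using a fixed seed keeps the split the same between runs, which makes later evaluation results comparable.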

That's all for the first part. Thanks for reading!