rjust() and ljust() with a specific encoding (I made a mistake with the encoding)

Today's task was to make Odoo process a txt file, send it through an external program, then receive the file back and save it as an attachment in the system.

Requirements

- The content of the file mixes several kinds of characters: Thai, English, digits, and special characters
- The content is split into fixed-width fields, each with a specified amount of spacing and a character count (width)

Given that task, I picked the Python string methods rjust and ljust,
depending on whether the content has to sit flush left or flush right.

If you are not familiar with rjust and ljust yet, they are built-in methods of Python's str type
that pad a string until it reaches the width you ask for.
By default they pad with spaces, or you can pass any fill character you like.
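 ## ljust
 x = 'apple'
 x = x.ljust(10)  # str is immutable, so keep the returned value
 print(x + 'PIE')
 # apple     PIE --> left-aligned and padded with spaces up to 10 characters
 y = 'apple'
 y = y.ljust(10, '-')
 print(y)
 # apple-----
 ## rjust
 z = 'yellow'
 z = z.rjust(10, '-')
 print(z)
 # ----yellow
 a = 'JORVOR OR JV'
 a = a.rjust(6)  # width is smaller than len(a), so nothing changes
 print(a)
 # JORVOR OR JV
 print(a[:6])  # truncation needs an explicit slice
 # JORVOR

For a fixed-width field, two things have to happen to reach the required character count:

- Pad the text up to the required count. This is what rjust and ljust do.
- Truncate the text when it is longer than the required count. rjust and ljust do not do this (they return a longer string unchanged), so it needs a slice such as a[:6].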
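
A minimal sketch of doing both steps on a plain str, assuming the width is counted in characters: a slice handles the truncation and ljust/rjust handle the padding. The helper names `fit_left` and `fit_right` are hypothetical and are not part of the original code.

```python
def fit_left(text: str, width: int, fill: str = " ") -> str:
    """Left-align text in exactly `width` characters (hypothetical helper)."""
    # Slice first so long input is cut, then pad so short input is filled.
    return text[:width].ljust(width, fill)


def fit_right(text: str, width: int, fill: str = "0") -> str:
    """Right-align text in exactly `width` characters (hypothetical helper)."""
    return text[:width].rjust(width, fill)


print(repr(fit_left("apple", 10)))        # 'apple     '
print(repr(fit_right("42", 6)))           # '000042'
print(repr(fit_left("JORVOR OR JV", 6)))  # 'JORVOR'
```

Slicing before padding keeps the result at exactly `width` characters whether the input is too short or too long.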


That all cut the fields correctly, until
the content of the file I was processing turned out to mix English, Thai, and special characters, and the file sent out for external processing also had to be encoded as TIS-620.
That is where the fun began.
Up to that point rjust and ljust had looked correct, but once the languages were all mixed together, the character counting started to go wrong.
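
To see where the counts can drift apart, it may help to compare the length of a str with the length of its encoded bytes. This is only a sketch with a made-up sample string, not the real file content. TIS-620 stores every character in one byte, while UTF-8 needs three bytes for each Thai character, so a width counted on the str matches the TIS-620 byte count but not the UTF-8 one.

```python
text = "สวัสดี ABC 123"  # made-up sample: 6 Thai characters + 8 ASCII characters

print(len(text))                    # 14 characters (code points)
print(len(text.encode("tis-620")))  # 14 bytes: one byte per character
print(len(text.encode("utf-8")))    # 26 bytes: three bytes per Thai character

# Padding counts characters, so the UTF-8 byte width overshoots the target.
padded = text.ljust(20)
print(len(padded.encode("tis-620")))  # 20
print(len(padded.encode("utf-8")))    # 32
```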

In the end I wrote functions that truncate and pad the text after first encoding it in the target encoding.
They became the improved versions of ljust and rjust, as follows:
ljust_limit()

    def ljust_limit(self, value: str, limit: int, special_char: str = " ") -> str:
        """
        Left-justify a string in a field of exactly `limit` bytes, truncating or padding on the right as needed.

        Args:
            value (str): The input string.
            limit (int): The field width in bytes (for tis-620).
            special_char (str): The character used for padding.

        Returns:
            str: The adjusted string.
        """
        if value:
            # Encode the string to tis-620 to calculate byte length
            encoded = value.encode('tis-620')

            # Handle cases where the byte length exceeds the limit
            if len(encoded) > limit:
                truncated = encoded[:limit]  # Truncate to the byte limit
                # Decode back to a string, ignoring incomplete characters
                return truncated.decode('tis-620', errors='ignore')
            else:
                # Calculate padding for the tis-620 byte case
                remaining_bytes = limit - len(encoded)
                padding = special_char.encode('tis-620') * remaining_bytes
                # Decode back to a string and pad on the right
                return value + padding.decode('tis-620')
        else:
            # For empty strings, pad with special_char to meet the limit
            return special_char * limit


rjust_limit()

    def rjust_limit(self, value: str, limit: int, special_char: str = "0") -> str:
        """
        Right-justify a string in a field of exactly `limit` bytes, truncating or padding on the left as needed.

        Args:
            value (str): The input string.
            limit (int): The maximum length in characters or bytes (for tis-620).
            special_char (str): The character used for padding.

        Returns:
            str: The adjusted string.
        """
        if value:
            # Encode the string to tis-620 to calculate byte length
            encoded = value.encode('tis-620')

            # Handle cases where the byte length exceeds the limit
            if len(encoded) > limit:
                truncated = encoded[:limit]  # Truncate to the byte limit
                # Decode back to a string, ignoring incomplete characters
                return truncated.decode('tis-620', errors='ignore')
            else:
                # Calculate padding for the tis-620 byte case
                remaining_bytes = limit - len(encoded)
                padding = special_char.encode('tis-620') * remaining_bytes
                # Decode back to a string and pad on the left
                return padding.decode('tis-620') + value
        else:
            # For empty strings, pad with special_char to meet the limit
            return special_char * limit
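
As a usage sketch, the same idea can be tried outside of an Odoo model with a standalone helper. Everything here is an assumption for illustration: the helper name `fit_bytes`, the record layout, and the sample values are made up, and the fill character is assumed to be a single character that TIS-620 can encode.

```python
ENCODING = "tis-620"


def fit_bytes(value: str, limit: int, fill: str = " ", align: str = "left") -> str:
    """Fit value into exactly `limit` encoded bytes (hypothetical helper)."""
    # Truncate on the encoded bytes, then pad up to the byte limit.
    encoded = (value or "").encode(ENCODING)[:limit]
    padding = fill.encode(ENCODING) * (limit - len(encoded))
    joined = encoded + padding if align == "left" else padding + encoded
    return joined.decode(ENCODING)


# Hypothetical record layout: name (20 bytes), amount (10 bytes), note (8 bytes)
record = (
    fit_bytes("สมชาย ใจดี", 20)
    + fit_bytes("1500", 10, fill="0", align="right")
    + fit_bytes("ค่าบริการรายเดือน", 8)
)

print(len(record.encode(ENCODING)))  # 38
```

Because TIS-620 is a single-byte encoding, truncating on bytes never splits a character in half, although it can still separate a Thai base character from the vowel or tone mark that follows it.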



But that was not the end of it.
The task, again, is: read the content from the txt file -> process it -> send it out -> receive the file back and save it.
Just processing the text with an explicit encoding still gave wrong results.

The other spot I had missed was where the file is opened, before the text goes through
the improved ljust and rjust functions. I had written it like this (note that mode "w+" also truncates an existing file, so "r" is the safer mode when the file is only being read):

 txt_file = open(filename, "w+")


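As a result the file was still processed with hit-or-miss character counts.
Then it finally clicked:
if the processing is done in TIS-620, do not forget to open the file with the matching encoding as well. Without an encoding argument, open() falls back to the platform's default encoding, which is often UTF-8.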

 txt_file = open(filename, "w+", encoding='tis-620')
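
A small round trip can make the difference visible. This is a sketch under assumptions: the temporary path and the sample line are made up, and plain "w" and "r" modes are used here instead of "w+".

```python
import os
import tempfile

line = "สวัสดี".ljust(10) + "END"  # 13 characters

# Hypothetical temporary path, used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Write with an explicit encoding so each character becomes one byte on disk
with open(path, "w", encoding="tis-620") as f:
    f.write(line)

print(os.path.getsize(path))  # 13

# Read back with the same encoding to get the same text
with open(path, "r", encoding="tis-620") as f:
    assert f.read() == line

# The same text as UTF-8 takes more bytes, so fixed byte offsets would shift
print(len(line.encode("utf-8")))  # 25
```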


What I learned

- `ljust` and `rjust` work on str characters and know nothing about the target encoding, so on their own they did not handle my Thai text that well (or maybe I just took the long way around, hahaha)
- Whatever encoding you process the text in, open the file with that same encoding too


>_JV